250217 Step-Video-T2V Reading & Porting

Introduction

Reading the Step-Video-T2V code (git id d3ca3d6) and porting it to Ascend.

Framework highlights

api/call_remote_server.py demonstrates a compute setup that splits the VAE and the text encoder onto separate GPUs as standalone services.
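
To make the split concrete, here is a minimal sketch of the client side of that pattern, assuming a worker process that owns its own GPU and exchanges pickled tensors over HTTP; the endpoint path, port, and payload keys are illustrative, not the repo's actual routes.

import asyncio
import pickle
import aiohttp

async def call_worker(url: str, payload: dict):
    # Serialize the request with pickle, POST it to the remote worker,
    # and unpickle whatever tensors the worker sends back.
    async with aiohttp.ClientSession() as session:
        async with session.post(url, data=pickle.dumps(payload)) as resp:
            return pickle.loads(await resp.read())

async def main():
    # Hypothetical caption endpoint; the DiT process never loads the
    # text encoder itself, it just asks this service for embeddings.
    data = await call_worker("http://127.0.0.1:8080/caption",
                             {"prompts": ["a cat surfing a wave"]})
    print(data.keys())  # e.g. y / y_mask / clip_embedding, cf. embedding() below

asyncio.run(main())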

JSON configs

step_llm (text encoder)

{
  "_name_or_path": "/mnt/shared-storage/tenant/opensource/step_llm",
  "allow_transformer_engine": false,
  "architectures": [
    "Step1Model"
  ],
  "attention_dropout": 0.0,
  "attention_impl": "GQA",
  "base_batch_size": 128,
  "embedding_weights_in_fp32": false,
  "ffn_hidden_size": 16896,
  "fp32_residual_connection": false,
  "hidden_dropout": 0.0,
  "hidden_size": 6144,
  "kv_channels": 128,
  "layernorm_epsilon": 1e-05,
  "max_position_embeddings": 16384,
  "num_attention_groups": 8,
  "num_attention_heads": 48,
  "num_layers": 48,
  "orig_vocab_size": 65536,
  "overlap_p2p_comm": true,
  "padded_vocab_size": 65536,
  "params_dtype": "torch.bfloat16",
  "seq_length": 16384,
  "swiglu_recompute_silu_dot": true,
  "tokens_to_generate": 512,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.48.3",
  "use_flash_attn": true,
  "virtual_pipeline_model_parallel_size": 3
}
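
The GQA geometry in this config is internally consistent; a quick arithmetic check over the values above (plain Python, no model code involved):

hidden_size = 6144
num_attention_heads = 48
num_attention_groups = 8   # KV-head groups for GQA
kv_channels = 128

head_dim = hidden_size // num_attention_heads
assert head_dim == kv_channels == 128            # per-head dim matches kv_channels
assert num_attention_heads % num_attention_groups == 0
print(num_attention_heads // num_attention_groups)  # 6 query heads share each KV head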

transformer

{
  "_class_name": "StepVideoModel",
  "_diffusers_version": "0.31.0",
  "attention_head_dim": 128,
  "attention_type": "parallel",
  "caption_channels": [
    6144,
    1024
  ],
  "dropout": 0.0,
  "in_channels": 64,
  "norm_elementwise_affine": false,
  "norm_eps": 1e-06,
  "norm_type": "ada_norm_single",
  "num_attention_heads": 48,
  "num_layers": 48,
  "out_channels": 64,
  "patch_size": 1,
  "use_additional_conditions": false
}
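
Note how this config ties back to the text encoders: the DiT inner width equals the stepllm hidden size, while the CLIP stream (1024) must be projected up. A small consistency check on the values above:

attention_head_dim = 128
num_attention_heads = 48
caption_channels = [6144, 1024]   # [stepllm, HunyuanClip]

inner_dim = num_attention_heads * attention_head_dim
assert inner_dim == 6144 == caption_channels[0]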

flow matching scheduler

{
  "_class_name": "FlowMatchDiscreteScheduler",
  "_diffusers_version": "0.31.0",
  "device": null,
  "num_train_timesteps": 1000,
  "reverse": false,
  "solver": "euler"
}
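
"solver": "euler" suggests plain Euler integration of the learned velocity field. A minimal sketch of one such step, assuming the common convention x_next = x + (sigma_next - sigma) * v_pred; the actual FlowMatchDiscreteScheduler may differ in sign or normalization.

import torch

def euler_step(x: torch.Tensor, v_pred: torch.Tensor,
               sigma: float, sigma_next: float) -> torch.Tensor:
    # v_pred is the model's predicted velocity at noise level sigma
    return x + (sigma_next - sigma) * v_pred

x = torch.randn(1, 64, 8, 8)   # fake latent
v = torch.randn_like(x)
x = euler_step(x, v, sigma=1.0, sigma_next=0.98)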

VAE

  • from stepvideo.vae.vae import AutoencoderKL
  • The VAE is roughly 1-2 GB^3
  • How does it plug into the DiT pipeline? How to swap it in for OSP1.5 step by step?
def decode_vae(self, samples):
    samples = asyncio.run(self.vae(samples.cpu()))
    return samples
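
The asyncio.run here hides a remote call to the VAE service. For the port, a local variant could bypass the worker entirely; this sketch assumes self.vae is (or wraps) the AutoencoderKL above with a decode() method, and self.vae_dtype is a hypothetical attribute.

import torch

def decode_vae_local(self, samples):
    # Decode latents in-process instead of round-tripping through HTTP
    with torch.no_grad():
        samples = self.vae.decode(samples.to(self.vae_dtype))
    return samples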

On enabling the dual path

use_conv_shortcut
version == 2  # does this enable it?
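
use_conv_shortcut is presumably the classic ResnetBlock shortcut switch from the VQGAN/ldm VAE lineage; whether version == 2 gates it here is the open question above, so this sketch only shows the standard pattern.

import torch.nn as nn

class ResnetBlockShortcut(nn.Module):
    def __init__(self, in_channels, out_channels, use_conv_shortcut=False):
        super().__init__()
        self.shortcut = None
        if in_channels != out_channels:
            if use_conv_shortcut:
                # 3x3 conv shortcut: the "dual path" carries spatial context
                self.shortcut = nn.Conv2d(in_channels, out_channels, 3, padding=1)
            else:
                # 1x1 shortcut: channel mixing only
                self.shortcut = nn.Conv2d(in_channels, out_channels, 1)

    def forward(self, x, h):
        # h is the residual branch output with out_channels channels
        if self.shortcut is not None:
            x = self.shortcut(x)
        return x + h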

text encoder

  • stepllm (~40 GB) and HunyuanClip (4 GB)^3 are fused into self.caption
  • The prompt goes through encode_prompt to produce encoder_hidden_states, which then passes through caption_projection (see the projection sketch after the code below)
def build_llm(self, model_dir):
    from stepvideo.text_encoder.stepllm import STEP1TextEncoder
    text_encoder = STEP1TextEncoder(model_dir, max_length=320).to(dtype).to(device).eval()
    print("Initialized text encoder...")
    return text_encoder

def build_clip(self, model_dir):
    from stepvideo.text_encoder.clip import HunyuanClip
    clip = HunyuanClip(model_dir, max_length=77).to(device).eval()
    print("Initialized clip encoder...")
    return clip

def embedding(self, prompts, *args, **kwargs):
    with torch.no_grad():
        try:
            y, y_mask = self.text_encoder(prompts)

            clip_embedding, _ = self.clip(prompts)

            len_clip = clip_embedding.shape[1]
            y_mask = torch.nn.functional.pad(y_mask, (len_clip, 0), value=1)  # pad attention_mask with clip's length

            data = {
                'y': y.detach().cpu(),
                'y_mask': y_mask.detach().cpu(),
                'clip_embedding': clip_embedding.to(torch.bfloat16).detach().cpu()
            }

            return data
        except Exception as err:
            print(f"{err}")
            return None
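
A guess at what caption_projection does with the two streams: project each caption width (6144 and 1024) into the DiT inner dim and concatenate along the sequence axis, with the CLIP tokens in front, mirroring how y_mask is left-padded by len_clip in embedding() above. Layer names and the concat order are assumptions.

import torch
import torch.nn as nn

inner_dim = 48 * 128  # from the transformer config

proj_llm = nn.Linear(6144, inner_dim)
proj_clip = nn.Linear(1024, inner_dim)

y = torch.randn(1, 320, 6144)              # stepllm output, max_length=320
clip_embedding = torch.randn(1, 77, 1024)  # HunyuanClip output, max_length=77

encoder_hidden_states = torch.cat(
    [proj_clip(clip_embedding), proj_llm(y)], dim=1)
print(encoder_hidden_states.shape)  # torch.Size([1, 397, 6144])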

DiT

  • StepVideoModel
  • How does it hook into the text encoder? Through a URL-based interface; how do we change that back?
    • Just revert the asyncio.run call inside encode_prompt (see the sketch below).
  • The xfuser parallelism library: does it need to be removed?
  • flow matching scheduler: FlowMatchDiscreteScheduler
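
A sketch of that revert: encode_prompt currently fetches embeddings from the remote caption service via asyncio.run, and for a local port it would call the fused encoder directly. self.caption and the returned keys follow the notes above; the exact signature is an assumption.

def encode_prompt_local(self, prompt):
    # before: data = asyncio.run(self.caption(prompt))   # remote HTTP call
    data = self.caption.embedding(prompt)                # direct local call
    return data['y'], data['y_mask'], data['clip_embedding']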

The model weights (diffusion_pytorch_model-00001-of-00006.safetensors and its sibling shards) total about 58 GB^3.

Porting plan

  1. Train first, then run inference.
  2. Turn the hyperparameters hard-coded for inference into optional entries in the JSON configs, and keep the tensors that components pass to each other aligned.
    1. Question: how do the VAE and DiT channels line up? The hyperparameters happen to both be 8: predictor:in_channels == ae:latent_dim == 8 (see the check sketched after this list).
    2. If a hyperparameter changes, can the weights still be loaded?
  3. Inference: since the weights pin the hyperparameters, it may not even fit in memory.
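
A small validation in the spirit of item 2.1: once the hyperparameters live in JSON, assert the channel widths agree before loading any weights. The key names (in_channels / latent_dim) follow the notes above and may differ from the real config files.

def check_channel_alignment(predictor_cfg: dict, ae_cfg: dict):
    assert predictor_cfg["in_channels"] == ae_cfg["latent_dim"], (
        f"DiT in_channels {predictor_cfg['in_channels']} != "
        f"VAE latent_dim {ae_cfg['latent_dim']}"
    )

check_channel_alignment({"in_channels": 8}, {"latent_dim": 8})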

References
