250217 Step-Video-T2V Reading & Porting

Introduction

Reading the Step-Video-T2V code (git id d3ca3d6) and porting it to Ascend.

Framework highlights

api/call_remote_server.py demonstrates a compute setup that splits the VAE and the text encoder onto separate GPUs as standalone services.
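
To make the split concrete, here is a minimal sketch of the client side of that pattern, assuming a worker process that owns its own GPU and exchanges pickled tensors over HTTP; the endpoint path, port, and payload keys are illustrative, not the repo's actual routes.

import asyncio
import pickle
import aiohttp

async def call_worker(url: str, payload: dict):
    # Serialize the request with pickle, POST it to the remote worker,
    # and unpickle whatever tensors the worker sends back.
    async with aiohttp.ClientSession() as session:
        async with session.post(url, data=pickle.dumps(payload)) as resp:
            return pickle.loads(await resp.read())

async def main():
    # Hypothetical caption endpoint; the DiT process never loads the
    # text encoder itself, it just asks this service for embeddings.
    data = await call_worker("http://127.0.0.1:8080/caption",
                             {"prompts": ["a cat surfing a wave"]})
    print(data.keys())  # e.g. y / y_mask / clip_embedding, cf. embedding() below

asyncio.run(main())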

JSON configs

step_llm (text encoder)

{
  "_name_or_path": "/mnt/shared-storage/tenant/opensource/step_llm",
  "allow_transformer_engine": false,
  "architectures": [
    "Step1Model"
  ],
  "attention_dropout": 0.0,
  "attention_impl": "GQA",
  "base_batch_size": 128,
  "embedding_weights_in_fp32": false,
  "ffn_hidden_size": 16896,
  "fp32_residual_connection": false,
  "hidden_dropout": 0.0,
  "hidden_size": 6144,
  "kv_channels": 128,
  "layernorm_epsilon": 1e-05,
  "max_position_embeddings": 16384,
  "num_attention_groups": 8,
  "num_attention_heads": 48,
  "num_layers": 48,
  "orig_vocab_size": 65536,
  "overlap_p2p_comm": true,
  "padded_vocab_size": 65536,
  "params_dtype": "torch.bfloat16",
  "seq_length": 16384,
  "swiglu_recompute_silu_dot": true,
  "tokens_to_generate": 512,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.48.3",
  "use_flash_attn": true,
  "virtual_pipeline_model_parallel_size": 3
}
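
The GQA geometry in this config is internally consistent; a quick arithmetic check over the values above (plain Python, no model code involved):

hidden_size = 6144
num_attention_heads = 48
num_attention_groups = 8   # KV-head groups for GQA
kv_channels = 128

head_dim = hidden_size // num_attention_heads
assert head_dim == kv_channels == 128            # per-head dim matches kv_channels
assert num_attention_heads % num_attention_groups == 0
print(num_attention_heads // num_attention_groups)  # 6 query heads share each KV head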

transformer

{
  "_class_name": "StepVideoModel",
  "_diffusers_version": "0.31.0",
  "attention_head_dim": 128,
  "attention_type": "parallel",
  "caption_channels": [
    6144,
    1024
  ],
  "dropout": 0.0,
  "in_channels": 64,
  "norm_elementwise_affine": false,
  "norm_eps": 1e-06,
  "norm_type": "ada_norm_single",
  "num_attention_heads": 48,
  "num_layers": 48,
  "out_channels": 64,
  "patch_size": 1,
  "use_additional_conditions": false
}
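
Note how this config ties back to the text encoders: the DiT inner width equals the stepllm hidden size, while the CLIP stream (1024) must be projected up. A small consistency check on the values above:

attention_head_dim = 128
num_attention_heads = 48
caption_channels = [6144, 1024]   # [stepllm, HunyuanClip]

inner_dim = num_attention_heads * attention_head_dim
assert inner_dim == 6144 == caption_channels[0]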

flow matching scheduler

{
  "_class_name": "FlowMatchDiscreteScheduler",
  "_diffusers_version": "0.31.0",
  "device": null,
  "num_train_timesteps": 1000,
  "reverse": false,
  "solver": "euler"
}
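
"solver": "euler" suggests plain Euler integration of the learned velocity field. A minimal sketch of one such step, assuming the common convention x_next = x + (sigma_next - sigma) * v_pred; the actual FlowMatchDiscreteScheduler may differ in sign or normalization.

import torch

def euler_step(x: torch.Tensor, v_pred: torch.Tensor,
               sigma: float, sigma_next: float) -> torch.Tensor:
    # v_pred is the model's predicted velocity at noise level sigma
    return x + (sigma_next - sigma) * v_pred

x = torch.randn(1, 64, 8, 8)   # fake latent
v = torch.randn_like(x)
x = euler_step(x, v, sigma=1.0, sigma_next=0.98)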

VAE

  • from stepvideo.vae.vae import AutoencoderKL
  • The VAE is roughly 1-2 GB^3
  • How does it plug into the DiT pipeline? How to swap it in for OSP1.5 step by step?
def decode_vae(self, samples):
    samples = asyncio.run(self.vae(samples.cpu()))
    return samples
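
The asyncio.run here hides a remote call to the VAE service. For the port, a local variant could bypass the worker entirely; this sketch assumes self.vae is (or wraps) the AutoencoderKL above with a decode() method, and self.vae_dtype is a hypothetical attribute.

import torch

def decode_vae_local(self, samples):
    # Decode latents in-process instead of round-tripping through HTTP
    with torch.no_grad():
        samples = self.vae.decode(samples.to(self.vae_dtype))
    return samples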

On enabling the dual path

use_conv_shortcut
version == 2  # does this enable it?
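
use_conv_shortcut is presumably the classic ResnetBlock shortcut switch from the VQGAN/ldm VAE lineage; whether version == 2 gates it here is the open question above, so this sketch only shows the standard pattern.

import torch.nn as nn

class ResnetBlockShortcut(nn.Module):
    def __init__(self, in_channels, out_channels, use_conv_shortcut=False):
        super().__init__()
        self.shortcut = None
        if in_channels != out_channels:
            if use_conv_shortcut:
                # 3x3 conv shortcut: the "dual path" carries spatial context
                self.shortcut = nn.Conv2d(in_channels, out_channels, 3, padding=1)
            else:
                # 1x1 shortcut: channel mixing only
                self.shortcut = nn.Conv2d(in_channels, out_channels, 1)

    def forward(self, x, h):
        # h is the residual branch output with out_channels channels
        if self.shortcut is not None:
            x = self.shortcut(x)
        return x + h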

text encoder

  • stepllm (~40 GB) and HunyuanClip (4 GB)^3 are fused into self.caption
  • The prompt goes through encode_prompt to produce encoder_hidden_states, which then passes through caption_projection (see the projection sketch after the code below)
def build_llm(self, model_dir):
    from stepvideo.text_encoder.stepllm import STEP1TextEncoder
    text_encoder = STEP1TextEncoder(model_dir, max_length=320).to(dtype).to(device).eval()
    print("Initialized text encoder...")
    return text_encoder

def build_clip(self, model_dir):
    from stepvideo.text_encoder.clip import HunyuanClip
    clip = HunyuanClip(model_dir, max_length=77).to(device).eval()
    print("Initialized clip encoder...")
    return clip

def embedding(self, prompts, *args, **kwargs):
    with torch.no_grad():
        try:
            y, y_mask = self.text_encoder(prompts)

            clip_embedding, _ = self.clip(prompts)

            len_clip = clip_embedding.shape[1]
            y_mask = torch.nn.functional.pad(y_mask, (len_clip, 0), value=1)  # pad attention_mask with clip's length

            data = {
                'y': y.detach().cpu(),
                'y_mask': y_mask.detach().cpu(),
                'clip_embedding': clip_embedding.to(torch.bfloat16).detach().cpu()
            }

            return data
        except Exception as err:
            print(f"{err}")
            return None
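
A guess at what caption_projection does with the two streams: project each caption width (6144 and 1024) into the DiT inner dim and concatenate along the sequence axis, with the CLIP tokens in front, mirroring how y_mask is left-padded by len_clip in embedding() above. Layer names and the concat order are assumptions.

import torch
import torch.nn as nn

inner_dim = 48 * 128  # from the transformer config

proj_llm = nn.Linear(6144, inner_dim)
proj_clip = nn.Linear(1024, inner_dim)

y = torch.randn(1, 320, 6144)              # stepllm output, max_length=320
clip_embedding = torch.randn(1, 77, 1024)  # HunyuanClip output, max_length=77

encoder_hidden_states = torch.cat(
    [proj_clip(clip_embedding), proj_llm(y)], dim=1)
print(encoder_hidden_states.shape)  # torch.Size([1, 397, 6144])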

DiT

  • StepVideoModel
  • How does it hook into the text encoder? Through a URL-based interface; how do we change that back?
    • Just revert the asyncio.run call inside encode_prompt (see the sketch below).
  • The xfuser parallelism library: does it need to be removed?
  • flow matching scheduler: FlowMatchDiscreteScheduler
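
A sketch of that revert: encode_prompt currently fetches embeddings from the remote caption service via asyncio.run, and for a local port it would call the fused encoder directly. self.caption and the returned keys follow the notes above; the exact signature is an assumption.

def encode_prompt_local(self, prompt):
    # before: data = asyncio.run(self.caption(prompt))   # remote HTTP call
    data = self.caption.embedding(prompt)                # direct local call
    return data['y'], data['y_mask'], data['clip_embedding']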

The model weights (diffusion_pytorch_model-00001-of-00006.safetensors and its sibling shards) total about 58 GB^3.

Porting plan

  1. Train first, then run inference.
  2. Turn the hyperparameters hard-coded for inference into optional entries in the JSON configs, and keep the tensors that components pass to each other aligned.
    1. Question: how do the VAE and DiT channels line up? The hyperparameters happen to both be 8: predictor:in_channels == ae:latent_dim == 8 (see the check sketched after this list).
    2. If a hyperparameter changes, can the weights still be loaded?
  3. Inference: since the weights pin the hyperparameters, it may not even fit in memory.
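
A small validation in the spirit of item 2.1: once the hyperparameters live in JSON, assert the channel widths agree before loading any weights. The key names (in_channels / latent_dim) follow the notes above and may differ from the real config files.

def check_channel_alignment(predictor_cfg: dict, ae_cfg: dict):
    assert predictor_cfg["in_channels"] == ae_cfg["latent_dim"], (
        f"DiT in_channels {predictor_cfg['in_channels']} != "
        f"VAE latent_dim {ae_cfg['latent_dim']}"
    )

check_channel_alignment({"in_channels": 8}, {"latent_dim": 8})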

References
