cuda Assembly:PTX & SASS
两种汇编
- parallel thread execution (PTX) 内联汇编有没有关系
- PTX是编程人员可以操作的最底层汇编,原因是SASS代码的实现会经常根据GPU架构而经常变换
- https://docs.nvidia.com/cuda//pdf/Inline_PTX_Assembly.pdf
- ISA指令手册 https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#instruction-set
- SASS
- Streaming ASSembly(Shader Assembly?) 没有官方的证明
- 没有官方详细的手册,有基本介绍:https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#ampere
- https://zhuanlan.zhihu.com/p/161624982
- 从可执行程序反汇编SASS
SASS 指令基本信息
对于Ampere架构
指令方向
1 | (instruction) (destination) (source1), (source2) ... |
各种寄存器说明
RX
for registersURX
for uniform registersSRX
for special system-controlled registersPX
for predicate registersc[X][Y]
for constant memory
SASS 举例说明1
SASS的难点在于指令的后缀。由于手册确实,需要结合PTX的后缀查看
1 | /*0028*/ IMAD R6.CC, R3, R5, c[0x0][0x20]; |
line1
1 | /*0028*/ IMAD R6.CC, R3, R5, c[0x0][0x20]; |
Extended-precision integer multiply-add: multiply R3 with R5, sum with constant in bank 0, offset 0x20, store in R6 with carry-out.
c[BANK][ADDR] is a constant memory。
.CC
means “set the flags”
line2
1 | /*0030*/ IMAD.HI.X R7, R3, R5, c[0x0][0x24]; |
Integer multiply-add with extract: multiply R3 with R5, extract upper half, sum that upper half with constant in bank 0, offset 0x24, store in R7 with carry-in.
line3
1 | /*0040*/ LD.E R2, [R6]; //load |
LD.E is a load from global memory using 64-bit address in R6,R7(表面上是R6,其实是R6 与 R7 组成的地址对)
summary
1 | R6 = R3*R5 + c[0x0][0x20], saving carry to CC |
寄存器是32位的原因是 SMEM的bank是4字节的。c数组将32位的基地址分开存了。
first two commands multiply two 32-bit values (R3 and R5) and add 64-bit value c[0x0][0x24]<<32+c[0x0][0x20],
leaving 64-bit address result in the R6,R7 pair
对应的代码是
1 | kernel f (uint32* x) // 64-bit pointer |
SASS Opt Code分析2
- LDG - Load form Global Memory
- ULDC - Load from Constant Memory into Uniform register
- USHF - Uniform Funnel Shift (猜测是特殊的加速shift)
- STS - Store within Local or Shared Window
流水STS
观察 偏移
- 4
- 2060(delta=2056)
- 4116(delta=2056)
- 8228(delta=2 * 2056)
- 6172(delta=-1 * 2056)
- 10284(delta=2 * 2056)
- 12340(delta=2056)
可见汇编就是中间写反了,导致不连续,不然能隐藏更多延迟
STS缓存寄存器来源
那么这些寄存器是怎么来的呢?感觉就是写反了
1 | IMAD.WIDE.U32 R16, R16, R19, c[0x0][0x168] |
Fix
原因是前面是手动展开的,假如等待编译器自动展开for循环就不会有这个问题
需要进一步的研究学习
暂无
遇到的问题
暂无
开题缘由、总结、反思、吐槽~~
参考文献
https://forums.developer.nvidia.com/t/solved-sass-code-analysis/41167/2
cuda Assembly:PTX & SASS
http://icarus.shaojiemike.top/2022/05/22/Work/Programming/2.1-Assembly/PTX_SASS/