Cuda Optimize : Vectorized Memory Access

baseline

1
2
3
4
5
6
7
8
9
10
11
12
13
// Scalar copy baseline: each thread walks d_in/d_out with a grid-stride loop,
// moving one 32-bit int per iteration (shows up as LD.E / ST.E in SASS).
__global__ void device_copy_scalar_kernel(int* d_in, int* d_out, int N) {
    const int stride = blockDim.x * gridDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    while (i < N) {
        d_out[i] = d_in[i];
        i += stride;
    }
}

// Host-side launcher for the scalar copy kernel.
// Grid size is ceil(N / threadsPerBlock) clamped to MAX_BLOCKS; the kernel's
// grid-stride loop covers any elements beyond the clamped grid.
void device_copy_scalar(int* d_in, int* d_out, int N)
{
    const int threadsPerBlock = 128;
    const int gridSize = min((N + threadsPerBlock - 1) / threadsPerBlock, MAX_BLOCKS);
    device_copy_scalar_kernel<<<gridSize, threadsPerBlock>>>(d_in, d_out, N);
}

简单的分块拷贝。

通过 cuobjdump -sass executable 得到标量copy对应的SASS代码

1
2
3
4
5
6
/*0058*/ IMAD R6.CC, R0, R9, c[0x0][0x140]                
/*0060*/ IMAD.HI.X R7, R0, R9, c[0x0][0x144]
/*0068*/ IMAD R4.CC, R0, R9, c[0x0][0x148]
/*0070*/ LD.E R2, [R6]
/*0078*/ IMAD.HI.X R5, R0, R9, c[0x0][0x14c]
/*0090*/ ST.E [R4], R2

(SASS不熟悉,请看SASS一文)

其中4条IMAD指令计算出读取和存储所用的地址对 R6:R7 与 R4:R5。第4和第6条指令执行32位的访存命令。

Vector way1: CUDA C/C++ standard headers

通过使用int2, int4, or float2

比如将int的指针d_in类型转换然后赋值。

1
2
3
reinterpret_cast<int2*>(d_in)
// equivalent C-style cast
((int2*)(d_in))

但是需要注意对齐问题,比如

1
reinterpret_cast<int2*>(d_in+1)

这样是非法的。

Vector way2: structures

通过使用对齐的结构体来实现同样的目的。

1
2
3
4
struct Foo {int a; int b; double c;}; // 16 bytes in size
Foo *x, *y;

x[i]=y[i];

实际修改LD.E.64

执行for循环次数减半,注意边界处理。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
// Copy N ints from d_in to d_out using 64-bit (int2) vectorized accesses
// (LD.E.64 / ST.E.64 in SASS), halving the instruction count vs. the scalar copy.
// Precondition: d_in and d_out are 8-byte aligned (true for cudaMalloc'd base
// pointers; reinterpret_cast<int2*>(d_in + 1) would be misaligned and illegal).
__global__ void device_copy_vector2_kernel(int* d_in, int* d_out, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Grid-stride loop over the N/2 int2 elements.
    for (int i = idx; i < N / 2; i += blockDim.x * gridDim.x) {
        reinterpret_cast<int2*>(d_out)[i] = reinterpret_cast<int2*>(d_in)[i];
    }

    // Tail: one scalar element remains when N is odd. Thread 0 handles it.
    // Bug fix: the original condition was `idx == N/2`, but when the launch is
    // capped at MAX_BLOCKS and N/2 >= gridDim.x * blockDim.x no such thread
    // exists, so the last element was silently never copied.
    if (idx == 0 && N % 2 == 1)
        d_out[N - 1] = d_in[N - 1];
}

// Host-side launcher for the int2 vectorized copy: each thread moves 8 bytes
// per iteration, so the grid only needs to cover N/2 elements (clamped to
// MAX_BLOCKS; the kernel's grid-stride loop covers the rest).
// Bug fix: the original declared the parameter as `n` but used undeclared `N`
// in the body, and `threads`/`blocks` were missing their `int` declarations —
// it did not compile. Now matches the sibling launchers' style.
void device_copy_vector2(int* d_in, int* d_out, int N) {
    int threads = 128;
    int blocks = min((N / 2 + threads - 1) / threads, MAX_BLOCKS);

    device_copy_vector2_kernel<<<blocks, threads>>>(d_in, d_out, N);
}

对应汇编可以看出

1
2
3
4
5
6
/*0088*/                IMAD R10.CC, R3, R5, c[0x0][0x140]              
/*0090*/ IMAD.HI.X R11, R3, R5, c[0x0][0x144]
/*0098*/ IMAD R8.CC, R3, R5, c[0x0][0x148]
/*00a0*/ LD.E.64 R6, [R10]
/*00a8*/ IMAD.HI.X R9, R3, R5, c[0x0][0x14c]
/*00c8*/ ST.E.64 [R8], R6

变成了LD.E.64

实际修改LD.E.128

执行for循环次数减半,注意边界处理。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
// Copy N ints from d_in to d_out using 128-bit (int4) vectorized accesses
// (LD.E.128 / ST.E.128 in SASS), quartering the instruction count vs. scalar.
// Precondition: d_in and d_out are 16-byte aligned (true for cudaMalloc'd
// base pointers).
__global__ void device_copy_vector4_kernel(int* d_in, int* d_out, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Grid-stride loop over the N/4 int4 elements.
    for (int i = idx; i < N / 4; i += blockDim.x * gridDim.x) {
        reinterpret_cast<int4*>(d_out)[i] = reinterpret_cast<int4*>(d_in)[i];
    }

    // Tail: up to 3 scalar elements remain when N is not a multiple of 4.
    // Bug fix: the original used `idx == N/4`, but that thread does not exist
    // when the grid is capped at MAX_BLOCKS and N/4 >= gridDim.x * blockDim.x,
    // silently dropping the tail. Thread 0 always exists, so it does the work.
    // (Also removes the inner `int idx` that shadowed the outer index.)
    if (idx == 0) {
        for (int j = N - N % 4; j < N; ++j)
            d_out[j] = d_in[j];
    }
}

// Host-side launcher for the int4 vectorized copy: each thread moves 16 bytes
// per iteration, so the grid only needs to cover N/4 elements (clamped to
// MAX_BLOCKS; the kernel's grid-stride loop covers the rest).
void device_copy_vector4(int* d_in, int* d_out, int N) {
    const int threadsPerBlock = 128;
    const int vecCount = N / 4;  // number of int4 elements to move
    const int gridSize = min((vecCount + threadsPerBlock - 1) / threadsPerBlock, MAX_BLOCKS);

    device_copy_vector4_kernel<<<gridSize, threadsPerBlock>>>(d_in, d_out, N);
}

对应汇编可以看出

1
2
3
4
5
6
/*0090*/                IMAD R10.CC, R3, R13, c[0x0][0x140]              
/*0098*/ IMAD.HI.X R11, R3, R13, c[0x0][0x144]
/*00a0*/ IMAD R8.CC, R3, R13, c[0x0][0x148]
/*00a8*/ LD.E.128 R4, [R10]
/*00b0*/ IMAD.HI.X R9, R3, R13, c[0x0][0x14c]
/*00d0*/ ST.E.128 [R8], R4

变成了LD.E.128

summary

(个人感觉,提升也不大吗?也没有两倍和四倍的效果)

绝大部分情况,向量访存比标量好:increase bandwidth, reduce instruction count, and reduce latency。

但是会增加额外的寄存器(SASS里也没有看到??)和降低并行性(什么意思???)

参考文献

https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/#entry-content-comments

cuda Assembly:PTX & SASS

两种汇编

  1. parallel thread execution (PTX) 内联汇编有没有关系
    1. PTX是编程人员可以操作的最底层汇编,原因是SASS代码的实现会经常根据GPU架构而经常变换
    2. https://docs.nvidia.com/cuda//pdf/Inline_PTX_Assembly.pdf
    3. ISA指令手册 https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#instruction-set
  2. SASS
    1. Streaming ASSembly(Shader Assembly?) 没有官方的证明
    2. 没有官方详细的手册,有基本介绍:https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#ampere
    3. https://zhuanlan.zhihu.com/p/161624982
    4. 从可执行程序反汇编SASS
      1. https://www.findhao.net/easycoding/2339.html

SASS 指令基本信息

对于Ampere架构

指令方向

1
(instruction) (destination) (source1), (source2) ...

各种寄存器说明

  • RX for registers
  • URX for uniform registers
  • SRX for special system-controlled registers
  • PX for predicate registers
  • c[X][Y] for constant memory

SASS 举例说明1

SASS的难点在于指令的后缀。由于手册缺失,需要结合PTX的后缀查看

1
2
3
/*0028*/         IMAD R6.CC, R3, R5, c[0x0][0x20]; 
/*0030*/ IMAD.HI.X R7, R3, R5, c[0x0][0x24];
/*0040*/ LD.E R2, [R6]; //load

line1

1
/*0028*/ IMAD R6.CC, R3, R5, c[0x0][0x20];

Extended-precision integer multiply-add: multiply R3 with R5, sum with constant in bank 0, offset 0x20, store in R6 with carry-out.

c[BANK][ADDR] is a constant memory。

.CC means “set the flags”

line2

1
/*0030*/ IMAD.HI.X R7, R3, R5, c[0x0][0x24];

Integer multiply-add with extract: multiply R3 with R5, extract upper half, sum that upper half with constant in bank 0, offset 0x24, store in R7 with carry-in.

line3

1
/*0040*/         LD.E R2, [R6]; //load

LD.E is a load from global memory using 64-bit address in R6,R7(表面上是R6,其实是R6 与 R7 组成的地址对)

summary

1
2
3
R6 = R3*R5 + c[0x0][0x20], saving carry to CC
R7 = hi32(R3*R5) + c[0x0][0x24] + CC
R2 = *((R7<<32) + R6)

寄存器是32位的原因是 SMEM的bank是4字节的。c数组将64位的基地址拆成两个32位分开存了。

first two commands multiply two 32-bit values (R3 and R5) and add 64-bit value c[0x0][0x24]<<32+c[0x0][0x20],

leaving 64-bit address result in the R6,R7 pair

对应的代码是

1
2
3
4
kernel f (uint32* x) // 64-bit pointer
{
R2 = x[R3*R5]
}

SASS Opt Code分析2

  • LDG - Load form Global Memory
  • ULDC - Load from Constant Memory into Uniform register
  • USHF - Uniform Funnel Shift (猜测是特殊的加速shift)
  • STS - Store within Local or Shared Window

流水STS

观察 偏移

  • 4
  • 2060(delta=2056)
  • 4116(delta=2056)
  • 8228(delta=2 * 2056)
  • 6172(delta=-1 * 2056)
  • 10284(delta=2 * 2056)
  • 12340(delta=2056)

可见汇编就是中间写反了,导致不连续,不然能隐藏更多延迟

STS缓存寄存器来源

那么这些寄存器是怎么来的呢?感觉就是写反了

1
2
3
4
5
6
7
8
9
10
11
IMAD.WIDE.U32 R16, R16, R19, c[0x0][0x168] 
LDG.E R27, [R16.64]
IMAD.WIDE R30, R19, c[0x0][0x164], R16
LDG.E R31, [R30.64]
IMAD.WIDE R32, R19, c[0x0][0x164], R30
LDG.E R39, [R32.64]
# important R41 R37
IMAD.WIDE R34, R19, c[0x0][0x164], R32
IMAD.WIDE R40, R19, c[0x0][0x164], R34
LDG.E R41, [R40.64]
LDG.E R37, [R34.64]

Fix

原因是前面是手动展开的,假如等待编译器自动展开for循环就不会有这个问题

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

https://forums.developer.nvidia.com/t/solved-sass-code-analysis/41167/2

https://stackoverflow.com/questions/35055014/how-to-understand-the-result-of-sass-analysis-in-cuda-gpu