SHAOJIE'S BOOK

Posted 2024-01-05Updated 2025-01-30Programming21 minutes read (About 3103 words)

The CUDA Execution Model

导言

The CUDA Execution Model: how the CUDA architecture handles the execution of code in parallel.

这篇将聚焦于CUDA的基础知识，基本概念，

Posted 2023-05-11Updated 2025-01-30Tutorials18 minutes read (About 2710 words)

Nvidia Nsight

Nsight system compute & Graph 的关系

Nsight Systems

All developers should start with Nsight Systems to identify the largest optimization opportunities. Nsight Systems provides developers a system-wide visualization of an applications performance. Developers can optimize bottlenecks to scale efficiently across any number or size of CPUs and GPUs; from large servers to our smallest SoC. For further optimizations to compute kernels developers should use Nsight Compute or to further optimize a graphics workloads, use Nsight Graphics.

Nsight Compute

Nsight Compute is an interactive kernel profiler for CUDA applications. It provides detailed performance metrics and API debugging via a user interface and command line tool. Nsight Compute also provides customizable and data-driven user interface and metric collection that can be extended with analysis scripts for post-processing results.

Nsight Graphics

Nsight Graphics is a standalone application for the debugging, profiling, and analysis of graphics applications on Microsoft Windows and Linux. It allows you to optimize the performance of your Direct3D 11, Direct3D 12, DirectX Raytracing 1.1, OpenGL, Vulkan, and KHR Vulkan Ray Tracing Extension based applications.

Install Nsight local

check the perf config To collect thread scheduling data and IP (instruction pointer) samples
1. cat /proc/sys/kernel/perf_event_paranoid
2. 如果大于2，临时改变 sudo sh -c 'echo 2 >/proc/sys/kernel/perf_event_paranoid'重启会重置
3. 永久修改 sudo sh -c 'echo kernel.perf_event_paranoid=2 > /etc/sysctl.d/local.conf'
下载Nsight
1. 但是单独下载要会员
2. 下载cuda toolkit,有集成

Nsight System

目标与功能

运行 nsight-sys，可以从整体上看GPU,CPU资源的使用情况，和分辨出热点函数和kernel，但是对于为什么是热点给不出具体分析。

基本使用

勾选了CUDA-trace, GPU Metrics选项

GPU Metrics 需要 sudo

否则会报错。一般情况下使用sudo能保证0 error

GPU Metrics [0]: The user running Nsight Systems does not have permission to access NVIDIA GPU Performance Counters on the target device. For more details, please visit https://developer.nvidia.com/ERR_NVGPUCTRPERM
  - API function: NVPW_GPU_PeriodicSampler_GetCounterAvailability(&params)
  - Error code: 17
  - Source function: static std::vector<unsigned char> QuadDDaemon::EventSource::GpuMetricsBackend::Impl::CounterConfig::GetCounterAvailabilityImage(uint32_t)
  - Source location: /dvs/p4/build/sw/devtools/Agora/Rel/DTC_F/QuadD/Target/quadd_d/quadd_d/jni/EventSource/GpuMetricsBackend.cpp:609

Profile 速度

大致2到3倍时间：默认采样率，单独运行52s, Nsight-sys模拟需要135s。

HPC APP : PCIE, GPU DRAM Bandwidth, Warp

GPU Metrics选项能看出 PCIE, GPU DRAM Bandwidth, Warp的使用情况。

Compute Warps in Flight

将鼠标放在上面会有具体的数值或者名称的解释，(正在使用的Warps)

Unallocated Warps in Active SMs

Definition: This metric represents the number of warps that are not actively executing but are assigned to an active Streaming Multiprocessor (SM).
Interpretation: In CUDA, SMs are the fundamental processing units on the GPU. Each SM can execute multiple warps concurrently. “Unallocated Warps in Active SMs” indicates the number of warps that are ready to be scheduled on an SM but are currently waiting due to resource contention or other factors. A high number may suggest that there is room for additional work but available resources are not fully utilized.

NVTX

由于没有根据kernel function区分，很难读。为此提供了NVTX来给代码打标签

The NVIDIA Tools Extension Library (NVTX)

使用NVTX可以在C代码里插入标记，使得Nvsight能有效监控区域代码。

头文件：[^1]

1	#include <nvToolsExt.h>

需要标记代码前后加入：

nvtxRangePush("checkResult"); //nvtxRangePushA,nvtxRangePushW,nvtxRangePushEx 好像都差不多
checkResult<<<dim3(row_num / TPBX, col_num / TPBY, 1), dim3(TPBX, TPBY, 1)>>>(row_num, col_num, result);
cudaDeviceSynchronize(); 
nvtxRangePop();

注意NVTX是作用在CPU线程上的，无法在GPU里用。

注意需要 g++ -o testnv -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lnvToolsExt testnv.cpp。或者修改cmake来实现同样的效果

NVTX问题：怎么不在同一竖直方向上？GPU还先跑是什么情况[^2]

AI APP: Stable Diffusion XL

具体分析见 Deploy Stable Diffusion to A100

Nsight Compute

Nsight Systems 就是nvprof的继任者，NVIDIA最新的用于监测 kernel timeline的工具。
NVIDIA 计算能力7.5及以上的GPU设备(从A100开始)不再支持nvprof工具进行性能剖析，提示使用Nsight Compute作为替代品.

目标与功能

默认kernel模式，会根据 function的调度关系，将程序划分为kernel

Summary：给出in-order执行的每个kernel的参数，时间，资源占用(寄存器，计算访存单元)信息。
1. Detail: 对于被选择的kernel给出， NV的优化建议
2. Source：对于被选择的kernel给出，给出源代码

基本使用

1
2
3

# recommand running under sudo
ncu # 命令行 Nsight Compute CLI(ncu)
ncu-ui # GUI

Profile速度

目测模拟时间慢百倍。

使用Nsight Compute CLI (nv-nsight-cu-cli / ncu) 输出数据

nv-nsight-cu-cli -> ncu

下面是一个使用样例：

1	/usr/local/NVIDIA-Nsight-Compute/nv-nsight-cu-cli -o mnist -f --csv --profile-from-start off /usr/bin/python3 mnist.py

其中-o是为了输出.nsight-cuprof-report文件用于后续的可视化查看，-f为强制覆盖原有文件，–csv可是在console输出除 timeline 以外数据的时候以逗号分隔数据，方便拷贝至csv文件， –profile-from-start的使用方法和Nsight System以及nvprof一样。其余flag选项可见文档。

上面的例子会生成mnist.nsight-cuprof-report文件。

注意

最前面的可执行文件需要绝对路径，如上面的python3需要使用 /usr/bin/python3。
生成过程中可能会产生很大的临时文件（几十G）。如果本次磁盘空间不够，可以设置如下环境变量来调整存储临时文件的地址。没有找到能直接使用 Nsight Compute 修改临时文件地址的方式。

1	export /TMPDIR=/path/for/tmp

ncu与nvprof命令行抓取参数的映射表

https://www.freesion.com/article/34871449930/

ncu-ui教程

为了显示原代码makefile添加 -g -G选项
对应CmakeList.txt

1
2
3

target_compile_options(better PUBLIC $<$<COMPILE_LANGUAGE:CUDA>:--extended-lambda
    -G -src-in-ptx
    >)

https://blog.csdn.net/yan31415/article/details/109491749

ncu-ui表格&图

我不明白我的SMEM怎么不是从DRAM来的，而且峰值怎么这么低？

这个错误也是令人迷惑
The memory access pattern for loads from L1TEX to L2 is not optimal. The granularity of an L1TEX request to L2 is a 128 byte cache line. That is 4 consecutive 32-byte sectors per L2 request. However, this kernel only accesses an average of 3.7 sectors out of the possible 4 sectors per cache line. Check the Source Counters section for uncoalesced loads and try to minimize how many cache lines need to be accessed per memory request.

不知道为什么有1%和2% 的bank conflict

可以看到 SMEM， Register，Block Size是怎么影响GPU Warp的分配调度的。

上图没有拖累，吃满了64个warp。

关于if语句

if语句只要warp里执行相同就行。

可以提示出不连续访问的地方。(这里是这样设计的，已经避免了绝大部分的不连续访问)

显示stall最多的指令是什么以及在等待什么。还有执行最多的指令

假如 file mismatched 手动选择文件就行

stall的信息，感觉就这些有点用。(其中sb是scoreboard的意思)

ncu-ui 分析汇编

PTX&SASS汇编说明

有两种汇编

请看PTX SASS一文

基本说明

可以通过指令执行数或者采样率来得知，执行最多的指令。

鼠标悬停可以知道具体命令的含义

Ex1: for循环头

Ex2: for-loop kernel

1	sdata[Regular_local_index]=arr_data[Regular_global_index];

该从DRAM里读取到SMEM的指令对应的PTX和SASS代码

1	cvt.f32.u16 d, a; // convert 16-bit unsigned to 32-bit float

问题：无效self-mov？

为了隐藏延迟？

直接原因是PTX翻译成SASS。一条mov变多条了

CUDA Visual Profiler

老一代debugger工具，逐渐被Nsight淘汰

1 2	nvprof # 命令行,nsys 之前的名称叫做 nvprof nvvp

在more里有建议

nvprof捕获信息存储

1
2
3

nvprof --analysis-metrics -o  nbody-analysis.nvprof ./nbody --benchmark -numdevices=2 -i=1
# 下面输出 .qdrep 文件
nsys profile --stats=true --force-overwrite=true  -o baseline-report ./single-thread-vector-add

CUDA Visual Profiler 问题

==7196== Warning

解决方法在程序末尾加cudaDeviceReset()或者cudaProfilerStop()

Nsight Compute 问题

OpenGL 没有安装

Warning: Failed to get OpenGL version. OpenGL version 2.0 or higher is required.
OpenGL version is too low (0). Falling back to Mesa software rendering.
qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.

Available platform plugins are: offscreen, wayland-egl, wayland, wayland-xcomposite-egl, wayland-xcomposite-glx, xcb.

解决办法

1 2	sudo apt-get install libxcb-xinerama0 sudo apt install libxcb-icccm4 libxcb-image0 libxcb-keysyms1 libxcb-render-util0

Qt插件缺失

qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.

Available platform plugins are: xcb.

Application could not be initialized!
    This is likely due to missing Qt platform dependencies.
    For a list of dependencies, please refer to https://doc.qt.io/qt-5/linux-requirements.html
    To view missing libraries, set QT_DEBUG_PLUGINS=1 and re-run the application.

按照说明 export QT_DEBUG_PLUGINS=1再次运行, 显示具体问题

Cannot load library /staff/shaojiemike/Install/cuda_11.7.0_515.43.04_linux/nsight-compute-2022.2.0/host/linux-desktop-glibc_2_11_3-x64/Plugins/platforms/libqxcb.so: (libxcb-xinput.so.0: cannot open shared object file: No such file or directory)

解决 sudo apt-get install libxcb-xinput0

kernel没权限profile

ERR_NVGPUCTRPERM - The user does not have permission to profile on the target device

要用sudo，或者最新的NV

could not connect to display localhost:10.0 under sudo

$ sudo ncu-ui
MobaXterm X11 proxy: Authorisation not recognised
qt.qpa.xcb: could not connect to display localhost:10.0

MobaXterm X11 proxy: Unsupported authorisation protocol
Error: Can't open display: localhost:10.0

解决办法(原因是sudo相当于切换到root用户，丢失了xauth信息)

$ xauth list
snode0/unix:12  MIT-MAGIC-COOKIE-1  84941f1f8be97d19436356685f75b884
snode0/unix:13  MIT-MAGIC-COOKIE-1  5172ee2c7364b055cd37538b460f7741
snode0/unix:11  MIT-MAGIC-COOKIE-1  589f3b5ab852f24ca3710c53e6439260
hades1/unix:10  MIT-MAGIC-COOKIE-1  9346adec202bd65250f3d21239025750
snode0/unix:10  MIT-MAGIC-COOKIE-1  52285c563f1688741fa1b434ed2b7b2c

sudo -s # 切换
xauth add snode0/unix:10  MIT-MAGIC-COOKIE-1  52285c563f1688741fa1b434ed2b7b2c # 补全xauth
# 正常执行 xauth有用的总是最后一个

GPU Metrics [0]: Sampling buffer overflow.

只勾选CUDA Metrics 和 GPU Metrics
降低采样频率

Error 0: UnsupportedGpu

原因是软件对GPU的支持是逐步的需要安装最新的。

不支持的Nsight的可以尝试老的debugger工具 CUDA Visual Profiler

Error: Profiling is not supported on this device

Pascal support was deprecated, then dropped from Nsight Compute after Nsight Compute 2019.5.1.

The profiling tools that support Pascal in the CUDA Toolkit 11.1 and later are nvprof and visual profiler.

需要进一步的研究学习

暂无

遇到的问题

NVTX问题

开题缘由、总结、反思、吐槽~~

参考文献

https://developer.nvidia.com/tools-overview

https://www.365seal.com/y/zyn1yxJQn3.html

[^1]: Usage of NVTX

Posted 2023-05-06Updated 2025-01-30Programming18 minutes read (About 2695 words)

Cuda Program Basic

CUDA编程水平高低的不同，会导致几十上百倍的性能差距。但是这篇将聚焦于CUDA的编程语法，编译与运行。

Posted 2022-05-22Updated 2025-01-30Programming5 minutes read (About 761 words)

Cuda Optimize : Vectorized Memory Access

baseline

__global__ void device_copy_scalar_kernel(int* d_in, int* d_out, int N) { 
  int idx = blockIdx.x * blockDim.x + threadIdx.x; 
  for (int i = idx; i < N; i += blockDim.x * gridDim.x) { 
    d_out[i] = d_in[i]; 
  } 
} 

void device_copy_scalar(int* d_in, int* d_out, int N) 
{ 
  int threads = 128; 
  int blocks = min((N + threads-1) / threads, MAX_BLOCKS);  
  device_copy_scalar_kernel<<<blocks, threads>>>(d_in, d_out, N); 
}

简单的分块拷贝。

通过cuobjdump -sass executable.得到对应的标量copy对应的SASS代码

/*0058*/ IMAD R6.CC, R0, R9, c[0x0][0x140]                
/*0060*/ IMAD.HI.X R7, R0, R9, c[0x0][0x144]              
/*0068*/ IMAD R4.CC, R0, R9, c[0x0][0x148]               
/*0070*/ LD.E R2, [R6]                                   
/*0078*/ IMAD.HI.X R5, R0, R9, c[0x0][0x14c]              
/*0090*/ ST.E [R4], R2

（SASS不熟悉，请看SASS一文）

其中4条IMAD指令计算出读取和存储的指令地址R6:R7和R4:R5。第4和6条指令执行32位的访存命令。

Vector way1: CUDA C/C++ standard headers

通过使用int2, int4, or float2

比如将int的指针d_in类型转换然后赋值。

1
2
3

reinterpret_cast<int2*>(d_in)
// simple in C99
(int2*(d_in))

但是需要注意对齐问题，比如

1	reinterpret_cast<int2*>(d_in+1)

这样是非法的。

Vector way2: structures

通过使用对齐的结构体来实现同样的目的。

struct Foo {int a, int b, double c}; // 16 bytes in size
Foo *x, *y;
…
x[i]=y[i];

实际修改LD.E.64

执行for循环次数减半，注意边界处理。

__global__ void device_copy_vector2_kernel(int* d_in, int* d_out, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  for (int i = idx; i < N/2; i += blockDim.x * gridDim.x) {
    reinterpret_cast<int2*>(d_out)[i] = reinterpret_cast<int2*>(d_in)[i];
  }

  // in only one thread, process final element (if there is one)
  if (idx==N/2 && N%2==1)
    d_out[N-1] = d_in[N-1];
}

void device_copy_vector2(int* d_in, int* d_out, int n) {
  threads = 128; 
  blocks = min((N/2 + threads-1) / threads, MAX_BLOCKS); 

  device_copy_vector2_kernel<<<blocks, threads>>>(d_in, d_out, N);
}

对应汇编可以看出

/*0088*/                IMAD R10.CC, R3, R5, c[0x0][0x140]              
/*0090*/                IMAD.HI.X R11, R3, R5, c[0x0][0x144]            
/*0098*/                IMAD R8.CC, R3, R5, c[0x0][0x148]             
/*00a0*/                LD.E.64 R6, [R10]                                      
/*00a8*/                IMAD.HI.X R9, R3, R5, c[0x0][0x14c]           
/*00c8*/                ST.E.64 [R8], R6

变成了LD.E.64

实际修改LD.E.128

执行for循环次数减半，注意边界处理。

__global__ void device_copy_vector4_kernel(int* d_in, int* d_out, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  for(int i = idx; i < N/4; i += blockDim.x * gridDim.x) {
    reinterpret_cast<int4*>(d_out)[i] = reinterpret_cast<int4*>(d_in)[i];
  }

  // in only one thread, process final elements (if there are any)
  int remainder = N%4;
  if (idx==N/4 && remainder!=0) {
    while(remainder) {
      int idx = N - remainder--;
      d_out[idx] = d_in[idx];
    }
  }
}

void device_copy_vector4(int* d_in, int* d_out, int N) {
  int threads = 128;
  int blocks = min((N/4 + threads-1) / threads, MAX_BLOCKS);

  device_copy_vector4_kernel<<<blocks, threads>>>(d_in, d_out, N);
}

对应汇编可以看出

/*0090*/                IMAD R10.CC, R3, R13, c[0x0][0x140]              
/*0098*/                IMAD.HI.X R11, R3, R13, c[0x0][0x144]            
/*00a0*/                IMAD R8.CC, R3, R13, c[0x0][0x148]               
/*00a8*/                LD.E.128 R4, [R10]                               
/*00b0*/                IMAD.HI.X R9, R3, R13, c[0x0][0x14c]             
/*00d0*/                ST.E.128 [R8], R4

变成了LD.E.128

summary

(个人感觉，提升也不大吗？也没有两倍和四倍的效果)

绝大部分情况，向量比标量好， increase bandwidth, reduce instruction count, and reduce latency. 。

但是会增加额外的寄存器(SASS里也没有看到？？)和降低并行性(什么意思？？？)

参考文献

https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/#entry-content-comments

Posted 2022-05-21Updated 2025-01-30Tutorials4 minutes read (About 578 words)

Cuda Optimize : Stencil

课程报告PPT

有对应的PPT，代码。

最终将1000ms程序优化到1~2ms

乔良师兄有根据知乎介绍如何利用寄存器文件缓存

SMEM难点: 跨线程访存

不仅每个线程需要访问自己划分对应区域之外的元素
而且访问的总个数也不是线程数对应的倍数

导致Embarrassingly Parallel Problems

1D 梯度计算 Stencil实例

计算某点的梯度，需要前后的function值。

Halo/Ghost Cells 光晕

问题:
对于边界上的cells，需要访问相邻区域的元素。

解决办法:
将他们也加入进当前block的SMEM

Indexing with Halo Cells

Stencil问题的半径 radius (RAD) 是边缘元素需要的某方向的额外元素
1. 在梯度的例子里是1
SMEM声明的大小，需要在每个维度上都增加 2*RAD的个数
这导致SMEM的index的每个维度需要增加RAD. s_idx = threadIdx.x + RAD;

code

int main() {
    const float PI = 3.1415927;
    const int N = 150;
    const float h = 2 * PI / N;
    float x[N] = { 0.0 };
    float u[N] = { 0.0 };
    float result_parallel[N] = { 0.0 };
    for (int i = 0; i < N; ++i) {
        x[i] = 2 * PI*i / N;
        u[i] = sinf(x[i]);
    }
    ddParallel(result_parallel, u, N, h);
}

Kernel Launching

#define TPB 64
#define RAD 1 // radius of the stencil
…
void ddParallel(float *out, const float *in, int n, float h) {
    float *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, in, n * sizeof(float), cudaMemcpyHostToDevice);

    // Set shared memory size in bytes
    const size_t smemSize = (TPB + 2 * RAD) * sizeof(float);
    ddKernel<<<(n + TPB - 1)/TPB, TPB, smemSize>>>(d_out, d_in, n, h);
    cudaMemcpy(out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}

Kernel Definition

__global__ void ddKernel(float *d_out, const float *d_in, int size, float h) {
    const int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i >= size) return;
  
    const int s_idx = threadIdx.x + RAD;
    extern __shared__ float s_in[];

    // Regular cells
    s_in[s_idx] = d_in[i];
    // Halo cells
    if (threadIdx.x < RAD) {
        s_in[s_idx - RAD] = d_in[i - RAD];
        s_in[s_idx + blockDim.x] = d_in[i + blockDim.x];
    }
    __syncthreads();
    d_out[i] = (s_in[s_idx-1] - 2.f*s_in[s_idx] + s_in[s_idx+1])/(h*h);
}

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

研一下USTC并行计算自己的选题

参考文献

https://dumas.ccsd.cnrs.fr/dumas-00636254/document

https://indico.fysik.su.se/event/6743/contributions/10338/attachments/4175/4801/4.CUDA-StencilsSharedMemory-Markidis.pdf

Posted 2022-05-14Updated 2025-01-30Tutorials18 minutes read (About 2668 words)

Cuda Optimize

Outline

General optimization guidance
1. Coalescing memory operations
2. Occupancy and latency hiding
3. Using shared memory
Example 1: transpose
1. Coalescing and bank conflict avoidance
Example 2: efficient parallel reductions
1. Using peak performance metrics to guide optimization
2. Avoiding SIMD divergence & bank conflicts
3. Loop unrolling
4. Using template parameters to write general-yet-optimized code
5. Algorithmic strategy: Cost efficiency

CUDA 优化策略

基础

最大化并行独立性
最大化计算密度

减少数据传输

数据可以直接在GPU生成。
一次大传输也比分开的小批次快

zerocopy

如果我们数据只会在 GPU 产生和使用，我们不需要来回进行拷贝。

https://migocpp.wordpress.com/2018/06/08/cuda-memory-access-global-zero-copy-unified/

简而言之，在 host 使用命令：cudaHostRegisterMapped
之后用 cudaHostGetDevicePointer 进行映射
最后解除绑定 cudaHostUnregister

即，

// First, pin the memory (or cudaHostAlloc instead)
cudaHostRegister(h_a, …, cudaHostRegisterMapped);
cudaHostRegister(h_b, …, cudaHostRegisterMapped);
cudaHostRegister(h_c, …, cudaHostRegisterMapped);

cudaHostGetDevicePointer(&a, h_a, 0);
cudaHostGetDevicePointer(&b, h_b, 0);
cudaHostGetDevicePointer(&c, h_c, 0);

kernel<<<...>>>(a, b, c);
cudaDeviceSynchronize();

// unpin/release host memory
cudaHostUnregister(h_a);
cudaHostUnregister(h_b);
cudaHostUnregister(h_c);

cuda warp shuffle

只要两个thread在同一个warp中，允许thread直接读其他thread的寄存器值，这种比通过shared Memory进行thread间的通讯效果更好，latency更低，同时也不消耗额外的内存资源来执行数据交换。ref

访存连续性

Optimize for spatial locality in cached texture memory ???
避免bank conflict：如果没有bank冲突的话，共享内存的访存速度将会非常的快，大约比全局内存的访问延迟低100多倍，但是速度没有寄存器快。然而，如果在使用共享内存时发生了bank冲突的话，性能将会降低很多很多。

Global Memory：coalesced access

对齐(Starting address for a region must be a multiple of region size)集体访问，有数量级的差异Coalesced

利用好每个block里的thread，全部每个线程各自读取自己对齐(Starting address for a region must be a multiple of region size 不一定是自己用的)数据到shared memory开辟的总空间。由于需要的数据全部合力读取进来了，计算时正常使用需要的读入的数据。

特别是对于结构体使用SoA(structure of arrays)而不是AoS（array of structures），
如果结构体实在不能对齐, 可以使用 __align(X), where X = 4, 8, or 16.强制对齐。

对齐读取 float3 code

对于small Kernel和访存瓶颈的Kernel影响很大

由于需要对齐读取，3float是12字节，所以只能拆成三份。

有无采用对齐shared读取，有10倍的加速。

利用好Shared Memory

比globalMemory快百倍
可以来避免 non-Coalesced access
SM的线程可以共享
Use one / a few threads to load / compute data shared by all threads

隐藏延迟的方法

增加SM上线程数量，
block数> SM数，这样所有的multiprocessors至少有一个block执行
threads/block>128 。原因：机器上一般有最多4个Warp调度器=4*32=128
threadsInblock=N*WarpSize=N*32
在 SM 上的 TB 越多越好，让 Thread Block 不停的跑我们的利用率就会高。
但是如果 Thread Block 太多，我们每一个 SM 能分配的寄存器就会变少，所以就会发生 Register Spill, 使用更高级的 L1、L2 Cache 去代替 Registers。所以 TB 不能太多，需要减少 Register Spill 的次数。
1. 资源占用率不要太高（最多一半？
多使用 __syncthreads
最好的参数需要self-tuning出来

占用率高不一定是好事

占用率是指每个多处理器（Streaming Multiprocessor，SM）的实际的活动warps数量与最大理论的warps数量的比率。
高的占用率不一定能提升性能，因为这一般意味着每个线程分配的寄存器和shared memory变少。但低的占用率会导致内存延迟无法隐藏。

实际需要计算每个线程大概需要的shared memory和register数量

实际例子测试-待研究

https://www.cnblogs.com/1024incn/p/4541313.html

https://www.cnblogs.com/1024incn/p/4545265.html

优化实例1 - 矩阵转置

通过SMEM实现coalescing access

原本代码

_global__ void transpose_naive(float *odata, float *idata, int width, int height)
{
   unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
   unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
   if (xIndex < width && yIndex < height)
   {
      unsigned int index_in = xIndex + width * yIndex;
      unsigned int index_out = yIndex + height * xIndex;
      odata[index_out] = idata[index_in]; 
   }
}

思想：将大矩阵划分成方块，并且存储在SMEM里。不仅SMEM速度更快，而且每行元素个数变少，跨行访问的间距变小，局部性增强。而且对于大矩阵加速效果会更明显。

__global__ void transpose(float *odata, float *idata, int width, int height)
{
   __shared__ float block[BLOCK_DIM*BLOCK_DIM];
   unsigned int xBlock = blockDim.x * blockIdx.x;
   unsigned int yBlock = blockDim.y * blockIdx.y;
   unsigned int xIndex = xBlock + threadIdx.x;
   unsigned int yIndex = yBlock + threadIdx.y;
   unsigned int index_out, index_transpose;
   if (xIndex < width && yIndex < height)
   {
      unsigned int index_in = width * yIndex + xIndex;
      unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;
      block[index_block] = idata[index_in];
      index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
      index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
   }
   __syncthreads();
   if (xIndex < width && yIndex < height)
      odata[index_out] = block[index_transpose]
}

coalescing access

when Block/tile dimensions are multiples of 16 ???

关于bank conflict

https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/

对于一个32 × 32个元素的共享内存块，一列数据中的所有元素都映射到相同的SMEM bank ，导致bank conflict 的最坏情况:读取一列数据会导致32路的存储库冲突。

幸运的是，只需要将tile的元素宽度改为33，而不是32就行。

优化实例2 - 数据归约

具体问题：将长数组的所有元素，归约求和为一个结果。[^1][^2]

总体思路

为了避免全局同步的巨大开销，采取分级归约

由于归约的计算密度低
1 flop per element loaded (bandwidth-optimal)

所以优化目标是将访存带宽用满。

1 2	384-bit memory interface, 900 MHz DDR 384 * 1800 / 8 = 86.4 GB/s

step0 : baseline - Interleaved Addressing 交错/间隔寻址

__global__ void reduce0(int *g_idata, int *g_odata) {
   extern __shared__ int sdata[];

   // each thread loads one element from global to shared mem
   unsigned int tid = threadIdx.x;
   unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
   sdata[tid] = g_idata[i];
   __syncthreads();

   // do reduction in shared mem
   for(unsigned int s=1; s < blockDim.x; s *= 2) {
      if (tid % (s) == 0) {
         sdata[tid] += sdata[tid + s];
      }
      __syncthreads();
   }

   // write result for this block to global mem
   if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

工作的线程越来越少。一开始是全部，最后一次只有thread0.

Step1 : 使用连续的index

Just replace divergent branch With strided index and non-divergent branch，但是会带来bank conflict。

原理和Warp发射有关，假如在这里每个Warp并行的线程是2。一个Warp运行耗时为T.

Step0: 4+4+2+1=11T

Step1: 4+2+1+1=8T

for (unsigned int s=1; s < blockDim.x; s *= 2) {
   int index = 2 * s * tid;
   if (index < blockDim.x) {
      sdata[index] += sdata[index + s];
   }
   __syncthreads();
}

Step2: 连续寻址

for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
   if (tid < s) {
      sdata[tid] += sdata[tid + s];
   }
   __syncthreads();
}

原本寻址

现在寻址有一边连续了

Step3 : 弥补浪费的线程

方法：在load SMEM的时候提前做一次规约加法，通过减少一半的block数，将原本两个block里的值load+add存储在sum里。

// perform first level of reduction,
// reading from global memory, writing to shared memory
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
__syncthreads();

step4 : Unrolling the Last Warp

当s< 32的时候，就只有一个Warp工作了。

使用warp的SIMD还省去了__syncthreads()的麻烦

for (unsigned int s=blockDim.x/2; s>32; s>>=1) 
{
   if (tid < s)
      sdata[tid] += sdata[tid + s];
   __syncthreads();
}
if (tid < 32)
{
   sdata[tid] += sdata[tid + 32]; 
   sdata[tid] += sdata[tid + 16]; 
   sdata[tid] += sdata[tid + 8]; 
   sdata[tid] += sdata[tid + 4]; 
   sdata[tid] += sdata[tid + 2]; 
   sdata[tid] += sdata[tid + 1]; 
}

为了保持整洁，最后一个if还做了无效的计算。eg, Warp里的最后一个线程只有第一句命令有用。

Step5 : 根据blockSize完全展开for和去除代码

由于for循环里是二分的，而且小于32的单独处理了，导致for循环里实际运行代码最多就3句。

利用代码模板和编译器的自动优化实现：

1 2	template <unsigned int blockSize> __global__ void reduce5(int g_idata, int g_odata)

红色代码会在编译时自动优化。

step6 ：归并算法优化

加速级联？？

Cost= processors × time complexity

我们知道N个元素直接二叉树归约是O(log N)
时间 Cost=N*O(log N).

但是假如只有P个线程先做N/P的串行加法, 然后是log(P)的归约。
总cost=P(N/P+log(P))

当P=N/log(N), cost=O(N)

each thread should sum O(log n) elements来设置

比如，1024 or 2048 elements per block vs. 256 线程。每个sum n=4个元素。具体参数要perf

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n) {
   sdata[tid] += g_idata[i] + g_idata[i+blockSize];
   i += gridSize;
}
__syncthreads();

final code

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
{
   extern __shared__ int sdata[];

   unsigned int tid = threadIdx.x;
   unsigned int i = blockIdx.x*(blockSize*2) + tid;
   unsigned int gridSize = blockSize*2*gridDim.x;
   sdata[tid] = 0;

   do { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; } while (i < n);
   __syncthreads();

   if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
   if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
   if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }

   if (tid < 32) {
      if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
      if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
      if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
      if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
      if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
      if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
   }
   if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

关于if语句的补充

有if语句是没问题的，只要运行的时候全部执行if或者else就行。不要有些执行if，有些执行else，这才会等待。

说不定也不是全部执行if或者else就行，只需要连续32个Thread Index，是相同的执行就行。（猜想，需要测试。

关于延迟隐藏

通过增加block里的线程数，并且同时读取来隐藏延迟。不仅可以隐藏Global Memory的延迟，还可以隐藏写后读的延迟

线程资源查看

线程太多会导致分配到每一个的寄存器和SMEM变少

通过编译时加-cubin选项，.cubin文件前几行会显示

architecture {sm_10}
abiversion {0}
modname {cubin}
code {
   name = BlackScholesGPU
   lmem = 0    # per thread local memory
   smem = 68   # per thread block shared memory
   reg = 20    # per thread registers

参考文献

[^1]: SC07 Optimizing Parallel Reduction in CUDA - Mark Harris

[^2]: 2009 清华邓仰东 cuda lecture pdf 注意也是参考的SC07 Nvidia。

Posted 2022-05-12Updated 2025-01-30Tutorials3 minutes read (About 424 words)

Nvprof

安装

1 2	$ which nvprof /usr/local/cuda/bin/nvprof

基本使用

摘要模式

命令行直接运行

1	nvprof ./myApp

跟踪API

1	nvprof --print-gpu-trace ./myApp

保存在log里

1	sudo /usr/local/cuda/bin/nvprof --log-file a.log --metrics achieved_occupancy /staff/shaojiemike/github/cutests/22-commonstencil/common

可视化

nsight可以直接在远程机器上运行
1. ssh -X host
2. .ssh/config
  1. add
  2. XAuthLocation /opt/X11/bin/xauth #for macbookAir
  3. ForwardX11Trusted yes
  4. ForwardX11 yes
Visual Profiler也可以ssh直接连接远程机器
或者导出分析结果以便可视化, 在Visual Profiler使用

1 2	nvprof --export-profile timeline.prof <app> <app args> nvprof --analysis-metrics -o nbody-analysis.nvprof ./myApp

profile kernel

1	sudo /usr/local/cuda/bin/ncu -k stencil_kernel -s 0 -c 1 /staff/shaojiemike/github/cutests/22-commonstencil/best

ncu-ui是可视化界面，但是没弄懂

带宽profile

上限测量

# shaojiemike @ snode0 in ~/github/cuda-samples-11.0 [16:02:08]                                                                                                                                                                      $ ./bin/x86_64/linux/release/bandwidthTest                                                                                                                                                                                           [CUDA Bandwidth Test] - Starting...                                                                                                                                                                                                  Running on...                                                                                                                                                                                                                                                                                                                                                                                                                                                              Device 0: Tesla P40                                                                                                                                                                                                                  Quick Mode                                                                                                                                                                                                                                                                                                                                                                                                                                                                Host to Device Bandwidth, 1 Device(s)                                                                                                                                                                                                PINNED Memory Transfers                                                                                                                                                                                                                Transfer Size (Bytes)        Bandwidth(GB/s)                                                                                                                                                                                         32000000                     11.8                                                                                                                                                                                                                                                                                                                                                                                                                                       Device to Host Bandwidth, 1 Device(s)                                                                                                                                                                                                PINNED Memory Transfers                                                                                                                                                                                                                Transfer Size (Bytes)        Bandwidth(GB/s)                                                                                                                                                                                         32000000                     13.0                                                                                                                                                                                                                                                                                                                                                                                                                                       Device to Device Bandwidth, 1 Device(s)                                                                                                                                                                                              PINNED Memory Transfers                                                                                                                                                                                                                Transfer Size (Bytes)        Bandwidth(GB/s)                                                                                                                                                                                         32000000                     244.3                                                                                                                                                                                                                                                                                                                                                                                                                                     Result = PASS                                                                                                                                                                                                                                                                                                                                                                                                                                                             NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.                                                                                                                                                                                       # shaojiemike @ snode0 in ~/github/cuda-samples-11.0 [16:03:24]                                                                                                                                                                      $ ./bin/x86_64/linux/release/p2pBandwidthLatencyTest

实际值

nvprof通过指定与dram，L1或者L2 的metrics来实现。具体解释可以参考官网

在 Maxwell 和之后的架构中 L1 和 SMEM 合并

Metric Name	解释
achieved_occupancy	活跃cycle是 Warps 活跃的比例
dram_read_throughput
dram_utilization	在0到10的范围内，相对于峰值利用率，设备内存的利用率水平
shared_load_throughput
shared_utilization
l2_utilization

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

无