Nvprof

Installation

$ which nvprof 
/usr/local/cuda/bin/nvprof
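
If which nvprof prints nothing, the CUDA bin directory is probably not on PATH. A minimal sketch that looks for the binary on PATH first and then under the toolkit directory; the helper name find_nvprof and the CUDA_HOME fallback variable are assumptions made for illustration, not something the toolkit defines.

```shell
# Locate nvprof: prefer PATH, then fall back to the toolkit directory.
# CUDA_HOME is an assumed convention; /usr/local/cuda is the path shown above.
find_nvprof() {
    if command -v nvprof >/dev/null 2>&1; then
        command -v nvprof
    elif [ -x "${CUDA_HOME:-/usr/local/cuda}/bin/nvprof" ]; then
        printf '%s\n' "${CUDA_HOME:-/usr/local/cuda}/bin/nvprof"
    else
        echo "nvprof not found; add the CUDA bin directory to PATH" >&2
        return 1
    fi
}

# Usage: NVPROF=$(find_nvprof) && "$NVPROF" --version
```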

Basic usage

Summary mode

Run the application directly under nvprof from the command line:

nvprof ./myApp

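GPU trace mode

--print-gpu-trace lists every kernel launch and memory copy in chronological order; use --print-api-trace instead to trace CUDA API calls.
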
nvprof --print-gpu-trace ./myApp

Saving output to a log file

sudo /usr/local/cuda/bin/nvprof --log-file a.log --metrics achieved_occupancy /staff/shaojiemike/github/cutests/22-commonstencil/common
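
Several metrics can be collected in one run by passing a comma-separated list to --metrics, and %p in the log file name is replaced by the process ID of the profiled application, so different runs go to different files. A sketch, assuming a hypothetical wrapper named profile_metrics and an NVPROF variable that lets the binary be swapped out; sudo may be needed as in the command above, since reading GPU performance counters is often restricted to administrators.

```shell
# Collect one or more metrics into a per-process log file.
# profile_metrics and NVPROF are illustrative names, not part of nvprof.
profile_metrics() {
    metrics=$1   # comma-separated, e.g. achieved_occupancy,dram_read_throughput
    shift        # remaining arguments: the application and its arguments
    "${NVPROF:-nvprof}" --log-file "nvprof.%p.log" --metrics "$metrics" "$@"
}

# Usage: profile_metrics achieved_occupancy,dram_read_throughput ./myApp
```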

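Visualization

  1. Nsight can be run directly on the remote machine over X11 forwarding
    1. ssh -X host
    2. or configure ~/.ssh/config
      1. add the following lines
      2. XAuthLocation /opt/X11/bin/xauth # for MacBook Air
      3. ForwardX11Trusted yes
      4. ForwardX11 yes
  2. Visual Profiler can also connect to the remote machine directly over SSH
  3. Or export the profiling results and open them in Visual Profiler: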
nvprof --export-profile timeline.prof <app> <app args>
nvprof --analysis-metrics -o nbody-analysis.nvprof ./myApp
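
The X11 forwarding settings listed under item 1 above belong in a Host block of ~/.ssh/config on the local machine. A sketch that appends such a block; the function name add_x11_host and the host alias passed to it are placeholders, and the XAuthLocation line is only relevant on a Mac, as noted in the list.

```shell
# Append an X11-forwarding Host block to an ssh config file.
# add_x11_host is a hypothetical helper; the alias is whatever you pass in.
add_x11_host() {
    host=$1
    config=${2:-$HOME/.ssh/config}
    cat >> "$config" <<EOF

Host $host
    ForwardX11 yes
    ForwardX11Trusted yes
    XAuthLocation /opt/X11/bin/xauth
EOF
}

# Usage: add_x11_host gpu-server && ssh gpu-server
```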

Profiling a kernel

sudo /usr/local/cuda/bin/ncu -k stencil_kernel -s 0 -c 1 /staff/shaojiemike/github/cutests/22-commonstencil/best

ncu-ui is the graphical interface, but I have not figured out how to use it yet.
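
In the ncu command above, -k selects kernels by name, -s skips that many matching launches, and -c limits how many launches are profiled. Adding -o writes the result to a report file that ncu-ui can open. A sketch, assuming a hypothetical wrapper profile_kernel and an NCU variable for the binary; the report name is arbitrary.

```shell
# Profile the first launch of one kernel and save a report for ncu-ui.
# profile_kernel and NCU are illustrative names, not part of Nsight Compute.
profile_kernel() {
    kernel=$1    # kernel function name, e.g. stencil_kernel
    report=$2    # output name; recent ncu versions append .ncu-rep
    shift 2      # remaining arguments: the application and its arguments
    "${NCU:-ncu}" -k "$kernel" -s 0 -c 1 -o "$report" "$@"
}

# Usage: profile_kernel stencil_kernel stencil ./best
#        ncu-ui stencil.ncu-rep   (extension may differ on older versions)
```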

Bandwidth profiling

Measuring the upper bound

# shaojiemike @ snode0 in ~/github/cuda-samples-11.0 [16:02:08]
$ ./bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P40
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     11.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     13.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     244.3

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

# shaojiemike @ snode0 in ~/github/cuda-samples-11.0 [16:03:24]
$ ./bin/x86_64/linux/release/p2pBandwidthLatencyTest
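
To reuse these peak numbers in a script, the bandwidth column can be pulled out of the bandwidthTest output. A sketch using awk, assuming the output layout of the bandwidthTest run above (a section title line followed later by a row holding the transfer size and the bandwidth); the function name peak_bandwidth is made up for illustration.

```shell
# Print the bandwidth (GB/s) reported for one bandwidthTest section.
# Reads bandwidthTest output on stdin; $1 is the section title prefix,
# e.g. "Device to Device". peak_bandwidth is a hypothetical helper.
peak_bandwidth() {
    awk -v section="$1" '
        index($0, section " Bandwidth") { in_section = 1; next }
        # First purely numeric two-column row after the title is the result.
        in_section && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9.]+$/ { print $2; exit }
    '
}

# Usage: ./bin/x86_64/linux/release/bandwidthTest | peak_bandwidth "Device to Device"
```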

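Measured values

nvprof measures the bandwidth actually achieved through metrics for DRAM, L1, or L2. See the official documentation for the meaning of each metric.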

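On Maxwell and Pascal, L1 is merged with the texture cache while shared memory (SMEM) is a separate unit; from Volta onward, L1 and SMEM share one unified on-chip block.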

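| Metric name | Explanation |
| --- | --- |
| achieved_occupancy | Ratio of the average number of active warps per active cycle to the maximum number of warps supported on a multiprocessor |
| dram_read_throughput | Device memory read throughput |
| dram_utilization | Utilization level of device memory relative to peak utilization, on a scale of 0 to 10 |
| shared_load_throughput | Shared memory load throughput |
| shared_utilization | Utilization level of shared memory relative to peak utilization, on a scale of 0 to 10 |
| l2_utilization | Utilization level of the L2 cache relative to peak utilization, on a scale of 0 to 10 |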
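
Comparing a measured throughput with the bandwidthTest peak gives a rough utilization figure. A sketch of that arithmetic; percent_of_peak is a hypothetical helper, and the 180 GB/s in the usage line is a placeholder, not a measurement. One assumption worth checking against the sample source: the device-to-device figure appears to count both the read and the write of each copy, so it is only an approximate ceiling for a read-only metric such as dram_read_throughput.

```shell
# Percentage of peak bandwidth reached by a measured throughput.
# Both arguments must be in the same unit (e.g. GB/s). Hypothetical helper.
percent_of_peak() {
    awk -v measured="$1" -v peak="$2" 'BEGIN {
        if (peak <= 0) { print "peak must be positive" > "/dev/stderr"; exit 1 }
        printf "%.1f\n", 100 * measured / peak
    }'
}

# Usage: percent_of_peak 180 244.3
```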

Further study needed

None yet.

Problems encountered

None yet.

Motivation, summary, reflections, rants

References

Author

Shaojie Tan

Posted on

2022-05-12

Updated on

2025-01-30

Licensed under