SHAOJIE'S BOOK

Posted 2022-05-22 | Updated 2026-02-18 | Architecture | 5 minutes read (About 785 words)

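Two Kinds of Assembly

Parallel Thread Execution (PTX): how does it relate to inline assembly?
1. PTX is the lowest-level assembly that programmers can work with directly, because the SASS implementation changes frequently from one GPU architecture to the next.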
2. https://docs.nvidia.com/cuda//pdf/Inline_PTX_Assembly.pdf
3. ISA instruction reference: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#instruction-set
SASS
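1. Streaming ASSembly (or Shader Assembly?); there is no official confirmation of what the name stands for.
2. There is no detailed official manual, only a basic introduction: https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#ampere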
3. https://zhuanlan.zhihu.com/p/161624982
4. Disassembling SASS from an executable
  1. https://www.findhao.net/easycoding/2339.html

SASS instruction basics

For the Ampere architecture

Operand order

(instruction) (destination) (source1), (source2) ...

Register notation

RX for registers
URX for uniform registers
SRX for special system-controlled registers
PX for predicate registers
c[X][Y] for constant memory

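SASS example 1

The hard part of SASS is the instruction suffixes. Because no manual is available, they have to be read alongside the corresponding PTX suffixes.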

/*0028*/         IMAD R6.CC, R3, R5, c[0x0][0x20]; 
/*0030*/         IMAD.HI.X R7, R3, R5, c[0x0][0x24]; 
/*0040*/         LD.E R2, [R6]; //load

line1

/*0028*/ IMAD R6.CC, R3, R5, c[0x0][0x20];

Extended-precision integer multiply-add: multiply R3 with R5, sum with constant in bank 0, offset 0x20, store in R6 with carry-out.

c[BANK][ADDR] is a constant memory reference.

.CC means “set the flags”

line2

/*0030*/ IMAD.HI.X R7, R3, R5, c[0x0][0x24];

Integer multiply-add with extract: multiply R3 with R5, extract upper half, sum that upper half with constant in bank 0, offset 0x24, store in R7 with carry-in.

line3

/*0040*/ LD.E R2, [R6]; //load

LD.E is a load from global memory using the 64-bit address in R6,R7 (the operand is written as R6, but it really denotes the address pair formed by R6 and R7).

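summary

R6 = R3*R5 + c[0x0][0x20], saving carry to CC
R7 = ((R3*R5) >> 32) + c[0x0][0x24] + CC
R2 = *(R7<<32 + R6)

The registers are 32 bits wide because an SMEM bank is 4 bytes wide. The c array therefore stores the 64-bit base address as two separate 32-bit halves.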

The first two instructions multiply two 32-bit values (R3 and R5) and add the 64-bit value c[0x0][0x24]<<32+c[0x0][0x20], leaving the 64-bit address result in the R6,R7 pair.

The corresponding code is

kernel f (uint32* x) // 64-bit pointer
{
   R2 = x[R3*R5]
}
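The carry chain between the two IMAD instructions can be checked with a small CPU-side model. This is only a sketch: the function names are hypothetical, and it assumes both operands are treated as unsigned 32-bit values, which the notes above do not state.

```python
MASK32 = 0xFFFFFFFF

def imad_cc(a, b, c_lo):
    """Model of IMAD Rd.CC: low 32 bits of a*b + c_lo, plus the carry-out."""
    total = ((a * b) & MASK32) + c_lo
    return total & MASK32, total >> 32

def imad_hi_x(a, b, c_hi, carry_in):
    """Model of IMAD.HI.X: upper 32 bits of a*b, plus c_hi, plus the carry-in."""
    return (((a * b) >> 32) + c_hi + carry_in) & MASK32

def address_64(r3, r5, base):
    """Combine both halves into the 64-bit address held in the R6,R7 pair."""
    r6, cc = imad_cc(r3, r5, base & MASK32)   # base & MASK32 plays c[0x0][0x20]
    r7 = imad_hi_x(r3, r5, base >> 32, cc)    # base >> 32 plays c[0x0][0x24]
    return (r7 << 32) | r6
```

Under these assumptions `address_64(r3, r5, base)` equals `(r3 * r5 + base)` modulo 2^64, which is the 64-bit multiply-add that the two 32-bit instructions implement together.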

SASS opcode analysis 2

LDG - Load from Global Memory
ULDC - Load from Constant Memory into Uniform register
USHF - Uniform Funnel Shift (presumably a special accelerated shift)
STS - Store within Local or Shared Window

Pipelined STS

Observed offsets

4
2060(delta=2056)
4116(delta=2056)
8228(delta=2 * 2056)
6172(delta=-1 * 2056)
10284(delta=2 * 2056)
12340(delta=2056)

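The assembly evidently has two stores swapped in the middle, so the offsets are no longer in sequence; otherwise more latency could have been hidden.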
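The deltas listed above can be recomputed mechanically. The sketch below uses hypothetical helper names and simply flags any store whose offset is lower than the one before it.

```python
def deltas(offsets):
    """Difference between each offset and the one before it."""
    return [b - a for a, b in zip(offsets, offsets[1:])]

def out_of_order_positions(offsets):
    """Positions (0-based) of stores whose offset is lower than the previous one."""
    return [i + 1 for i, d in enumerate(deltas(offsets)) if d < 0]

# STS offsets in the order they appear in the assembly
sts_offsets = [4, 2060, 4116, 8228, 6172, 10284, 12340]
```

Sorting the offsets gives a uniform stride of 2056 bytes, so only the two stores around position 4 are out of sequence.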

Where the STS source registers come from

So where do these registers come from? It looks like the order was simply swapped.

IMAD.WIDE.U32 R16, R16, R19, c[0x0][0x168] 
LDG.E R27, [R16.64] 
IMAD.WIDE R30, R19, c[0x0][0x164], R16 
LDG.E R31, [R30.64] 
IMAD.WIDE R32, R19, c[0x0][0x164], R30 
LDG.E R39, [R32.64] 
# important R41 R37
IMAD.WIDE R34, R19, c[0x0][0x164], R32 
IMAD.WIDE R40, R19, c[0x0][0x164], R34 
LDG.E R41, [R40.64] 
LDG.E R37, [R34.64]

Fix

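The cause is that the earlier code was unrolled by hand; leaving the for loop for the compiler to unroll automatically avoids the problem.

Further study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants~~

References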

https://forums.developer.nvidia.com/t/solved-sass-code-analysis/41167/2

https://stackoverflow.com/questions/35055014/how-to-understand-the-result-of-sass-analysis-in-cuda-gpu

Posted 2022-05-21 | Updated 2026-02-18 | Tutorials | 4 minutes read (About 578 words)

Cuda Optimize : Stencil

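Course report slides

Slides and code are available for this report.

The program was eventually optimized from 1000 ms down to 1~2 ms.

Qiao Liang, a senior labmate, explained how to use the register file as a cache, based on a Zhihu article.

SMEM difficulty: cross-thread memory access

Each thread needs to access elements outside the region assigned to it,
and the total number of elements accessed is not a multiple of the thread count.

This is why a stencil is not an embarrassingly parallel problem.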

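1D gradient stencil example

Computing the gradient at a point requires the function values on either side of it.

Halo/Ghost Cells

Problem:
Cells on the boundary of a block need elements from the neighboring region.

Solution:
Load those elements into the current block's SMEM as well.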

Indexing with Halo Cells

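The radius (RAD) of a stencil is the number of extra elements an edge element needs in one direction.
1. In the gradient example it is 1.
The declared SMEM size must grow by 2*RAD elements in every dimension.
As a result, the SMEM index must be shifted by RAD in every dimension: s_idx = threadIdx.x + RAD;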

code

int main() {
    const float PI = 3.1415927;
    const int N = 150;
    const float h = 2 * PI / N;
    float x[N] = { 0.0 };
    float u[N] = { 0.0 };
    float result_parallel[N] = { 0.0 };
    for (int i = 0; i < N; ++i) {
        x[i] = 2 * PI*i / N;
        u[i] = sinf(x[i]);
    }
    ddParallel(result_parallel, u, N, h);
}

Kernel Launching

#define TPB 64
#define RAD 1 // radius of the stencil
…
void ddParallel(float *out, const float *in, int n, float h) {
    float *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, in, n * sizeof(float), cudaMemcpyHostToDevice);

    // Set shared memory size in bytes
    const size_t smemSize = (TPB + 2 * RAD) * sizeof(float);
    ddKernel<<<(n + TPB - 1)/TPB, TPB, smemSize>>>(d_out, d_in, n, h);
    cudaMemcpy(out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}

Kernel Definition

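__global__ void ddKernel(float *d_out, const float *d_in, int size, float h) {
    const int i = threadIdx.x + blockDim.x * blockIdx.x;
    const int s_idx = threadIdx.x + RAD;
    extern __shared__ float s_in[];

    // Regular cells. Out-of-range threads store 0 instead of returning early,
    // so that every thread in the block reaches __syncthreads().
    s_in[s_idx] = (i < size) ? d_in[i] : 0.f;
    // Halo cells. Indices outside [0, size) are padded with 0
    // (zero padding at the domain boundary is an assumption).
    if (threadIdx.x < RAD) {
        const int left = i - RAD;
        const int right = i + blockDim.x;
        s_in[s_idx - RAD] = (left >= 0 && left < size) ? d_in[left] : 0.f;
        s_in[s_idx + blockDim.x] = (right < size) ? d_in[right] : 0.f;
    }
    __syncthreads();
    if (i < size)
        d_out[i] = (s_in[s_idx-1] - 2.f*s_in[s_idx] + s_in[s_idx+1])/(h*h);
}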
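The halo indexing is easy to get wrong, so it can help to replay it on the CPU before launching anything. The sketch below is a hypothetical Python model of the kernel's block, thread, and shared-tile structure; padding with zeros outside the array is an assumption made here so that the model never reads out of range.

```python
import math

TPB = 64   # threads per block, as in the launch code
RAD = 1    # stencil radius

def stencil_direct(u, h):
    """Reference: three-point stencil with zero padding outside the array."""
    n = len(u)
    get = lambda k: u[k] if 0 <= k < n else 0.0
    return [(get(i - 1) - 2.0 * u[i] + get(i + 1)) / (h * h) for i in range(n)]

def stencil_tiled(u, h):
    """Same stencil, computed block by block through a tile with halo cells."""
    n = len(u)
    out = [0.0] * n
    for block in range((n + TPB - 1) // TPB):
        s_in = [0.0] * (TPB + 2 * RAD)
        # Load phase: every "thread" fills its regular cell; the first RAD
        # threads also fill the left and right halo cells.
        for t in range(TPB):
            i = t + TPB * block
            s_idx = t + RAD
            s_in[s_idx] = u[i] if i < n else 0.0
            if t < RAD:
                left, right = i - RAD, i + TPB
                s_in[s_idx - RAD] = u[left] if 0 <= left < n else 0.0
                s_in[s_idx + TPB] = u[right] if right < n else 0.0
        # Compute phase: starts only after all loads, like __syncthreads().
        for t in range(TPB):
            i = t + TPB * block
            s_idx = t + RAD
            if i < n:
                out[i] = (s_in[s_idx - 1] - 2.0 * s_in[s_idx] + s_in[s_idx + 1]) / (h * h)
    return out
```

With the same N = 150 sine input as the host code, the tiled result matches the direct one, and the interior values approximate -sin(x), since this three-point formula is a second-derivative stencil.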

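Further study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants~~

A topic I chose myself for the USTC parallel computing course in the second semester of my first year of graduate study.

References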

https://dumas.ccsd.cnrs.fr/dumas-00636254/document

https://indico.fysik.su.se/event/6743/contributions/10338/attachments/4175/4801/4.CUDA-StencilsSharedMemory-Markidis.pdf

Posted 2022-05-16 | Updated 2026-02-18 | Tips | a minute read (About 170 words)

Postgraduate dormitory

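High-tech Zone campus dormitory (male)

West elevator lobby

Dormitory corridor

Inside the room

Washbasin

Shower room

Toilet (ours has turned into a storage closet)

Four-person room

Someone's spot in the dorm ~~(definitely not mine)~~

P.S. Photos taken while the lab was being renovated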

Posted 2022-05-16 | Updated 2026-02-18 | Tutorials | 21 minutes read (About 3198 words)

Css & Scss

CSS (Cascading Style Sheets) and SCSS (Sassy CSS) are both stylesheet languages used to define the appearance and layout of web pages.

Posted 2022-05-14 | Updated 2026-02-18 | Tutorials | 18 minutes read (About 2668 words)

Cuda Optimize

Outline

General optimization guidance
1. Coalescing memory operations
2. Occupancy and latency hiding
3. Using shared memory
Example 1: transpose
1. Coalescing and bank conflict avoidance
Example 2: efficient parallel reductions
1. Using peak performance metrics to guide optimization
2. Avoiding SIMD divergence & bank conflicts
3. Loop unrolling
4. Using template parameters to write general-yet-optimized code
5. Algorithmic strategy: Cost efficiency

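CUDA optimization strategies

Basics

Maximize independent parallelism
Maximize arithmetic intensity

Reduce data transfers

Data can be generated directly on the GPU.
One large transfer is also faster than several separate small ones.

zerocopy

If the data is only produced and consumed on the GPU, there is no need to copy it back and forth.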

https://migocpp.wordpress.com/2018/06/08/cuda-memory-access-global-zero-copy-unified/

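In short, on the host, pin the memory with cudaHostRegister using the cudaHostRegisterMapped flag,
then obtain the mapped device pointer with cudaHostGetDevicePointer,
and finally unpin it with cudaHostUnregister.

That is: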

// First, pin the memory (or cudaHostAlloc instead)
cudaHostRegister(h_a, …, cudaHostRegisterMapped);
cudaHostRegister(h_b, …, cudaHostRegisterMapped);
cudaHostRegister(h_c, …, cudaHostRegisterMapped);

cudaHostGetDevicePointer(&a, h_a, 0);
cudaHostGetDevicePointer(&b, h_b, 0);
cudaHostGetDevicePointer(&c, h_c, 0);

kernel<<<...>>>(a, b, c);
cudaDeviceSynchronize();

// unpin/release host memory
cudaHostUnregister(h_a);
cudaHostUnregister(h_b);
cudaHostUnregister(h_c);

cuda warp shuffle

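As long as two threads are in the same warp, one thread can read another thread's register value directly. This works better than communicating through shared memory: the latency is lower, and no extra memory is consumed for the exchange. (ref)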
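A shuffle-based warp sum can be modeled on the CPU to see why lane 0 ends up with the total. The sketch below is a Python stand-in for a `__shfl_down_sync` reduction loop, not device code; the helper names are hypothetical, and it assumes a lane whose source lane falls outside the warp gets its own value back.

```python
WARP_SIZE = 32

def shfl_down(values, delta):
    """Each lane reads the value held by lane + delta; out-of-range lanes keep their own."""
    return [values[lane + delta] if lane + delta < WARP_SIZE else values[lane]
            for lane in range(WARP_SIZE)]

def warp_reduce_sum(values):
    """Tree reduction over one warp using only register-to-register exchanges."""
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        vals = [v + s for v, s in zip(vals, shifted)]
        offset //= 2
    return vals[0]   # only lane 0 is guaranteed to hold the full sum
```

Five exchange rounds (offsets 16, 8, 4, 2, 1) are enough for 32 lanes, and no shared-memory buffer appears anywhere in the model.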

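Memory access contiguity

Optimize for spatial locality in cached texture memory ???
Avoid bank conflicts: without bank conflicts, shared memory access is very fast, with latency roughly 100 times lower than global memory, though still slower than registers. If bank conflicts do occur when using shared memory, performance drops sharply.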

Global Memory：coalesced access

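Aligned (starting address for a region must be a multiple of region size), collective access makes an order-of-magnitude difference: this is coalesced access.

Make good use of every thread in the block: each thread reads its own aligned piece of data (starting address for a region must be a multiple of region size; the piece is not necessarily the one that thread will use) into the space allocated in shared memory. Once the threads have jointly loaded everything that is needed, the computation simply uses the loaded data.

For structs in particular, use SoA (structure of arrays) rather than AoS (array of structures).
If a struct really cannot be aligned, __align__(X), where X = 4, 8, or 16, forces the alignment.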
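The effect of the data layout on coalescing can be estimated by counting how many memory segments one warp touches when every thread reads the same field. This is a rough sketch: the 128-byte segment size and the helper names are assumptions, not measured hardware behavior.

```python
SEGMENT_BYTES = 128   # assumed size of one memory transaction
WARP_SIZE = 32
FLOAT_BYTES = 4

def segments_touched(addresses):
    """Number of distinct segments covered by a warp's byte addresses."""
    return len({addr // SEGMENT_BYTES for addr in addresses})

def aos_addresses(field, fields_per_struct):
    """AoS: thread t reads field `field` of struct t."""
    return [(t * fields_per_struct + field) * FLOAT_BYTES for t in range(WARP_SIZE)]

def soa_addresses(field, n_structs):
    """SoA: thread t reads element t of the array holding field `field`."""
    return [(field * n_structs + t) * FLOAT_BYTES for t in range(WARP_SIZE)]
```

For a struct of four floats, the AoS read spans 4 segments while the SoA read spans 1, which is the gap that the SoA advice is meant to close.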

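Aligned float3 reads (code)

This has a large impact on small kernels and on memory-bound kernels.

Because reads must be aligned and a float3 is 12 bytes, the read has to be split into three parts.

Reading through shared memory with aligned accesses gives about a 10x speedup over not doing so.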

Make good use of shared memory

About a hundred times faster than global memory
Can be used to avoid non-coalesced access
Can be shared among the threads running on an SM
Use one / a few threads to load / compute data shared by all threads

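Ways to hide latency

Increase the number of threads on each SM.
Number of blocks > number of SMs, so that every multiprocessor has at least one block to execute.
threads/block > 128. Reason: an SM typically has at most 4 warp schedulers, and 4*32 = 128.
threadsInBlock = N*WarpSize = N*32
The more thread blocks (TBs) on an SM the better: keeping thread blocks running continuously raises utilization.
But if there are too many thread blocks, each SM has fewer registers to hand out per thread, so register spilling occurs and the L1/L2 caches are used in place of registers. The number of TBs therefore must not be too large, so that register spills are kept down.
1. Resource usage should not be too high (at most half?)
Use __syncthreads more
The best parameters have to be found by self-tuning

High occupancy is not necessarily a good thing

Occupancy is the ratio of the number of warps actually active on each streaming multiprocessor (SM) to the theoretical maximum number of warps.
High occupancy does not necessarily improve performance, because it usually means fewer registers and less shared memory per thread. But low occupancy leaves memory latency unhidden.

In practice, work out roughly how much shared memory and how many registers each thread needs.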
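A back-of-the-envelope occupancy estimate can be written down directly from these limits. The sketch below is hypothetical: the default per-SM limits are illustrative assumptions rather than values for any particular GPU, and real hardware also rounds register and shared-memory allocations to a granularity that is ignored here.

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_warps_per_sm=64, regs_per_sm=65536, smem_per_sm=49152,
              max_blocks_per_sm=32, warp_size=32):
    """Estimated active warps / maximum warps per SM (all limits are assumed values)."""
    warps_per_block = -(-threads_per_block // warp_size)   # ceiling division
    by_warps = max_warps_per_sm // warps_per_block
    by_regs = regs_per_sm // (regs_per_thread * warps_per_block * warp_size)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks_per_sm
    blocks = min(by_warps, by_regs, by_smem, max_blocks_per_sm)
    return blocks * warps_per_block / max_warps_per_sm
```

Under these assumed limits, a 256-thread block at 32 registers per thread reaches full occupancy, doubling the registers to 64 halves it, and requesting 16 KB of shared memory per block caps it at 3 resident blocks.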

Real-world test examples (to be studied)

https://www.cnblogs.com/1024incn/p/4541313.html

https://www.cnblogs.com/1024incn/p/4545265.html

Optimization example 1: matrix transpose

Achieving coalesced access through SMEM

Original code

__global__ void transpose_naive(float *odata, float *idata, int width, int height)
{
   unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
   unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
   if (xIndex < width && yIndex < height)
   {
      unsigned int index_in = xIndex + width * yIndex;
      unsigned int index_out = yIndex + height * xIndex;
      odata[index_out] = idata[index_in]; 
   }
}

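Idea: split the large matrix into square tiles and stage them in SMEM. Not only is SMEM faster, but each tile row has fewer elements, so the stride between rows is smaller and locality improves. The speedup is also more pronounced for large matrices.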

__global__ void transpose(float *odata, float *idata, int width, int height)
{
   __shared__ float block[BLOCK_DIM*BLOCK_DIM];
   unsigned int xBlock = blockDim.x * blockIdx.x;
   unsigned int yBlock = blockDim.y * blockIdx.y;
   unsigned int xIndex = xBlock + threadIdx.x;
   unsigned int yIndex = yBlock + threadIdx.y;
   unsigned int index_out, index_transpose;
   if (xIndex < width && yIndex < height)
   {
      unsigned int index_in = width * yIndex + xIndex;
      unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;
      block[index_block] = idata[index_in];
      index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
      index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
   }
   __syncthreads();
   if (xIndex < width && yIndex < height)
      odata[index_out] = block[index_transpose];
}

coalescing access

when Block/tile dimensions are multiples of 16 ???

About bank conflicts

https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/

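For a shared-memory tile of 32 × 32 elements, all elements of a column map to the same SMEM bank, which is the worst case for bank conflicts: reading one column causes a 32-way bank conflict.

Fortunately, it is enough to make the tile 33 elements wide instead of 32.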
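The padding trick can be verified by computing which bank each element of a column falls into. This is a small sketch with a hypothetical helper, assuming 32 banks that are each one 4-byte word wide and a row-major tile of floats.

```python
from collections import Counter

NUM_BANKS = 32   # assumed: 32 banks, one 4-byte word each

def column_conflict_degree(tile_width, column, rows=32):
    """Largest number of rows in one column that land in the same bank."""
    banks = Counter((row * tile_width + column) % NUM_BANKS for row in range(rows))
    return max(banks.values())
```

With a width of 32 the whole column lands in a single bank (degree 32); with a width of 33 each row shifts by one bank, so the degree drops to 1.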

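Optimization example 2: reduction

Problem: reduce all elements of a long array to a single sum.[^1][^2]

Overall approach

To avoid the heavy cost of global synchronization, use a multi-level reduction.

Reduction has low arithmetic intensity:
1 flop per element loaded (bandwidth-optimal)

so the optimization goal is to saturate memory bandwidth.

384-bit memory interface, 900 MHz DDR
384 * 1800 / 8 = 86.4 GB/s

Step 0: baseline, interleaved addressing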

__global__ void reduce0(int *g_idata, int *g_odata) {
   extern __shared__ int sdata[];

   // each thread loads one element from global to shared mem
   unsigned int tid = threadIdx.x;
   unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
   sdata[tid] = g_idata[i];
   __syncthreads();

   // do reduction in shared mem
   for(unsigned int s=1; s < blockDim.x; s *= 2) {
      if (tid % (2*s) == 0) {
         sdata[tid] += sdata[tid + s];
      }
      __syncthreads();
   }

   // write result for this block to global mem
   if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

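Fewer and fewer threads do work: all of them at the start, and only thread 0 in the last iteration.

Step 1: use contiguous thread indices

Just replace the divergent branch with a strided index and a non-divergent branch, although this introduces bank conflicts.

The reasoning has to do with how warps are issued. Suppose that here each warp runs 2 threads in parallel and that running one warp takes time T.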

Step0: 4+4+2+1=11T

Step1: 4+2+1+1=8T

for (unsigned int s=1; s < blockDim.x; s *= 2) {
   int index = 2 * s * tid;
   if (index < blockDim.x) {
      sdata[index] += sdata[index + s];
   }
   __syncthreads();
}

Step 2: sequential addressing

for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
   if (tid < s) {
      sdata[tid] += sdata[tid + s];
   }
   __syncthreads();
}

Original addressing

Now the addressing is contiguous on one side
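The three in-block loops differ only in which thread touches which element, so they should all produce the same sum. The sketch below replays each addressing scheme sequentially in Python. The function names are hypothetical, the input length is assumed to be a power of two, and the interleaved version uses the `tid % (2*s)` test from the original SC07 slides.

```python
def reduce_interleaved(data):
    """Step 0: interleaved addressing with a divergent branch."""
    s_data, n, s = list(data), len(data), 1
    while s < n:
        for tid in range(n):
            if tid % (2 * s) == 0:
                s_data[tid] += s_data[tid + s]
        s *= 2
    return s_data[0]

def reduce_strided(data):
    """Step 1: same elements, but active threads have contiguous indices."""
    s_data, n, s = list(data), len(data), 1
    while s < n:
        for tid in range(n):
            index = 2 * s * tid
            if index < n:
                s_data[index] += s_data[index + s]
        s *= 2
    return s_data[0]

def reduce_sequential(data):
    """Step 2: sequential addressing, where thread tid adds element tid + s."""
    s_data, n = list(data), len(data)
    s = n // 2
    while s > 0:
        for tid in range(s):
            s_data[tid] += s_data[tid + s]
        s >>= 1
    return s_data[0]
```

Running the threads one after another is a fair model here because, within one iteration, no thread reads an element that another thread writes in that same iteration.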

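Step 3: make up for the wasted threads

Approach: perform one reduction add while loading into SMEM. The number of blocks is halved, and the values that previously belonged to two blocks are loaded, added, and stored as a single sum.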

// perform first level of reduction,
// reading from global memory, writing to shared memory
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
__syncthreads();

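Step 4: unrolling the last warp

When s <= 32, only one warp is still working.

Relying on the warp's SIMD execution also saves the trouble of calling __syncthreads(). Note that on Volta and later GPUs, which schedule threads independently, this is only safe with volatile shared-memory accesses plus __syncwarp(), or with warp shuffles.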

for (unsigned int s=blockDim.x/2; s>32; s>>=1) 
{
   if (tid < s)
      sdata[tid] += sdata[tid + s];
   __syncthreads();
}
if (tid < 32)
{
   sdata[tid] += sdata[tid + 32]; 
   sdata[tid] += sdata[tid + 16]; 
   sdata[tid] += sdata[tid + 8]; 
   sdata[tid] += sdata[tid + 4]; 
   sdata[tid] += sdata[tid + 2]; 
   sdata[tid] += sdata[tid + 1]; 
}

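To keep the code tidy, the final if block also does some useless work; e.g., for the last thread in the warp only the first statement is useful.

Step 5: fully unroll the for loop according to blockSize and remove dead code

Because the loop halves s on every iteration and the s <= 32 part is handled separately, the loop body actually runs at most 3 times.

This is implemented with templates and the compiler's automatic optimization:

template <unsigned int blockSize>
__global__ void reduce5(int *g_idata, int *g_odata)

The code highlighted in red (the comparisons against blockSize) is evaluated at compile time.

Step 6: algorithmic optimization

Cascaded speedup??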

Cost= processors × time complexity

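A direct binary-tree reduction of N elements takes O(log N) time,
so with N threads Cost = N*O(log N).

But suppose only P threads are used: each first performs N/P serial additions, followed by a log(P) reduction.
Total cost = P(N/P+log(P))

When P=N/log(N), cost=O(N)

So the launch should be configured such that each thread sums O(log n) elements.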
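The step from the total cost to O(N) can be spelled out as follows. This is a sketch of the standard argument, with logarithms taken base 2 and N at least 2 assumed.

```latex
\begin{align*}
C(P) &= P\left(\frac{N}{P} + \log P\right) = N + P\log P,\\
P = \frac{N}{\log N} \;\Rightarrow\; P\log P
     &= \frac{N}{\log N}\bigl(\log N - \log\log N\bigr) \le N,\\
C\!\left(\frac{N}{\log N}\right) &\le 2N = O(N),\qquad
T = \frac{N}{P} + \log P \le 2\log N = O(\log N).
\end{align*}
```

So with P = N/log N threads the total work matches the sequential O(N) cost, while the running time stays within a constant factor of the O(log N) tree reduction.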

For example, 1024 or 2048 elements per block with 256 threads, so each thread sums n=4 (or 8) elements. The exact parameters have to be found by profiling.

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n) {
   sdata[tid] += g_idata[i] + g_idata[i+blockSize];
   i += gridSize;
}
__syncthreads();

final code

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
{
   extern __shared__ int sdata[];

   unsigned int tid = threadIdx.x;
   unsigned int i = blockIdx.x*(blockSize*2) + tid;
   unsigned int gridSize = blockSize*2*gridDim.x;
   sdata[tid] = 0;

   while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; }
   __syncthreads();

   if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
   if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
   if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }

   if (tid < 32) {
      if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
      if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
      if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
      if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
      if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
      if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
   }
   if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

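A note on if statements

An if statement is not a problem as long as, at run time, all threads take the if branch or all take the else branch. Waiting only happens when some threads take the if and others take the else.

It may not even be necessary for all threads to agree: it may be enough that each run of 32 consecutive thread indices takes the same path. (A guess; needs testing.)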
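Branch divergence is decided per warp, so the guess can be explored by checking, warp by warp, whether a predicate gives the same answer for all 32 lanes. The sketch below is a CPU-side model with a hypothetical helper; it assumes warps are formed from consecutive thread indices in groups of 32.

```python
WARP_SIZE = 32

def divergent_warps(num_threads, predicate):
    """Indices of warps whose lanes disagree on the branch predicate."""
    result = []
    for warp_start in range(0, num_threads, WARP_SIZE):
        lanes = range(warp_start, min(warp_start + WARP_SIZE, num_threads))
        outcomes = {bool(predicate(tid)) for tid in lanes}
        if len(outcomes) > 1:
            result.append(warp_start // WARP_SIZE)
    return result
```

In this model, `tid % 2 == 0` diverges in every warp, `tid < 64` diverges in none because the boundary falls on a warp edge, and `tid < 48` diverges only in warp 1.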

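On latency hiding

Latency is hidden by increasing the number of threads in a block and having them issue reads at the same time. This hides not only global memory latency but also read-after-write latency.

Inspecting per-thread resources

Too many threads leave fewer registers and less SMEM for each of them.

Compile with the -cubin option; the first few lines of the .cubin file show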

architecture {sm_10}
abiversion {0}
modname {cubin}
code {
   name = BlackScholesGPU
   lmem = 0    # per thread local memory
   smem = 68   # per thread block shared memory
   reg = 20    # per thread registers

References

[^1]: SC07 Optimizing Parallel Reduction in CUDA - Mark Harris

[^2]: Deng Yangdong (Tsinghua University), CUDA lecture PDF, 2009. Note that it is also based on the SC07 NVIDIA material.

Posted 2022-05-12 | Updated 2026-02-18 | Tutorials | 3 minutes read (About 424 words)

Nvprof

Installation

$ which nvprof
/usr/local/cuda/bin/nvprof

Basic usage

Summary mode

Run it directly from the command line

nvprof ./myApp

GPU trace mode

nvprof --print-gpu-trace ./myApp

Saving to a log file

sudo /usr/local/cuda/bin/nvprof --log-file a.log --metrics achieved_occupancy /staff/shaojiemike/github/cutests/22-commonstencil/common

Visualization

Nsight can be run directly on the remote machine
1. ssh -X host
2. .ssh/config
  1. add
  2. XAuthLocation /opt/X11/bin/xauth #for macbookAir
  3. ForwardX11Trusted yes
  4. ForwardX11 yes
Visual Profiler can also connect to the remote machine directly over ssh,
or the profiling results can be exported and then opened in Visual Profiler for visualization

nvprof --export-profile timeline.prof <app> <app args>
nvprof --analysis-metrics -o nbody-analysis.nvprof ./myApp

profile kernel

sudo /usr/local/cuda/bin/ncu -k stencil_kernel -s 0 -c 1 /staff/shaojiemike/github/cutests/22-commonstencil/best

ncu-ui is the graphical interface, but I have not figured it out yet.

Bandwidth profiling

Measuring the upper bound

# shaojiemike @ snode0 in ~/github/cuda-samples-11.0 [16:02:08]
$ ./bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P40
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     11.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     13.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     244.3

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

# shaojiemike @ snode0 in ~/github/cuda-samples-11.0 [16:03:24]
$ ./bin/x86_64/linux/release/p2pBandwidthLatencyTest

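Actual values

With nvprof this is done by specifying the metrics for DRAM, L1, or L2. See the official documentation for detailed explanations.

On Maxwell and Pascal, L1 is unified with the texture cache while SMEM is separate; from Volta onward, L1 and SMEM share the same on-chip memory.

Metric Name	Description
achieved_occupancy	Ratio of the average number of active warps per active cycle to the maximum number of warps supported on an SM
dram_read_throughput
dram_utilization	Utilization level of device memory relative to peak utilization, on a scale of 0 to 10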
shared_load_throughput
shared_utilization
l2_utilization

Further study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants~~

References

None

Posted 2022-05-06 | Updated 2026-02-18 | toLearn | a few seconds read (About 41 words)

Php

enable php curl extension

Further study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants~~

References

None

Posted 2022-05-03 | Updated 2026-02-18 | Tutorials | a minute read (About 149 words)

Hugo

Hugo is a Go-based static site generator, first released in 2013, known for its speed and flexibility.
Hugo has set itself apart by being fast. More precisely, it has set itself apart by being much faster than Jekyll.
Jekyll uses Liquid as its templating language. Hugo uses Go templating. Most people seem to agree that it is a little bit easier to learn Jekyll’s syntax than Hugo’s.^1

Posted 2022-04-27 | Updated 2026-02-18 | Thinking | 40 minutes read (About 6021 words)

Presentation & Visualization : PPT

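Introduction

Academic talks:

Goal: help the audience understand the underlying principles.

Work reports:

Goal: the listeners can follow, understand the background and the key and difficult points of the work, and are clear about the results achieved at this stage.
Structure: organize the content with the STAR method.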

Posted 2022-04-23 | Updated 2026-02-18 | network | 10 minutes read (About 1530 words)

Tcpdump & wireshark

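Check the machine's public IP from the command line

> curl myip.ipip.net
当前 IP：117.136.101.72 来自于：中国安徽移动

Check whether a port on a machine is open

# For a web service, download directly and inspect the content
wget 4.shaojiemike.top:28096
# -z tells nc to only scan for listening ports without sending any data; -v prints more detail.
nc -z -v 4.shaojiemike.top 28096

Or scan a specific port

# IPv6 works too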
$ nmap -6 -p 8096 2001:da8:d800:611:5464:f7ab:9560:a646
Starting Nmap 7.80 ( https://nmap.org ) at 2023-01-04 19:33 CST
Nmap scan report for 2001:da8:d800:611:5464:f7ab:9560:a646
Host is up (0.00099s latency).

PORT     STATE SERVICE
8096/tcp open  unknown

Nmap done: 1 IP address (1 host up) scanned in 0.05 seconds

$ nmap -p 28096 4.shaojiemike.top
Starting Nmap 7.80 ( https://nmap.org ) at 2023-01-04 19:19 CST
Nmap scan report for 4.shaojiemike.top (114.214.181.97)
Host is up (0.0011s latency).

PORT      STATE SERVICE
28096/tcp open  unknown

Nmap done: 1 IP address (1 host up) scanned in 0.05 seconds

Scanning all ports also works, but it is very slow (about 50 minutes).

sudo nmap -sT -p- 4.shaojiemike.top

wireshark

Display filters

Entered in the filter bar at the top

tcp.port==80&&(ip.dst==192.168.1.2||ip.dst==192.168.1.3)

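ip.addr ==192.168.1.1 // show all packets whose source or destination address is 192.168.1.1
eth.addr== 80:f6:2e:ce:3f:00 // filter by MAC address; see "Wireshark: filtering by MAC address/physical address" for details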
tcp.port==23

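Capture filters

Set in the capture options before capturing, so that only matching packets are captured. This avoids large capture files and high memory use, but the network environment during the test can no longer be reproduced in full.

host 192.168.1.1 // capture all packets received and sent by 192.168.1.1
src host 192.168.1.1 // source address: all packets sent by 192.168.1.1
dst host 192.168.1.1 // destination address: all packets received by 192.168.1.1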

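Meaning of the colors

tcpdump

The traditional command-line packet capture tool

Common options

Note the and between filter expressions

-nn :
1. A single n means do not resolve hostnames; show IP addresses directly.
2. Two n's mean do not resolve hostnames or port names.
3. This makes IPs and port numbers easy to read,
4. and skipping name resolution is much faster.
-i selects the interface; -D lists the interfaces
-v, -vv, and -vvv show increasingly detailed information
port 80 captures traffic on port 80, usually HTTP. Prefix it with src or dst to restrict the direction.
1. tcpdump -i eth0 -n arp net 192.168.199 captures ARP packets on the 192.168.199.* subnet; arp can be replaced with tcp, udp, and so on.
-A, -X, and -xx show progressively more of the packet contents
-e : show link-layer information.
1. By default tcpdump does not show link-layer information; with -e it shows the source and destination MAC addresses as well as VLAN tag information.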

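Reading the output

192.168.1.106.56166 > 124.192.132.54.80

The source IP is 192.168.1.106 and the source port is 56166;
the destination address is 124.192.132.54 and the destination port is 80.
The > symbol indicates the direction of the data.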
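Since tcpdump glues the port onto the address with a dot, splitting on the last dot recovers both parts. The sketch below uses hypothetical helper names and assumes an IPv4 address or hostname followed by a port number or service name, with any trailing colon already removed.

```python
def split_endpoint(endpoint):
    """Split 'host.port' on the last dot; the port may be a number or a service name."""
    host, _, port = endpoint.rpartition(".")
    return host, port

def parse_flow(text):
    """Parse 'src.port > dst.port' into ((src, port), (dst, port))."""
    src, dst = (part.strip() for part in text.split(">"))
    return split_endpoint(src), split_endpoint(dst)
```

The port is kept as a string because, without -nn, tcpdump prints service names such as http in place of the number.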

Flags

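Flags commonly seen on TCP segments, for example during the three-way handshake:

[S] : SYN (open a connection)
[.] : ACK with no other flag set
[P] : PSH (push data)
[F] : FIN (close a connection)
[R] : RST (reset a connection)

Common uses

Given a destination IP, find which interface and port the traffic goes through
Can capture packets of all kinds of protocols, such as ping and ssh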

Case study

curl --trace-ascii - www.github.com

The IP of GitHub is 20.205.243.166

ifconfig shows that the ibs5 interface has an upper bandwidth limit of 21 TB, so it must be the IB (InfiniBand) card.

sudo tcpdump -i ibs5 '((tcp) and (host 20.205.243.166))'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ibs5, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
15:53:53.848619 IP snode0.59878 > 20.205.243.166.http: Flags [S], seq 879685062, win 64128, options [mss 2004,sackOK,TS val 4096492456 ecr 0,nop,wscale 7], length 0
15:53:53.952705 IP 20.205.243.166.http > snode0.59878: Flags [S.], seq 1917452372, ack 879685063, win 65535, options [mss 1436,sackOK,TS val 1127310087 ecr 4096492456,nop,wscale 10], length 0
15:53:53.952728 IP snode0.59878 > 20.205.243.166.http: Flags [.], ack 1, win 501, options [nop,nop,TS val 4096492560 ecr 1127310087], length 0
15:53:53.953208 IP snode0.59878 > 20.205.243.166.http: Flags [P.], seq 1:79, ack 1, win 501, options [nop,nop,TS val 4096492561 ecr 1127310087], length 78: HTTP: GET / HTTP/1.1
15:53:54.058654 IP 20.205.243.166.http > snode0.59878: Flags [P.], seq 1:89, ack 79, win 64, options [nop,nop,TS val 1127310193 ecr 4096492561], length 88: HTTP: HTTP/1.1 301 Moved Permanently
15:53:54.058668 IP snode0.59878 > 20.205.243.166.http: Flags [.], ack 89, win 501, options [nop,nop,TS val 4096492666 ecr 1127310193], length 0
15:53:54.059092 IP snode0.59878 > 20.205.243.166.http: Flags [F.], seq 79, ack 89, win 501, options [nop,nop,TS val 4096492667 ecr 1127310193], length 0
15:53:54.162608 IP 20.205.243.166.http > snode0.59878: Flags [F.], seq 89, ack 80, win 64, options [nop,nop,TS val 1127310297 ecr 4096492667], length 0

$ sudo tcpdump -i ibs5 -nn -vvv -e '((port 80) and (tcp) and (host 20.205.243.166))'
tcpdump: listening on ibs5, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
16:09:38.743478 Out ethertype IPv4 (0x0800), length 76: (tos 0x0, ttl 64, id 15215, offset 0, flags [DF], proto TCP (6), length 60)
    10.1.13.50.38376 > 20.205.243.166.80: Flags [S], cksum 0x1fd5 (incorrect -> 0x98b6), seq 1489092902, win 64128, options [mss 2004,sackOK,TS val 4097437351 ecr 0,nop,wscale 7], length 0
16:09:38.848164  In ethertype IPv4 (0x0800), length 76: (tos 0x0, ttl 48, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    20.205.243.166.80 > 10.1.13.50.38376: Flags [S.], cksum 0x69ba (correct), seq 3753100548, ack 1489092903, win 65535, options [mss 1436,sackOK,TS val 3712395681 ecr 4097437351,nop,wscale 10], length 0
16:09:38.848212 Out ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 64, id 15216, offset 0, flags [DF], proto TCP (6), length 52)
    10.1.13.50.38376 > 20.205.243.166.80: Flags [.], cksum 0x1fcd (incorrect -> 0x9613), seq 1, ack 1, win 501, options [nop,nop,TS val 4097437456 ecr 3712395681], length 0
16:09:38.848318 Out ethertype IPv4 (0x0800), length 146: (tos 0x0, ttl 64, id 15217, offset 0, flags [DF], proto TCP (6), length 130)
    10.1.13.50.38376 > 20.205.243.166.80: Flags [P.], cksum 0x201b (incorrect -> 0x9f0a), seq 1:79, ack 1, win 501, options [nop,nop,TS val 4097437456 ecr 3712395681], length 78: HTTP, length: 78
        GET / HTTP/1.1
        Host: www.github.com
        User-Agent: curl/7.68.0
        Accept: */*

16:09:38.954152  In ethertype IPv4 (0x0800), length 156: (tos 0x0, ttl 48, id 45056, offset 0, flags [DF], proto TCP (6), length 140)
    20.205.243.166.80 > 10.1.13.50.38376: Flags [P.], cksum 0x024d (correct), seq 1:89, ack 79, win 64, options [nop,nop,TS val 3712395786 ecr 4097437456], length 88: HTTP, length: 88
        HTTP/1.1 301 Moved Permanently
        Content-Length: 0
        Location: https://www.github.com/

16:09:38.954207 Out ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 64, id 15218, offset 0, flags [DF], proto TCP (6), length 52)
    10.1.13.50.38376 > 20.205.243.166.80: Flags [.], cksum 0x1fcd (incorrect -> 0x949a), seq 79, ack 89, win 501, options [nop,nop,TS val 4097437562 ecr 3712395786], length 0
16:09:38.954884 Out ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 64, id 15219, offset 0, flags [DF], proto TCP (6), length 52)
    10.1.13.50.38376 > 20.205.243.166.80: Flags [F.], cksum 0x1fcd (incorrect -> 0x9498), seq 79, ack 89, win 501, options [nop,nop,TS val 4097437563 ecr 3712395786], length 0
16:09:39.060177  In ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 48, id 45057, offset 0, flags [DF], proto TCP (6), length 52)
    20.205.243.166.80 > 10.1.13.50.38376: Flags [F.], cksum 0x95e2 (correct), seq 89, ack 80, win 64, options [nop,nop,TS val 3712395892 ecr 4097437563], length 0
16:09:39.060221 Out ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 64, id 15220, offset 0, flags [DF], proto TCP (6), length 52)
    10.1.13.50.38376 > 20.205.243.166.80: Flags [.], cksum 0x1fcd (incorrect -> 0x93c4), seq 80, ack 90, win 501, options [nop,nop,TS val 4097437668 ecr 3712395892], length 0
16:09:46.177269 Out ethertype IPv4 (0x0800), length 76: (tos 0x0, ttl 64, id 38621, offset 0, flags [DF], proto TCP (6), length 60)

The IP of snode0 is 10.1.13.50

traceroute

mtr = traceroute+ping

$ traceroute www.baidu.com
traceroute to www.baidu.com (182.61.200.6), 30 hops max, 60 byte packets
1  acsa-nfs (10.1.13.1)  0.179 ms  0.180 ms  0.147 ms
2  192.168.252.1 (192.168.252.1)  2.016 ms  1.954 ms  1.956 ms
3  202.38.75.254 (202.38.75.254)  4.942 ms  3.941 ms  4.866 ms

The traceroute command shows the path that packets take to reach a host.

NetworkManager management

# shaojiemike @ snode0 in /etc/NetworkManager [16:49:55]
$ nmcli general status
STATE         CONNECTIVITY  WIFI-HW  WIFI     WWAN-HW  WWAN
disconnected  unknown       enabled  enabled  enabled  enabled

# shaojiemike @ snode0 in /etc/NetworkManager [16:50:40]
$ nmcli connection show
NAME                     UUID                                  TYPE        DEVICE
InfiniBand connection 1  7edf4eea-0591-48ba-868a-e66e8cb720ce  infiniband  --

It looks as though it was used at some point before.

# shaojiemike @ snode0 in /etc/NetworkManager [16:56:36] C:127
$ service network-manager status
● NetworkManager.service - Network Manager
     Loaded: loaded (/lib/systemd/system/NetworkManager.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2022-03-14 11:52:06 CST; 1 months 10 days ago
       Docs: man:NetworkManager(8)
   Main PID: 1339 (NetworkManager)
      Tasks: 3 (limit: 154500)
     Memory: 12.0M
     CGroup: /system.slice/NetworkManager.service
             └─1339 /usr/sbin/NetworkManager --no-daemon

Warning: some journal files were not opened due to insufficient permissions.

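It was probably set up following this: Secure site-to-site connection with Linux IPsec VPN

Further study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants~~

FJW says all network traffic goes out together through the NFS node.

References

None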