VtuneOptimize

VTune installation and profiling

Usage

Since snode0 has sudo access:

```bash
source /opt/intel/oneapi/setvars.sh
sudo vtune-gui
```

For why the GUI cannot be opened through MobaXterm after sudo, see this reference.

Step 1: Performance Snapshot metrics explained

Using the baseline of the IPCC 2022 preliminary-round pivot (support point) computation as the example.

Logical Core Utilization

```
Effective Logical Core Utilization: 3.8% (2.436 out of 64)
Effective Physical Core Utilization: 6.4% (2.053 out of 32)
```

CPU utilization here means the fraction of effective compute: 100% would mean every logical CPU is fully occupied by the application's computation.

Microarchitecture Usage

Microarchitecture Usage is a key metric that helps estimate (as a percentage) how efficiently your code runs on the current microarchitecture.

Microarchitecture usage can be limited by:

  1. long-latency memory accesses;
  2. floating-point or SIMD operations;
  3. non-retired instructions due to branch mispredictions;
  4. instruction starvation in the front-end.

VTune's suggestions

```
Microarchitecture Usage: 37.7% of Pipeline Slots
Retiring: 37.7%
Front-End Bound: 16.9%
Back-End Bound: 23.8%
Memory Bound: 11.9%
Core Bound: 11.9%
Bad Speculation: 21.5%
```

VTune's suggestion for Back-End Bound (23.8%):

A significant portion of pipeline slots are remaining empty.
(Does this mean 23.8% of the slots are empty, or 23.8% are being used?)

When operations take too long in the back-end, they introduce bubbles in the pipeline that ultimately cause fewer pipeline slots containing useful work to be retired per cycle than the machine is capable to support.

This opportunity cost results in slower execution.

  1. Long-latency operations like divides and memory operations can cause this,
  2. as can too many operations being directed to a single execution port (for example, more multiply operations arriving in the back-end per cycle than the execution unit can support).

VTune's suggestion for Bad Speculation (21.5%):

A significant proportion (21.5%) of pipeline slots containing useful work are being cancelled.

This can be caused by mispredicting branches or by machine clears. Note that this metric value may be highlighted due to Branch Resteers issue.

Retiring metric

The Retiring metric represents the fraction of Pipeline Slots utilized by useful work, meaning the issued uOps that eventually get retired.

Ideally, all Pipeline Slots would be attributed to the Retiring category.

Retiring of 100% would indicate that the maximum possible number of uOps retired per cycle has been achieved.

Maximizing Retiring typically increases the Instructions-Per-Cycle (IPC) metric.

Note that a high Retiring value does not necessarily mean there is no more room for performance improvement.
For example, Microcode assists are categorized under Retiring. They hurt performance and can often be avoided.

Intel's explanation of microcode assists:

When certain special computations are encountered (for example, handling very small floating-point values, so-called denormals), the floating-point unit is not set up to perform these operations natively. A small program, possibly hundreds of instructions long, has to be inserted into the instruction stream, which hurts performance significantly.

Front-End Bound

The Front-End Bound metric represents the fraction of slots where the processor's Front-End undersupplies its Back-End, i.e. whether the front-end produces enough instructions to keep the back-end busy.

The Front-End denotes the first part of the processor core, responsible for fetching operations that are executed later on by the Back-End; it breaks instructions down into uOps for the back-end.

Within the Front-End, a branch predictor predicts the next address to fetch, cache lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uOps).

The Front-End Bound metric denotes unutilized issue slots when there is no Back-End stall (bubbles where the Front-End delivered no uOps while the Back-End could have accepted them). For example, stalls due to instruction-cache misses would be categorized as Front-End Bound.

Back-End Bound

This metric represents the fraction of Pipeline Slots where no uOps are being delivered because the Back-End lacks the resources to accept new uOps, i.e. whether uOps pile up because the back-end's hardware resources are too busy to process them.

The Back-End is the portion of the processor core where an out-of-order scheduler dispatches ready uOps into their respective execution units and, once completed, these uOps get retired according to the program order (out-of-order execution, in-order retirement).

For example, stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized as Back-End Bound. Back-End Bound is further divided into two main categories: Memory Bound and Core Bound.

Memory Bound

This metric shows how memory subsystem issues affect the performance. Memory Bound measures a fraction of slots where pipeline could be stalled due to demand load or store instructions. This accounts mainly for incomplete in-flight memory demand loads that coincide with execution starvation in addition to less common cases where stores could imply back-pressure on the pipeline.

Core Bound

This metric represents how much Core non-memory issues were a bottleneck.

  1. Shortage in hardware compute resources,
  2. or dependencies between the software's instructions are both categorized under Core Bound.

Hence it may indicate that

  1. the machine ran out of out-of-order (OOO) resources,
  2. certain execution units are overloaded,
  3. or dependencies in the program's data or instruction flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations; see the sketch below).
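
To illustrate the last point, here is a minimal sketch of my own (not from the VTune documentation), assuming a plain array `a` of length `n`: a single FP dependency chain versus independent accumulators. The chained form is the kind of pattern that shows up as Core Bound.

```c
/* Chained version: every addition depends on the previous one, so throughput is
 * limited by FP-add latency rather than by the number of FP execution units. */
double sum = 0.0;
for (int i = 0; i < n; i++)
    sum += a[i];

/* Two independent accumulators: the chains overlap in the out-of-order back-end
 * and are combined once at the end. */
double s0 = 0.0, s1 = 0.0;
for (int i = 0; i + 1 < n; i += 2) {
    s0 += a[i];
    s1 += a[i + 1];
}
double sum2 = s0 + s1;
if (n & 1)
    sum2 += a[n - 1];   /* leftover element when n is odd */
```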

Bad Speculation (mis-speculation)

This metric represents the fraction of Pipeline Slots wasted due to incorrect speculations.

This includes slots used to issue uOps that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from an earlier incorrect speculation.

For example, wasted work due to mispredicted branches is categorized as a Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.

My guess is that the "Nukes" here are data-speculation mistakes whose memory-ordering fallout hits the pipeline as hard as a nuke.

Memory Bound

```
Memory Bound: 11.9% of Pipeline Slots
L1 Bound: 7.9%
L2 Bound: 0.2%
L3 Bound: 2.5%
DRAM Bound: 2.0%
Store Bound: 0.3%
NUMA: % of Remote Accesses: 13.2%
```

This metric shows how memory subsystem issues affect the performance. Memory Bound measures the fraction of slots where the pipeline could be stalled due to demand load or store instructions, i.e. how many pipeline slots are forced to wait on loads or stores.

This accounts mainly for incomplete in-flight memory demand loads that coincide with execution starvation (does this mean non-contiguous memory accesses?), in addition to less common cases where stores could imply back-pressure on the pipeline.

L1 Bound

This metric shows how often the machine was stalled without missing the L1 data cache, i.e. how often it stalls for some other reason even though L1 hits.

The L1 cache typically has the shortest latency. However, in certain cases, like loads blocked on older stores, a load might suffer a high latency even though it is being satisfied by the L1: loading a value that was just stored can still see a large delay.

L2 Bound

This metric shows how often machine was stalled on L2 cache. Avoiding cache misses (L1 misses/L2 hits) will improve the latency and increase performance.

L3 Bound

This metric shows how often the CPU was stalled on the L3 cache, or contended with a sibling core. Avoiding cache misses (L2 misses/L3 hits) improves the latency and increases performance.

DRAM Bound

This metric shows how often CPU was stalled on the main memory (DRAM). Caching typically improves the latency and increases performance.

DRAM Bandwidth Bound

This metric represents percentage of elapsed time the system spent with high DRAM bandwidth utilization. Since this metric relies on the accurate peak system DRAM bandwidth measurement, explore the Bandwidth Utilization Histogram and make sure the Low/Medium/High utilization thresholds are correct for your system. You can manually adjust them, if required.

Store Bound

This metric shows how often CPU was stalled on store operations. Even though memory store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls.

NUMA: % of Remote Accesses

In NUMA (non-uniform memory architecture) machines, memory requests missing LLC may be serviced either by local or remote DRAM. Memory requests to remote DRAM incur much greater latencies than those to local DRAM. It is recommended to keep as much frequently accessed data local as possible. This metric shows percent of remote accesses, the lower the better.

The earlier approach can be reused here.

Vectorization

This metric represents the percentage of packed (vectorized) floating point operations. 0% means that the code is fully scalar. The metric does not take into account the actual vector length that was used by the code for vector instructions. So if the code is fully vectorized and uses a legacy instruction set that loaded only half a vector length, the Vectorization metric shows 100%.

```
Vectorization: 23.7% of Packed FP Operations
Instruction Mix:
  SP FLOPs: 0.9%
    Packed: 99.9%
      128-bit: 0.1%
      256-bit: 99.8%
      512-bit: 0.0%
    Scalar: 0.1%
  DP FLOPs: 2.9%
    Packed: 0.0%
    Scalar: 100.0%
  x87 FLOPs: 0.0%
  Non-FP: 96.2%
FP Arith/Mem Rd Instr. Ratio: 0.091
FP Arith/Mem Wr Instr. Ratio: 0.308
```

VTune's suggestion for Vectorization (23.7%):

A significant fraction of floating point arithmetic instructions are scalar. Use Intel Advisor to see possible reasons why the code was not vectorized.

SP FLOPs

The metric represents the percentage of single precision floating point operations from all operations executed by the applications. Use the metric for rough estimation of a SP FLOP fraction. If FMA vector instructions are used the metric may overcount.

X87 FLOPs

The metric represents the percentage of x87 floating point operations from all operations executed by the applications. Use the metric for rough estimation of an x87 fraction. If FMA vector instructions are used the metric may overcount.

x87 is the floating-point subset of the x86 instruction set. It originated as an extension of the 8086 instruction set, in the form of optional floating-point coprocessors that worked with the corresponding x86 CPUs; the part numbers of those chips ended in "87".

FP Arith/Mem Rd Instr. Ratio

This metric represents the ratio between arithmetic floating-point instructions and memory read instructions. A value less than 0.5 indicates unaligned data access for vector operations, which can negatively impact the performance of vector instruction execution.

Step 2: Hotspots

User-Mode Sampling can only collect data from a single core, and is used here to analyze algorithmic optimizations.

Hardware Event-Based Sampling can collect from all cores, but does it have to run for less than a few seconds?

The hardware collection was slow, and it errored out halfway through. What happened?

Posts online say it is a root-permission issue, but I was already running it as root.

Oddly enough, Hardware Event-Based Sampling and the Microarchitecture analysis run fine as a normal user.

example


Manually vectorize this region.

The core cost is $k \cdot n^2$ absolute-difference operations, with a max taken over k and the results summed.

Optimization ideas:

  1. Manual vectorization (processing p elements at a time)

    In the first n-level loop, load the k values rebuilt[i*k+ki], each broadcast (repeated) into a vector register.

    In the second n-level loop, load k groups of p consecutive elements into vector registers. Pad the tail with zeros as a special case; since n is usually a multiple of 4 this may not be needed, but with 8-wide vectors it would be.

    Cache the vector fabs results in k vector registers.

    Then do an element-wise vector max across these k registers into a single vector register; padding with zeros does not affect the max.

    Finally, do an in-register horizontal sum of that vector register and add the result to chebyshevSum.

    This realizes the vector operation over p elements and needs 3*k vector registers per pass.

  2. Manual data prefetching

    1. __builtin_prefetch() (see the sketch after this list)
  3. Manual loop unrolling to form a compute/memory pipeline

    1. How should the unroll factor be chosen from the input size?
  4. Blocking (tiling)
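
A minimal sketch of idea 2, assuming the baseline layout rebuiltCoord[ki*n + j] and k = 2 as in this problem; the prefetch distance DIST is a guess to be tuned, not a measured value:

```c
/* __builtin_prefetch(addr, rw, locality): rw = 0 means read, locality = 3 means
 * keep the line in all cache levels. DIST is an assumed tuning knob. */
#define DIST 16
for (j = i + 1; j < n; j++) {
    __builtin_prefetch(&rebuiltCoord[0 * n + j + DIST], 0, 3);  /* row ki = 0 */
    __builtin_prefetch(&rebuiltCoord[1 * n + j + DIST], 0, 3);  /* row ki = 1 (k = 2 here) */
    /* ... original fabs / max / chebyshevSum body over ki ... */
}
```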

Memory-access analysis

Corresponding GitHub project and contest problem

HPL-PL

Machine used for reproduction

```
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 36
On-line CPU(s) list: 0-35
Thread(s) per core: 1
Core(s) per socket: 18
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
Stepping: 1
CPU MHz: 1296.157
CPU max MHz: 3300.0000
CPU min MHz: 1200.0000
BogoMIPS: 4199.98
Virtualization: VT-x
L1d cache: 1.1 MiB
L1i cache: 1.1 MiB
L2 cache: 9 MiB
L3 cache: 90 MiB
```

baseline

```bash
$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
$ gcc -std=c11 conway.c -o Conway
$ ./Conway
……
Iter 997...
Iter 998...
Iter 999...
136527.433000 ms
```

Optimization steps

Since -O3 and parallelization make the hotspot code unreadable,

in this iteratively-optimizable example we first maximize single-core performance according to VTune.

This is clearly not a compute-bound application. The key points are forming pipelines to maximize bandwidth utilization, and partitioning/reusing elements to raise the cache hit rate (vectorization still speeds up the compute part noticeably).


  1. Replace the if with tmp[i][j] = (!(cnt^3))||((a[i][j]&1)&&(!(cnt^4))); (a branchless update; see the sketch below)
  2. Remove unnecessary intermediate copies
  3. Change int to char
  4. OMP_PROC_BIND=true to bind threads to their local processors and local memory
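
For step 1, a sketch of the replacement (the branchy form is my reconstruction of the original if, not copied from the repository); cnt is assumed to count the 3x3 neighborhood including the cell itself:

```c
/* Branchy version (reconstructed): two data-dependent branches per cell. */
if (cnt == 3)
    tmp[i][j] = 1;                       /* born, or alive with 2 neighbors */
else if ((a[i][j] & 1) && cnt == 4)
    tmp[i][j] = 1;                       /* alive with 3 neighbors */
else
    tmp[i][j] = 0;

/* Branchless version used in step 1: same truth table, no hard-to-predict branches. */
tmp[i][j] = (!(cnt ^ 3)) || ((a[i][j] & 1) && (!(cnt ^ 4)));
```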

Further study needed

None for now

Problems encountered

None for now

Motivation, summary, reflections, rants~~

  1. My labmate 黄业琦 took part in the HPC-PL all-star event, and I wanted to reproduce the results.
  2. Nvidia Nsight was a pleasure to use before, while I have rarely used VTune's memory-access analysis and its assembly-level views. I wanted to try optimizing from the angle of improving compute pipelining and contiguous memory-access pipelining, combined with VTune.

References

Vtune Assembly Analysis

Collecting with VTune's command line on the supercomputer, then analyzing the result file

  1. First, find the vtune executable

    ```bash
    > module load intel/2022.1
    > which icc
    /public1/soft/oneAPI/2022.1/compiler/latest/linux/bin/intel64/icc
    > cd /public1/soft/oneAPI/2022.1
    > find . -executable -type f -name "*vtune*"
    ./vtune/2022.0.0/bin64/vtune-worker-crash-reporter
    ./vtune/2022.0.0/bin64/vtune-gui.desktop
    ./vtune/2022.0.0/bin64/vtune-gui
    ./vtune/2022.0.0/bin64/vtune-agent
    ./vtune/2022.0.0/bin64/vtune-self-checker.sh
    ./vtune/2022.0.0/bin64/vtune-backend
    ./vtune/2022.0.0/bin64/vtune-worker
    ./vtune/2022.0.0/bin64/vtune
    ./vtune/2022.0.0/bin64/vtune-set-perf-caps.sh
    ```
  2. Get the equivalent command line from vtune-gui

    ```bash
    /opt/intel/oneapi/vtune/2021.1.1/bin64/vtune -collect hotspots -knob enable-stack-collection=true -knob stack-size=4096 -data-limit=1024000 -app-working-dir /home/shaojiemike/github/IPCC2022first/build/bin -- /home/shaojiemike/github/IPCC2022first/build/bin/pivot /home/shaojiemike/github/IPCC2022first/src/uniformvector-2dim-5h.txt
    ```
  3. Write sbatch_vtune.sh

    ```bash
    #!/bin/bash
    #SBATCH -o ./slurmlog/job_%j_rank%t_%N_%n.out
    #SBATCH -p IPCC
    #SBATCH -t 15:00
    #SBATCH --nodes=1
    #SBATCH --exclude=
    #SBATCH --cpus-per-task=64
    #SBATCH --mail-type=FAIL
    #SBATCH [email protected]

    source /public1/soft/modules/module.sh
    module purge

    module load intel/2022.1

    logname=vtune
    export OMP_PROC_BIND=close; export OMP_PLACES=cores
    # ./pivot |tee ./log/$logname
    /public1/soft/oneAPI/2022.1/vtune/2022.0.0/bin64/vtune -collect hotspots -knob enable-stack-collection=true -knob stack-size=4096 -data-limit=1024000 -app-working-dir /public1/home/ipcc22_0029/shaojiemike/slurm -- /public1/home/ipcc22_0029/shaojiemike/slurm/pivot /public1/home/ipcc22_0029/shaojiemike/slurm/uniformvector-2dim-5h.txt |tee ./log/$logname
    ```
  4. The log file is shown below, but when the generated trace file r000hs is imported, the AMD processor is not recognized

    ```
    > cat log/vtune
    dim = 2, n = 500, k = 2
    Using time : 452.232000 ms
    max : 143 351 58880.823709
    min : 83 226 21884.924801
    Elapsed Time: 0.486s
    CPU Time: 3.540s
    Effective Time: 3.540s
    Spin Time: 0s
    Overhead Time: 0s
    Total Thread Count: 8
    Paused Time: 0s

    Top Hotspots
    Function          Module  CPU Time  % of CPU Time(%)
    ---------------   ------  --------  ----------------
    SumDistance       pivot   0.940s    26.6%
    _mm256_add_pd     pivot   0.540s    15.3%
    _mm256_and_pd     pivot   0.320s    9.0%
    _mm256_loadu_pd   pivot   0.300s    8.5%
    Combination       pivot   0.250s    7.1%
    [Others]          N/A     1.190s    33.6%
    ```

Assembly

```bash
objdump -Sd ../build/bin/pivot > pivot1.s
gcc -S -O3 -fverbose-asm ../src/pivot.c -o pivot_O1.s
```

Assembly analysis tips

https://blog.csdn.net/thisinnocence/article/details/80767776

How to choose between GNU (AT&T) and Intel assembly syntax


VTune assembly example

(compiled without -O3, i.e. the default level)


The stack offset -64 is k,

and -50 is ki.


CDQE sign-extends EAX into RAX: it copies the sign bit (bit 31) of the doubleword in EAX into the upper 32 bits of RAX.

The q suffix of the movsd here means 64-bit in Intel terms: a 128-bit register is being used to do only 64-bit work, so nothing was auto-vectorized.

Generating -O3 assembly annotated with source

To have C variable names appear as comments in the generated assembly, add the -fverbose-asm option:

```bash
gcc -S -O3 -fverbose-asm ../src/pivot.c -o pivot_O1.s
```
```asm
.L15:
# ../src/pivot.c:38: double dis = fabs(rebuiltCoordFirst - rebuiltCoordSecond);
    movsd   (%rax), %xmm0           # MEM[base: _15, offset: 0B], MEM[base: _15, offset: 0B]
    subsd   (%rax,%rdx,8), %xmm0    # MEM[base: _15, index: _21, step: 8, offset: 0B], tmp226
    addq    $8, %rax                #, ivtmp.66
# ../src/pivot.c:38: double dis = fabs(rebuiltCoordFirst - rebuiltCoordSecond);
    andpd   %xmm2, %xmm0            # tmp235, dis
    maxsd   %xmm1, %xmm0            # chebyshev, dis
    movapd  %xmm0, %xmm1            # dis, chebyshev
# ../src/pivot.c:35: for(ki=0; ki<k; ki++){
    cmpq    %rax, %rcx              # ivtmp.66, _115
    jne .L15                        #,
.L19:
# ../src/pivot.c:32: for(j=i+1; j<n; j++){
    addl    $1, %esi                #, j
# ../src/pivot.c:41: chebyshevSum += chebyshev;
    addsd   %xmm1, %xmm4            # chebyshev, <retval>
    addl    %r14d, %edi             # k, ivtmp.75
# ../src/pivot.c:32: for(j=i+1; j<n; j++){
    cmpl    %esi, %r15d             # j, n
    jg  .L13                        #,
# ../src/pivot.c:32: for(j=i+1; j<n; j++){
    addl    $1, %r10d               #, j
# ../src/pivot.c:32: for(j=i+1; j<n; j++){
    cmpl    %r10d, %r15d            # j, n
    jne .L16                        #,
```

VTune assembly analysis with -O3

I originally thought that with -O3 you could no longer map the assembly back to the source, but in fact -g and -O3 do not conflict.

Instruction simplification and merging

  1. Merging of memory-access instructions
    1. r9 is moved into rax:
      1. leaq (%r12,%r8,8), %r9: r12 is rebuiltCoord, so r8 originally holds the value of [i*k]
      2. rax is the address rebuiltCoord + [i*k]; since it only depends on i, this index computation is hoisted to the outer loop.
    2. r8 is subtracted from rdx and the result kept in rdx:
      1. rdx originally holds the address of [j*k]
      2. r8 originally holds the value of [i*k]
      3. afterwards rdx holds the address of [(j-i)*k]
    3. data16 nop is a nop inserted for alignment

    1. Note that the max operation has become maxsd
    2. xmm0 caches the intermediate value
    3. xmm1 is chebyshev
    4. xmm2 is the fabs mask
    5. xmm4 is chebyshevSum

Automatic loop unrolling to form a pipeline

    1. r14d holds the value of k, so edi holds j*k
    2. the instructions after Block 22 confirm that rdx originally held the address of [j*k]

    1. the outermost loop
    2. since r14d holds the value of k, r8 and r11d hold the value of i*k

No such transformation is visible in the assembly; the corresponding compiler option needs to be enabled.

Automatic vectorization

No such transformation is visible in the assembly; the corresponding compiler option needs to be enabled.

Automatic data prefetching

No such transformation is visible in the assembly; the corresponding compiler option needs to be enabled.

Questions

Why does the summation take so much time?

Adding vectorization options

gcc

Baseline

-mavx2 -march=core-avx2

  1. Reading the documentation: although everything has turned into vmov/vadd-style instructions, they still do only 64-bit work.
    1. This shows in the loop stride: the pointer increment stays at add rax, 0x8 instead of growing to a full vector width.
    2. But isn't AVX2 supposed to be 256-bit vectorization? Yet the registers used are still xmm0 and friends.
```
VADDSD (VEX.128 encoded version)
DEST[63:0] := SRC1[63:0] + SRC2[63:0]
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

ADDSD (128-bit Legacy SSE version)
DEST[63:0] := DEST[63:0] + SRC[63:0]
DEST[MAXVL-1:64] (Unmodified)
```

-march=skylake-avx512

The assembly looks unchanged on the surface, but it is about 10 s faster (49 s -> 39 s).

The figure below is the avx2 build.

The figure below is the avx512 build.

My guesses at the reason:

  1. the nop instructions leave the code misaligned
  2. it is unlikely to be related to the instruction order in the red box

Adding data-prefetch options

Check whether the machine supports it:

```bash
lscpu | grep pref
3dnowprefetch   // 3DNow prefetch instructions
```

So it should be supported.
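
The exact compiler switch used is not named here; presumably it is GCC's -fprefetch-loop-arrays, something like:

```bash
# Assumed invocation: ask GCC to emit software prefetches for array accesses in loops
gcc -O3 -march=core-avx2 -fprefetch-loop-arrays ../src/pivot.c -o pivot -lm
```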

Assembly analysis

The runtime barely changed, mainly because no prefetching was applied to the main loop; the other loops (with a small share of the hotspot time) were rearranged, with prefetch instructions added as shown in the figure below.

Adding loop-unrolling options

Much slower (39 s -> 55 s).

-funroll-loops

In the generated assembly, the innermost loop jumps directly to the unrolled block matching the value of k; here k is 2.

By default it unrolls 8 times, which is probably related to the total number of xmm registers.

Analysis of the cause

  1. The point of loop unrolling is to form a pipeline of computation and memory accesses
    1. not just to save a few jump instructions
    2. Simply stacking copies of the loop body like this does not form a pipeline, so the runtime does not drop.
  2. But that completely fails to explain why the loop-control time increased
    1. e.g. the number of cmp executions in the figure should have halved, yet their time doubled

Manual blocking

Since the data already fits entirely in L1, there was no improvement.

Manual data prefetching

It did not form the prefetch pipeline I had imagined: it fetches 512 bits at a time, with overlap between fetches.


Each prefetch covers one cache line, and the data prefetched by the next two instructions partially overlaps (runtime increased, 39 s -> 61 s).


The intent was to prefetch everything; each loop iteration prefetches 512 bits = 64 bytes.
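
Roughly what I was aiming for, as a sketch of my own (not the exact code from this experiment): issue one prefetch per 64-byte cache line (8 doubles) instead of one per element, so consecutive prefetches do not overlap. AHEAD is an assumed prefetch distance.

```c
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

#define AHEAD 64         /* prefetch distance in elements; a guess, to be tuned */

for (j = i + 1; j < n; j++) {
    if ((j & 7) == 0) {  /* 8 doubles = one 64-byte line: prefetch once per line */
        _mm_prefetch((const char *)&rebuiltCoord[0 * n + j + AHEAD], _MM_HINT_T0);
        _mm_prefetch((const char *)&rebuiltCoord[1 * n + j + AHEAD], _MM_HINT_T0);
    }
    /* ... original loop body ... */
}
```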

Manual vectorization

avx2

(Written this way, the compiler can conveniently auto-unroll it and use all of the vector registers; this is the avx2 version.)

39 s -> 10 s -> 8.4 s (the last step comes from the compiler's further unrolling)

```c
for(i=0; i<n-blockSize; i+=blockSize){
    for(j=i+blockSize; j<n-blockSize; j+=blockSize){
        for(ii=i; ii<i+blockSize; ii++){
            __m256d vi1 = _mm256_broadcast_sd(&rebuiltCoord[0*n+ii]);
            __m256d vi2 = _mm256_broadcast_sd(&rebuiltCoord[1*n+ii]);

            __m256d vj11 = _mm256_loadu_pd(&rebuiltCoord[0*n+j]);   // load 4 points
            __m256d vj12 = _mm256_loadu_pd(&rebuiltCoord[1*n+j]);

            __m256d vj21 = _mm256_loadu_pd(&rebuiltCoord[0*n+j+4]); // load 4 more points
            __m256d vj22 = _mm256_loadu_pd(&rebuiltCoord[1*n+j+4]);

            vj11 = _mm256_and_pd(_mm256_sub_pd(vi1,vj11), vDP_SIGN_Mask);
            vj12 = _mm256_and_pd(_mm256_sub_pd(vi2,vj12), vDP_SIGN_Mask);

            vj21 = _mm256_and_pd(_mm256_sub_pd(vi1,vj21), vDP_SIGN_Mask);
            vj22 = _mm256_and_pd(_mm256_sub_pd(vi2,vj22), vDP_SIGN_Mask);

            __m256d tmp = _mm256_add_pd(_mm256_max_pd(vj11,vj12), _mm256_max_pd(vj21,vj22));
            _mm256_storeu_pd(vchebyshev1, tmp);

            chebyshevSum += vchebyshev1[0] + vchebyshev1[1] + vchebyshev1[2] + vchebyshev1[3];

            // for(jj=j; jj<j+blockSize; jj++){
            //     double chebyshev = 0;
            //     int ki;
            //     for(ki=0; ki<k; ki++){
            //         double dis = fabs(rebuiltCoord[ki*n + ii] - rebuiltCoord[ki*n + jj]);
            //         chebyshev = dis>chebyshev ? dis : chebyshev;
            //     }
            //     chebyshevSum += chebyshev;
            // }
        }
    }
}
```

I only unrolled it once by hand, but the compiler kept unrolling, 8 times in total, using all 16 YMM vector registers.

The figure below is the avx512 build; register ymm26 even appears.

vhaddpd is the horizontal (within-vector) add instruction.

avx512

With avx512 and a 4x manual unroll, the generated code is quite tidy.

  1. The vectors go up to register ymm18, so I estimate it could only be unrolled up to about 6 times.
    1. With avx2 the registers would probably run out.

For the final summation, the compiler recognized that no actual store is needed: the computation stays in registers, and with three adds plus two data-movement instructions it automatically builds a binary-tree reduction.

With avx2, when registers run out, the situation below appears.

Faster AVX reduction for the sum

It would be nice if the hardware could reduce four lanes in one go, but that is probably too complex for the underlying circuitry.


```c
__m256d _mm256_hadd_pd (__m256d a, __m256d b);
VEXTRACTF128  __m128d _mm256_extractf128_pd (__m256d a, int offset);
```

If this could be used, it would save one data movement and one add. I did not analyze the register dependencies of the two variants; the dependency-chain lengths may well be the same, which could be why the "optimized" version actually got slightly slower.
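
For reference, a minimal horizontal-sum helper of my own (not from the original post) built on these two intrinsics; this is the reduction pattern being discussed:

```c
#include <immintrin.h>

/* Horizontal sum of the 4 doubles in v. hadd adds adjacent pairs within each
 * 128-bit half; the two halves are then extracted and added together. */
static inline double hsum256_pd(__m256d v) {
    __m256d h  = _mm256_hadd_pd(v, v);        /* (v0+v1, v0+v1, v2+v3, v2+v3) */
    __m128d lo = _mm256_castpd256_pd128(h);   /* (v0+v1, v0+v1) */
    __m128d hi = _mm256_extractf128_pd(h, 1); /* (v2+v3, v2+v3) */
    return _mm_cvtsd_f64(_mm_add_sd(lo, hi)); /* v0+v1+v2+v3 */
}
```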

For int there is also an implementation like the one shown below.

Hoist all the horizontal reductions out of the loop,

and turn the unrolling over j into unrolling over i.

Manual vectorization + manual loop unrolling?

In favor: it breaks the barrier between loop iterations. The compiler can recognize useless intermediate variables; within the basic block delimited by the loop's jump, instructions execute out of order, and register renaming forms the densest possible compute/memory pipeline.

Against: if the compiler burns too many resources to pipeline one particular instruction, it has to spill other results (e.g. when vector registers run out, extra write-back instructions and latency appear).

The ideal balance: unroll as far as possible without hitting a resource bottleneck.

Example in favor

After manual unrolling, the compiler recognized that the contiguous memory accesses should be issued together and scheduled them automatically; the +1 offsets were pre-computed by the compiler.

If the body is written as a macro (#define), you can see the compiler automatically reorders the assembly.

Example against

With avx2 you can see spill operations: a value is read back from memory and pushed onto the stack.

When there are enough registers, this problem does not occur.

Finding the ideal unroll factor

Different code uses the vector registers differently, and different machines have different numbers of vector registers and other resources, so the assembly alone is hard to judge. After writing the single-iteration body, the best unroll factor has to be measured by hand. As the figure below shows, 6 appears to be the largest unroll factor that still avoids a resource bottleneck and gives maximum pipelining.

```c
for(j=beginJ; j<n-jBlockSize; j+=jBlockSize){
    // unrolled jBlockSize times
}
for(jj=j; jj<n; jj++){ // jj starts from the final j of the loop above
    // plain single iteration (remainder loop)
}
```

Because execution inside a basic block is out of order, the order of the statements does not matter much; and since register renaming is what builds the pipeline, the register names do not matter much either. The data dependencies, of course, still have to be correct.

Double manual unrolling of a two-level loop nest

Idea: in the outer loop, load more data into registers, but at no point during execution exceed the number of available registers (pay special attention to the moment the inner loop runs through to its end).

In the left figure the outer loop loads into 8 registers, while on the right only 2.

Pay special attention to the moment the inner loop runs through to its end:

as the figure shows, the yellow box already uses 16.

Note that the load intrinsics also differ in speed,

so in the inner loop, which runs most often, prefer the fast ones.

```
_mm256_loadu_ps >> _mm256_broadcast_ss > _mm256_set_epi16
0.04 >> 0.5
```
```
vsub  vmax    ps 0.02      Latency 4
vand                       Latency 1

vadd ps 0.80               Throughput 0.5
vhadd                      Latency 7
vcvtps2pd 2.00             Latency 7
vextractf128 0.50          Latency 3
```

| Instruction | Precision | Time (from throughput, latency and actual dependencies) | Latency | Throughput |
|---|---|---|---|---|
| _mm256_loadu_ps / _mm256_broadcast_ss | | | 7 | 0.5 |
| vsub, vmax | ps | 0.02 | 4 | 0.5 |
| vand | | 0.02 | 1 | 0.33 |
| vadd | ps | 0.80 | 4 | 0.5 |
| vhadd | | 0.8 | 7 | 2 |
| vcvtps2pd | | 2.00 | 7 | 1 |
| vextractf128 | | 0.50 | 3 | 1 |

Converting the vectorized code from double to single precision gave no improvement

17 AVX compute instructions, 5 loads, 2 cvt, 2 extract

| | AVX compute | load | cvt | extract |
|---|---|---|---|---|
| unit time | 2.33 | 3.68 | 12.875 | 4.1 |

Clearly the type conversions are quite costly. They are best kept outside the loop; if precision is insufficient, convert once every few iterations instead.

GCC compiler optimizations

-march=skylake-avx512 produces one instruction,

while -mavx2 produces two:

```asm
vmovupd xmm7, xmmword ptr [rdx+rsi*8]
vinsertf128 ymm1, ymm7, xmmword ptr [rdx+rsi*8+0x10], 0x1
```

The reason is that splitting the unaligned access this way can be faster on older microarchitectures.

Does -O3 still speed up code whose kernel is already hand-vectorized?

Removing -O3 from the IPCC preliminary-round code still makes it 10x slower.

Why do even the intrinsic/assembly function calls get so much slower?

Without -O3 the compiler is honestly a bit dumb here: the two operands of a single instruction keep getting shuffled back and forth through the stack frame at rbp.

Further study needed

None for now

Problems encountered

None for now

Motivation, summary, reflections, rants~~

References

IPCC Preliminary SLIC Analysis part3 : Hot spot analysis

vtune hotspots




vtune threading



GNU profiling: gprof + gprof2dot + graphviz

```bash
g++ -pg -g -std=c++11 SLIC.cpp -o SLIC
./SLIC # generate gmon.out
less gmon.out
"gmon.out" may be a binary file.  See it anyway?
gprof ./SLIC
gprof ./SLIC | /home/shaojiemike/github/isc21-gpaw/LogOrResult/profile/gprof2dot.py -n0 -e0 | dot -Tpng -o output.png
```



Not very useful.

Next steps

  1. Vectorization
  2. Parallelization

When to parallelize with OpenMP and when with MPI

It depends on the available resources. Since this appears to be a single node, OpenMP is the place to start; see the sketch below.
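
As a starting point, a minimal OpenMP sketch of my own (the loop body and kernel name are placeholders, not the project's actual code):

```c
#include <omp.h>

/* Parallelize the outer loop across the node's cores, with a reduction on the
 * accumulated sum. Build with: gcc -O3 -fopenmp ... */
double sum = 0.0;
#pragma omp parallel for reduction(+:sum) schedule(static)
for (int i = 0; i < n; i++) {
    sum += heavy_kernel(i);   /* placeholder for the per-i work */
}
```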

Automatic parallelization

The Intel compiler's auto-parallelization can automatically turn parts of a serial program into threaded code. The main steps are: find candidate loops with good worksharing; run dataflow analysis on the loops to confirm that parallel execution produces correct results; and generate the threaded code using OpenMP directives.

/Qparallel: lets the compiler auto-parallelize

/Qpar-reportn: n is 0, 1, 2 or 3; emits the auto-parallelization report

Note: /Qparallel only takes effect together with O2/O3.
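
These are the Windows-style /Q switches; on Linux the spellings are presumably the dash forms, along the lines of:

```bash
# Assumed Linux equivalents; check `icc -help` for the exact report options of your version
icc -O3 -parallel -qopt-report=3 -qopt-report-phase=par pivot.c -o pivot
```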

How to vectorize C++ code

What is vectorization

Simply put, vectorization means optimizing a program with SIMD vector instructions (SSE, SSE2, and so on); it belongs to the realm of data parallelism.

How to vectorize code

The goal of vectorization is to generate SIMD instructions, so clearly there are two ways to vectorize code:

first, rely on the compiler to generate these instructions;

second, write assembly or use intrinsic functions.

The auto-vectorizer

In the Intel compiler, the auto-vectorizer analyzes the code and generates SIMD instructions. The compiler also provides pragmas and similar mechanisms so that users can restructure the code to help it vectorize.

  1. Basic vectorization
    /Qvec: enables auto-vectorization and requires O2 or above. At O2 and above it is the default vectorization option and is enabled automatically. The generated code runs on both Intel and non-Intel processors. Vectorization may also be affected by other options. Since it is on by default, there is no need to add it on the command line.

  2. Vectorization targeting a specific instruction set (processor)
    /QxHost: optimize with the best instruction set available on the current host processor.

For a doubly nested loop, the outer loop gets auto-parallelized while the inner loop does not; instead, the inner loop gets auto-vectorized.

Factors that affect vectorization

  1. First, of course, whether the instruction set supports it.
  2. Memory-alignment issues also matter: many SSE instructions require 16-byte alignment, and if the data is not aligned, vectorization can go wrong (see the allocation sketch below).
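
A small allocation sketch of my own (not from the original code) showing one way to get the alignment mentioned in point 2; posix_memalign with a 32-byte alignment suits the 256-bit AVX loads used earlier:

```c
#include <stdlib.h>

/* Allocate n*k doubles aligned to 32 bytes so the aligned load variant
 * (_mm256_load_pd) could be used instead of _mm256_loadu_pd. */
double *rebuiltCoord = NULL;
if (posix_memalign((void **)&rebuiltCoord, 32, (size_t)n * k * sizeof(double)) != 0) {
    /* handle allocation failure */
}
/* ... use rebuiltCoord ... */
free(rebuiltCoord);
```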

How to tell whether vectorization succeeded

Look at the assembly.
If it did not succeed, do we have to hand-write inline vectorized assembly???
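
Besides reading the assembly, the compilers' own vectorization reports answer this; a couple of example invocations (flag spellings should be double-checked against your compiler version):

```bash
# GCC: report which loops were vectorized and which were not, and why
gcc -O3 -march=native -fopt-info-vec -fopt-info-vec-missed pivot.c -o pivot

# Intel compiler: optimization report focused on the vectorization phase
icc -O3 -qopt-report=2 -qopt-report-phase=vec pivot.c -o pivot
```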

Intel compiler vectorization

AMD compiler vectorization

Differences between the AMD and Intel compilers

Further study needed

None for now

Problems encountered

None for now

References

https://blog.csdn.net/gengshenghong/article/details/7027186

https://blog.csdn.net/gengshenghong/article/details/7034748

https://blog.csdn.net/gengshenghong/article/details/7022459