Posted 2023-07-14 | Updated 2025-01-30 | Tutorials | 20 minutes read (About 3059 words)

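VTune Installation and Profiling

Usage

snode0 has sudo access, so launch VTune there:

source /opt/intel/oneapi/setvars.sh
sudo vtune-gui

If the GUI fails to open in MobaXterm after sudo, see this reference for the cause.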

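Step 1: Performance Snapshot metrics explained

The example is the baseline of the IPCC 2022 preliminary-round pivot (support point) computation problem.

Logical Core Utilization

Effective Logical Core Utilization: 3.8% (2.436 out of 64)
Effective Physical Core Utilization: 6.4% (2.053 out of 32)

CPU utilization here means the share of CPU time spent on effective computation. 100% means every logical CPU is fully occupied by the application's computation.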
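The percentages above can be reproduced from the "x out of y" figures. A minimal sketch, assuming the metric is simply the average number of busy cores divided by the cores available (the function name is mine, not VTune's):

```cpp
// Utilization in percent: average number of effectively busy cores
// divided by the number of cores available.
double utilization_percent(double busy_cores, int total_cores) {
    return 100.0 * busy_cores / total_cores;
}
```

With the snapshot values, 2.436 out of 64 gives about 3.8% and 2.053 out of 32 gives about 6.4%, so on average only two to three cores do useful work.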

Microarchitecture Usage

The Microarchitecture Usage metric is a key indicator that estimates (in %) how efficiently your code runs on the current microarchitecture.

Microarchitecture usage can be affected by:

long-latency memory, floating-point, or SIMD operations;
non-retired instructions due to branch mispredictions;
instruction starvation in the front-end.

VTune's recommendations

Microarchitecture Usage: 37.7% of Pipeline Slots
    Retiring: 37.7%
    Front-End Bound: 16.9%
    Back-End Bound: 23.8%
    Memory Bound: 11.9%
    Core Bound: 11.9%
    Bad Speculation: 21.5%
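A quick consistency check helps when reading this breakdown. The sketch below assumes the usual top-down reading: the four top-level categories partition all pipeline slots, and Memory Bound plus Core Bound split Back-End Bound. The struct and function names are mine.

```cpp
struct TopDown {
    double retiring, front_end_bound, back_end_bound, bad_speculation;  // top level, in %
    double memory_bound, core_bound;                                    // children of Back-End Bound
};

// Sum of the four top-level categories; expected to be close to 100%.
double level1_total(const TopDown& t) {
    return t.retiring + t.front_end_bound + t.back_end_bound + t.bad_speculation;
}

// Back-End Bound minus the sum of its two children; expected to be close to 0.
double back_end_residual(const TopDown& t) {
    return t.back_end_bound - (t.memory_bound + t.core_bound);
}

// Values copied from the snapshot above.
const TopDown kSnapshot{37.7, 16.9, 23.8, 21.5, 11.9, 11.9};
```

For this snapshot the top level adds up to 99.9% (rounding), and 11.9% + 11.9% equals the 23.8% reported for Back-End Bound.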

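For Back-End Bound: 23.8%, VTune says:

A significant portion of pipeline slots are remaining empty.
(Question: does this mean 23.8% of the slots are empty, or in use? In the top-down breakdown the four top-level categories partition all pipeline slots, so 23.8% is the share of slots that did no useful work because the back-end was stalled.)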

When operations take too long in the back-end, they introduce bubbles in the pipeline that ultimately cause fewer pipeline slots containing useful work to be retired per cycle than the machine is capable to support.

This opportunity cost results in slower execution.

Long-latency operations like divides and memory operations can cause this,
as can too many operations being directed to a single execution port (for example, more multiply operations arriving in the back-end per cycle than the execution unit can support).

For Bad Speculation: 21.5%, VTune says:

A significant proportion of pipeline slots (21.5%) containing useful work are being cancelled.

This can be caused by mispredicting branches or by machine clears. Note that this metric value may be highlighted due to Branch Resteers issue.

Retiring metric

Retiring metric represents a Pipeline Slots fraction utilized by useful work, meaning the issued uOps that eventually get retired.

Ideally, all Pipeline Slots would be attributed to the Retiring category.

Retiring of 100% would indicate the maximum possible number of uOps retired per cycle has been achieved.

Maximizing Retiring typically increases the Instruction-Per-Cycle (IPC) metric.

Note that a high Retiring value does not necessarily mean no more room for performance improvement.
For example, Microcode assists are categorized under Retiring. They hurt performance and can often be avoided.

According to Intel, microcode assists work as follows:

When a special computation is encountered (for example, handling very small floating-point values, so-called denormals), the floating-point unit is not set up to execute the operation natively. A small routine, possibly hundreds of instructions long, has to be inserted into the instruction stream, which can hurt performance badly.
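To see whether denormals are even present in a hot array, the values can be classified with the standard library. This is a minimal sketch; the helper names are mine, and whether the baseline discussed here actually produces denormals is not something the snapshot shows.

```cpp
#include <cmath>
#include <cstddef>

// True if x is a subnormal (denormal) double: non-zero, but smaller in
// magnitude than the smallest normal double.
bool is_subnormal(double x) {
    return std::fpclassify(x) == FP_SUBNORMAL;
}

// Counts subnormal values in a buffer. A non-zero count on data used in a
// hot loop is a hint that microcode assists may be involved.
std::size_t count_subnormals(const double* data, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (is_subnormal(data[i])) ++count;
    }
    return count;
}
```

If such values do show up, the usual options are to rescale the data or to enable flush-to-zero (the -ftz compiler flag appears later in these notes).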

Front-End Bound

Front-End Bound metric represents a slots fraction where the processor’s Front-End undersupplies its Back-End. In other words, it indicates whether the front-end delivers enough uOps to keep the back-end busy.

Front-End denotes the first part of the processor core responsible for fetching operations that are executed later on by the Back-End part. The front-end decodes instructions into uOps for the back-end to process.

Within the Front-End, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uOps).

Front-End Bound metric denotes unutilized issue-slots when there is no Back-End stall (bubbles where Front-End delivered no uOps while Back-End could have accepted them). For example, stalls due to instruction-cache misses would be categorized as Front-End Bound.

Back-End Bound

The Back-End Bound metric represents a Pipeline Slots fraction where no uOps are being delivered due to a lack of required resources for accepting new uOps in the Back-End. In other words, it indicates whether uOps stall because back-end hardware resources are exhausted.

Back-End is the portion of the processor core where an out-of-order scheduler dispatches ready uOps into their respective execution units, and, once completed, these uOps get retired according to the program order. That is, the back-end follows an out-of-order execution, in-order retirement model.

For example, stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized as Back-End Bound. Back-End Bound is further divided into two main categories: Memory Bound and Core Bound.

Memory Bound

This metric shows how memory subsystem issues affect the performance. Memory Bound measures a fraction of slots where pipeline could be stalled due to demand load or store instructions. This accounts mainly for incomplete in-flight memory demand loads that coincide with execution starvation in addition to less common cases where stores could imply back-pressure on the pipeline.

Core Bound

This metric represents how much Core non-memory issues were of a bottleneck.

Shortage in hardware compute resources, or dependencies in software’s instructions are both categorized under Core Bound.

Hence it may indicate

the machine ran out of an OOO resource,
certain execution units are overloaded
or dependencies in program’s data- or instruction- flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).

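Bad Speculation (e.g. branch misprediction)

Bad Speculation represents a Pipeline Slots fraction wasted due to incorrect speculations.

This includes slots used to issue uOps that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from an earlier incorrect speculation.

For example, wasted work due to mispredicted branches is categorized as a Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.

About the "Nukes" here: my guess was that a wrong data-prefetch prediction hits memory access as hard as a nuclear blast. In Intel's terminology a memory ordering nuke is a machine clear: the pipeline is flushed and restarted after a memory-ordering violation is detected.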

Memory Bound

Memory Bound: 11.9% of Pipeline Slots
    L1 Bound: 7.9%
    L2 Bound: 0.2%
    L3 Bound: 2.5%
    DRAM Bound: 2.0%
    Store Bound: 0.3%
    NUMA: % of Remote Accesses: 13.2%

This metric shows how memory subsystem issues affect the performance. Memory Bound measures a fraction of slots where pipeline could be stalled due to demand load or store instructions.

This accounts mainly for incomplete in-flight memory demand loads that coincide with execution starvation, in addition to less common cases where stores could imply back-pressure on the pipeline.
(Does this refer to non-contiguous memory access? Not necessarily: as I read it, it counts slots where demand loads are still outstanding while the execution units have nothing to run, regardless of the access pattern.)

L1 Bound

This metric shows how often machine was stalled without missing the L1 data cache.
(So the stall has some cause other than an L1 miss?)

The L1 cache typically has the shortest latency. However, in certain cases like loads blocked on older stores, a load might suffer a high latency even though it is being satisfied by the L1. For example, a load of a value that was just stored can still see a high latency.

L2 Bound

This metric shows how often machine was stalled on L2 cache. Avoiding cache misses (L1 misses/L2 hits) will improve the latency and increase performance.

L3 Bound

This metric shows how often CPU was stalled on L3 cache, or contended with a sibling Core. Avoiding cache misses (L2 misses/L3 hits) improves the latency and increases performance.

DRAM Bound

This metric shows how often CPU was stalled on the main memory (DRAM). Caching typically improves the latency and increases performance.

DRAM Bandwidth Bound

This metric represents percentage of elapsed time the system spent with high DRAM bandwidth utilization. Since this metric relies on the accurate peak system DRAM bandwidth measurement, explore the Bandwidth Utilization Histogram and make sure the Low/Medium/High utilization thresholds are correct for your system. You can manually adjust them, if required.

Store Bound

This metric shows how often CPU was stalled on store operations. Even though memory store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls.

NUMA: % of Remote Accesses

In NUMA (non-uniform memory architecture) machines, memory requests missing LLC may be serviced either by local or remote DRAM. Memory requests to remote DRAM incur much greater latencies than those to local DRAM. It is recommended to keep as much frequently accessed data local as possible. This metric shows percent of remote accesses, the lower the better.
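One common way to lower the remote-access share is first-touch initialization. The sketch below assumes the default Linux first-touch page placement and threads that stay pinned (for example with OMP_PROC_BIND=true); the function names are mine. Without OpenMP enabled the pragmas are ignored and the code runs serially.

```cpp
#include <cstddef>
#include <memory>

// Allocates n doubles without writing to them, then initializes them in a
// parallel loop. Under a first-touch policy each page is placed on the NUMA
// node of the thread that writes it first.
std::unique_ptr<double[]> alloc_first_touch(std::size_t n, double value) {
    std::unique_ptr<double[]> data(new double[n]);  // allocation only, no initialization
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < static_cast<long>(n); ++i) {
        data[i] = value;
    }
    return data;
}

// Compute loops that reuse the same static schedule mostly read pages
// that the same thread touched first.
double sum_static(const double* data, std::size_t n) {
    double total = 0.0;
    #pragma omp parallel for schedule(static) reduction(+ : total)
    for (long i = 0; i < static_cast<long>(n); ++i) {
        total += data[i];
    }
    return total;
}
```

The important design point is that initialization and computation use the same loop bounds and the same static schedule; initializing the array serially would place every page on one node.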

Could reuse the earlier …

Vectorization

This metric represents the percentage of packed (vectorized) floating point operations. 0% means that the code is fully scalar. The metric does not take into account the actual vector length that was used by the code for vector instructions. So if the code is fully vectorized and uses a legacy instruction set that loaded only half a vector length, the Vectorization metric shows 100%.

Vectorization: 23.7% of Packed FP Operations
    Instruction Mix:
        SP FLOPs: 0.9%
            Packed: 99.9%
                128-bit: 0.1%
                256-bit: 99.8%
                512-bit: 0.0%
            Scalar: 0.1%
        DP FLOPs: 2.9%
            Packed: 0.0%
            Scalar: 100.0%
        x87 FLOPs: 0.0%
        Non-FP: 96.2%
    FP Arith/Mem Rd Instr. Ratio: 0.091
    FP Arith/Mem Wr Instr. Ratio: 0.308

For Vectorization: 23.7%, VTune says:

A significant fraction of floating point arithmetic instructions are scalar. Use Intel Advisor to see possible reasons why the code was not vectorized.
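The 23.7% headline can be reproduced from the instruction mix. This is my reading of how the numbers relate, not VTune's documented formula: packed operations divided by all floating-point operations.

```cpp
// Share of packed operations among all floating-point operations, in %.
// sp, dp, x87: share of each FP class among all operations (in %).
// sp_packed, dp_packed: packed share within the SP and DP classes (in %).
double packed_fp_percent(double sp, double sp_packed,
                         double dp, double dp_packed, double x87) {
    const double packed = sp * sp_packed / 100.0 + dp * dp_packed / 100.0;
    const double all_fp = sp + dp + x87;
    return 100.0 * packed / all_fp;
}
```

With SP 0.9% (99.9% packed), DP 2.9% (0% packed) and x87 0%, this gives about 23.7%. The low value therefore comes from the scalar double-precision code, not from the single-precision part, which is already almost fully packed.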

SP FLOPs

The metric represents the percentage of single precision floating point operations from all operations executed by the applications. Use the metric for rough estimation of a SP FLOP fraction. If FMA vector instructions are used the metric may overcount.

X87 FLOPs

The metric represents the percentage of x87 floating point operations from all operations executed by the applications. Use the metric for rough estimation of an x87 fraction. If FMA vector instructions are used the metric may overcount.

x87 is the floating-point subset of the x86 instruction set. It originated as an extension to the 8086 instruction set, in the form of optional floating-point coprocessors that worked alongside the corresponding x86 CPUs. The names of those chips end in "87".

FP Arith/Mem Rd Instr. Ratio

This metric represents the ratio between arithmetic floating point instructions and memory read instructions. A value less than 0.5 indicates unaligned data access for vector operations, which can negatively impact the performance of vector instruction execution.

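Step 2: Hotspots

User-Mode Sampling can only collect data for a single core; use it to analyze algorithmic optimizations.

Hardware Event-Based Sampling can collect data from all cores, but the run apparently has to be shorter than a few seconds (?)

Hardware sampling is slow, and it failed with an error halfway through. What happened?

Online sources blame missing root privileges, but I was running as root.

Oddly, Hardware Event-Based Sampling and Microarchitecture Exploration run fine as a regular user.

example

Manually vectorize this region.

The hot loop performs $k*n^2$ absolute-value computations, each followed by a max.

Optimization ideas:

Manual vectorization (assume p elements are processed at a time)

In the outer loop over n, load the k values rebuilt[i*k+ki] and broadcast each into a vector register.

In the inner loop over n, load p consecutive elements for each of the k coordinates into vector registers. A tail that does not fill a vector needs zero padding as a special case. n is usually a multiple of 4, so with 4 lanes this may be unnecessary, but with 8 lanes it has to be handled.

Keep the vector fabs results in k vector registers.

Reduce these k registers with an element-wise vector max into one register. Lanes padded with 0 do not affect the max.

Finally sum the lanes of that one register and add the result to chebyshevSum.

This processes p elements per step and needs 3*k vector registers per pass.
Manual data prefetching
1. __builtin_prefetch()
Manual loop unrolling to pipeline computation and memory access
1. How should the unroll factor be chosen from the input size?
Blocking (tiling)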
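The vectorization plan above can be sketched in plain C++ before reaching for intrinsics. Several things here are assumptions of mine: the pair loop runs over j < i, P = 4 lanes stand in for one vector register, and a transposed copy of rebuilt is made so that P consecutive j values of one coordinate are contiguous. Plain arrays are used in place of real vector registers.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr std::size_t P = 4;  // lanes per block (assumption: 4 doubles, as with 256-bit vectors)

// rebuilt is row-major: rebuilt[i*k + ki] is coordinate ki of point i.
// Returns the sum over all pairs j < i of max over ki of
// |rebuilt[i*k + ki] - rebuilt[j*k + ki]|.
double chebyshev_sum(const std::vector<double>& rebuilt, std::size_t n, std::size_t k) {
    // Transposed copy: t[ki*n + j] makes P consecutive j values contiguous.
    std::vector<double> t(k * n);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t ki = 0; ki < k; ++ki)
            t[ki * n + j] = rebuilt[j * k + ki];

    double sum = 0.0;
    for (std::size_t i = 1; i < n; ++i) {
        for (std::size_t j0 = 0; j0 < i; j0 += P) {
            const std::size_t lanes = std::min(P, i - j0);  // the tail block is shorter
            double best[P] = {};                            // unused lanes stay 0
            for (std::size_t ki = 0; ki < k; ++ki) {
                const double xi = rebuilt[i * k + ki];      // value that would be broadcast
                for (std::size_t l = 0; l < lanes; ++l) {
                    const double d = std::fabs(xi - t[ki * n + j0 + l]);
                    best[l] = std::max(best[l], d);         // element-wise max across ki
                }
            }
            for (std::size_t l = 0; l < P; ++l) sum += best[l];  // horizontal sum
        }
    }
    return sum;
}
```

For three points (0,0), (1,3), (4,1) the pairwise Chebyshev distances are 3, 4 and 3, so the function returns 10. Each inner lane loop maps to one vector subtract, fabs and max, and the zero-initialized lanes play the role of the zero padding described above.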

Memory access analysis

Corresponding GitHub project and contest problem

HPL-PL

Reproduction machine

$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          36
On-line CPU(s) list:             0-35
Thread(s) per core:              1
Core(s) per socket:              18
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           79
Model name:                      Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
Stepping:                        1
CPU MHz:                         1296.157
CPU max MHz:                     3300.0000
CPU min MHz:                     1200.0000
BogoMIPS:                        4199.98
Virtualization:                  VT-x
L1d cache:                       1.1 MiB
L1i cache:                       1.1 MiB
L2 cache:                        9 MiB
L3 cache:                        90 MiB

baseline

$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
$ gcc -std=c11 conway.c -o Conway
$ ./Conway
……
Iter 997...
Iter 998...
Iter 999...
136527.433000 ms

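Optimization steps

Because -O3 and parallelization make the hotspot code unreadable,

first maximize single-core performance under VTune's guidance, on a case that allows iterative optimization.

This is clearly not a compute-bound application. The key questions are how to pipeline memory accesses to maximize bandwidth utilization, and how to partition the data so that elements are reused and the cache hit rate improves (vectorization speeds up the computation part noticeably).

Replace the if with tmp[i][j] = (!(cnt^3))||((a[i][j]&1)&&(!(cnt^4)));
Remove unnecessary intermediate copies
Change int to char
OMP_PROC_BIND=true to bind threads to their local processors and local memory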
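The if-free update rule in the first item can be checked against the textbook rule. The sketch assumes cnt counts the live cells in the 3x3 block including the cell itself, which is what makes 3 and 4 the relevant values; the function names are mine.

```cpp
// cnt: number of live cells in the 3x3 block centred on the cell, including
// the cell itself (assumption). The lowest bit of cell is its current state.
inline int next_state_expr(int cell, int cnt) {
    return (!(cnt ^ 3)) || ((cell & 1) && (!(cnt ^ 4)));
}

// Reference version written with explicit branches on the neighbour count.
inline int next_state_reference(int cell, int cnt) {
    const int alive = cell & 1;
    const int neighbours = cnt - alive;
    if (alive) return (neighbours == 2 || neighbours == 3) ? 1 : 0;
    return neighbours == 3 ? 1 : 0;
}
```

A block count of 3 means either a dead cell with three neighbours or a live cell with two, and both end up alive. A count of 4 keeps the cell alive only if it already was, which is the live-cell-with-three-neighbours case.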

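Further study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, and rants

My labmate Huang Yeqi took part in the HPC-PL All-Stars contest, and I wanted to reproduce the results.
I enjoyed using Nvidia Nsight before, and realized I had rarely used VTune's memory-access optimization and assembly-level analysis. I want to use VTune to optimize from the angle of improving the compute pipeline and contiguous memory-access pipelining.

References

None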

Posted 2021-08-27 | Updated 2025-01-30 | Tutorials | a few seconds read (About 85 words)

IPCC Preliminary SLIC Case1/2/3

case 1

default

enforce_intel

case 2

default

enforce_intel

case 3

default

enforce_intel

Further study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, and rants

References

None

Posted 2021-08-24 | Updated 2025-01-30 | Tutorials | a minute read (About 188 words)

IPCC Preliminary SLIC Optimization 6: Non-blocking MPI

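Non-blocking MPI

MPI_Send & MPI_Recv

MPI_AllTogether() (presumably MPI_Allgather) is even slower, taking 4 s

Manual vectorization and alignment

debug

vx = _mm256_set_pd(x);   // change to:
vx = _mm256_set_pd(x+3,x+2,x+1,x);

Something looked off, so I printed more output. The first iteration is certainly correct because it compares against DBL_MAX.

Further study needed

Why is the transfer still so slow over a 56 Gb/s InfiniBand network? Are writes the slow part?

7*8=56, 8 lanes

Problems encountered

None yet

Motivation, summary, reflections, and rants

References

None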

Posted 2021-08-18 | Updated 2025-01-30 | Tutorials | 3 minutes read (About 477 words)

Hybrid Multithreaded/OpenMP + MPI parallel Programs

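Things to watch out for in hybrid programming

See https://www.nhr.kit.edu/userdocs/horeka/batch_slurm_mpi_multithread/

There is also a PPT (page 16)

Search Google for: hybrid openmpi openmp

Compiling with Intel MPI

Note that libraries compiled directly with mpif90/mpicxx seem to produce errors, so use instead:

icc -openmp hello.cpp -o hello -DMPICH_IGNORE_CXX_SEEK -L/Path/to/mpi/lib/ -lmpi_mt -lmpiic -I/path/to/mpi/include
Here -DMPICH_IGNORE_CXX_SEEK works around a duplicate-definition problem in the MPI-2 standard, and the mpi_mt library must be used to guarantee thread safety.

With Intel's mpirun, -env I_MPI_PIN_DOMAIN omp must be added after mpirun so that each MPI process launches its OpenMP threads.

Use export OMP_NUM_THREADS to control how many threads each MPI process spawns.

How Open MPI supports multithreading (OpenMP) [2]

Check that the installed build supports multithreading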

shell$ ompi_info | grep "Thread support"
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, Event lib: yes)
shell$

"MPI_THREAD_MULTIPLE: yes" means multithreading is supported.

Enabling multithreading in a C program

#include <mpi.h>
int MPI_Init_thread(int *argc, char ***argv,
    int required, int *provided)

argc
        C/C++ only: Pointer to the number of arguments.
argv
        C/C++ only: Argument vector.
required
        Desired level of thread support (integer).
provided
        Available level of thread support (integer).

The possible values of required are 0, 1, 2, and 3, corresponding to:

MPI_THREAD_SINGLE
        Only one thread will execute.
MPI_THREAD_FUNNELED
        If the process is multithreaded, only the thread that called MPI_Init_thread will make MPI calls.
MPI_THREAD_SERIALIZED
        If the process is multithreaded, only one thread will make MPI library calls at one time.
MPI_THREAD_MULTIPLE
        If the process is multithreaded, multiple threads may call MPI at once with no restrictions.

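Calling MPI_Init_thread with MPI_THREAD_SINGLE is equivalent to calling MPI_Init.

Note

Multithreading support in 3.1.6 is still at an early stage. The overhead is high (though I do not know why).

Further study needed

Learn MapReduce or Hadoop? pthread vs OpenMP?

Problems encountered

None yet

Motivation, summary, reflections, and rants

References

[1] https://blog.csdn.net/Morizen/article/details/113863591

[2] OpenMPI-multThread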

Posted 2021-08-17 | Updated 2025-01-30 | Tutorials | 9 minutes read (About 1297 words)

IPCC Preliminary SLIC Optimization 5: MPI + OpenMP

AMD

Approach	Description	Total time	Speedup	Notes
Baseline	Serial program	161.7 s	1
more3omp	Everything up to here is a provably effective optimization; omp_num=32	14.08s
more3omp	Everything up to here is a provably effective optimization; omp_num=64	11.4s
deletevector	Move the three sz-sized vectors to global variables, which requires knowing sz in advance or declaring a very large array	10.64s		Shows that making them global does not hurt access time
enforce_Lscan	IPCC opt 4	8.49s	19
enforce_Lscan_MPI_intel	intel icpc	3.8s	42.36
Baseline2-max ppm	1.2GB ppm 10102440*1024	928s
enforce_Lscan	Baseline2	43.79s	21.2
enforce_Lscan_MPI_intel	intel icpc + two nodes (two timings) + MPI(DoRGBtoLABConversion)	18.8s / 20s	46.4
enforce_Lscan_intel	intel icpc + single node	15.8s	58.74	MPI(DoRGBtoLABConversion) made it 2 s slower
manualSIMD		13.9s
stream		13.6s
vec2mallocOMP		11.0s
mmap		10.6s
+ -O3	enforce_Lscan_intel	16.2s
+ -xHost	Wrong results	17.8s
-Ofast		16.9s
-ipo		15.9s
-O3 -ipo		16.8s
-O3 -march=core-avx2 -fma -ftz -fomit-frame-pointer		16.0s
g++ suggested options	-O3 -march=znver1 -mtune=znver1 -fma -mavx2 -m3dnow -fomit-frame-pointer	18.1s
g++ suggested options2	-O3 -march=znver2 -mtune=znver2 -fma -mavx2 -m3dnow -fomit-frame-pointer	19.79s
g++ -Ofast		16.9s
aocc -Ofast		16.3s
aocc suggested options		16.2s

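MPI programming

The plan is MPI with two processes on two nodes. There is no shared memory across nodes as there is with OpenMP, so communication should be kept to a minimum.

PerformSuperpixelSegmentation_VariableSandM

The ideas below about limiting the synchronized region are wrong:
the cluster centers move unpredictably, so synchronizing everything is best.

Part 1: the core loop
2. Split the numk centers in half. What needs synchronizing is the centers in the middle band of $$width*(3S)$$ (because of the PerturbSeeds perturbation and the fairly large offset, this should be the 2 adjacent rows in the middle, a region about 3S high, 1.5S above and below).
3. For distlab, the later result must overwrite the earlier one (only in the region that was actually computed). klabels takes the label with the smaller distvec, which probably needs a custom reduction.
4. The numk centers come in odd and even rows; after thinking it through, both cases behave the same.
Part 2: maxlab per center (extract the data for the numk centers from the sz pixels)
2. Split sz in half. With minimal synchronization, only the maxlab of centers adjacent to the split needs a max reduction.
Part 3: computing the centroid sums of the numk centers over the sz pixels
2. Likewise, split sz in half and synchronize with a vector-sum reduction.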
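The custom reduction mentioned for klabels boils down to an element-wise "keep the smaller distance" merge. The sketch below shows only that merge in plain C++, without any MPI calls; the struct and function names are mine. In an MPI program this logic would be the body of a user-defined reduction operation, or could possibly be replaced by the built-in MPI_MINLOC on (double, int) pairs.

```cpp
#include <cstddef>
#include <vector>

struct PixelAssign {
    double dist;   // distance to the best centre seen so far (distvec)
    int    label;  // index of that centre (klabels)
};

// Element-wise merge of two partial results: keep the assignment with the
// smaller distance. On ties the entry already in inout is kept, so the
// result does not depend on which side computed it last.
void merge_min_dist(const std::vector<PixelAssign>& in, std::vector<PixelAssign>& inout) {
    for (std::size_t i = 0; i < inout.size() && i < in.size(); ++i) {
        if (in[i].dist < inout[i].dist) inout[i] = in[i];
    }
}
```

Carrying the label together with its distance is what makes the reduction well defined; reducing klabels alone would lose the information needed to pick a winner.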

DoRGBtoLABConversion 0.61s

Written with MPI_Send. At first I did not notice that it blocks, but why is it this slow?

For comparison, the earlier enforce_Lscan at 8.49s:

DoRGBtoLABConversion 0.56s
PerformSuperpixelSegmentation_VariableSandM 5.52s
1. core 0.53s
2. maxlab 0.02s
3. sigma 0.03s
DetectLabEdges 0.31s
EnforceLabelConnectivity 1.19s
PerformSuperpixelSegmentation_VariableSandM 0.88s

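Guesses for the 10-20x slowdown:

Is printf the cause? No, it is the same without printing.
Wrong omp_num value? Probably not.
Not actually running on two nodes? No.
g++ vs mpicxx? No.
Not using IB? Apparently not that either.
Open MPI not supporting OpenMP? This is the direction to investigate.

It looks like OpenMP is not running properly: the time is the same for omp_num = 1, 32, and 64. It feels like a build problem with hybrid programming, a fake OpenMP parallelism with a lock somewhere. I suddenly remembered that the hybrid build of Quest needs a multithread-like option turned on in cmake, but that is not used here.

It does not seem to be an MPI_Init_thread problem either.

Trying Intel MPI

It works wonders. (The results are correct; I stopped taking screenshots after this.) At this point you might conclude that Open MPI fails to support OpenMP somewhere. But something strange follows: if NODELIST is fa rather than fb, the job does not run and simply hangs. 😰

I found no official documentation of a difference, so I compared the two partitions. IB, CPU, and memory are all identical.

Run again with the nodelist restricted.

Add timing output, using the fb partition.

The problem went away, though the fa partition may run hotter because it is used more often.

The largest ppm example

The run time is now under 5 s, so a larger example is needed before discussing the cost and benefit of two nodes. The previous example was 256034000.
Here a 1024040960 ppm was generated (presumably 10240*40960, which is consistent with the 1.2 GB file size). With any larger ppm the program's arrays can no longer be allocated on the stack, and the data structures would need to be redesigned.

Rerun the current fastest version, enforce_Lscan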

icpc + enforce_Lscan_MPI(DoRGBtoLABConversion)

icpc + enforce_Lscan

g++ suggested options

icpc + manualSIMD + lessLscan

icpc + manualSIMD + LscanSimple

icpc + manualSIMD + LscanSimple + stream

icpc + manualSIMD + LscanSimple + stream + mallocOMPinit

icpc + manualSIMD + LscanSimple + stream + mallocOMPinit + mmap

icpc + manualSIMD + LscanSimple + stream + mallocOMPinit + mmap + unrollLoop

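Reason for giving up

https://www.bilibili.com/video/BV1a44y1q782 (58:00 to 58:50)

Further study needed

None yet

Problems encountered

The hybrid code has a problem: two nodes are slower, not faster. How should it be written?
Can that serial code really not be parallelized?
Why does vectorization give no speedup? Does it need loop unrolling?

Suggestions from senior labmate Jiang

Non-blocking MPI communication, gather, reduce
Manual vectorization

Motivation, summary, reflections, and rants

References

None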

Posted 2021-08-13 | Updated 2025-01-30 | Tutorials | 21 minutes read (About 3215 words)

IPCC Preliminary SLIC Optimization 4: EnforceLabelConnectivity

node5/6

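Approach	Description	Total time	Speedup
Baseline	Serial program	207 s	1
more3omp	0.4+5+0.3	23.0s
Finer timing, OpenMP initialization	0.03+5+0.1	21.2s
Same algorithm, locks required	Extremely slow
Scanline algorithm	0.03+2.2+0.1	18.5s
Scanline algorithm + task dynamic thread pool		26s
Scanline algorithm + task dynamic thread pool + delayed launch		26s
Scanline algorithm + task dynamic thread pool + delayed launch		26s
Scanline algorithm + deduplication and coarser granularity: one row per thread, no two threads on the same row	But it did not actually run in parallel	106s
Scanline algorithm + 64 resident threads		86s

Initial timing

Time taken is 21.364595 6.066717 EnforceLabelConnectivityComputing

Finer timing, OpenMP initialization

A finer breakdown shows that malloc of sz elements takes no time; initializing them to -1 is what costs time.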

Time taken is 16.963094 0.000025 EnforceLabelConnectivity	numlable
Time taken is 17.395982 0.432887 EnforceLabelConnectivity	xvec yvec
Time taken is 22.668060 5.272079 EnforceLabelConnectivity iteration
Time taken is 23.001499 0.333438 EnforceLabelConnectivity klabelsComputing

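After the change

Time taken is 16.063057 0.000026 EnforceLabelConnectivity       numlable
Time taken is 16.095485 0.032428 EnforceLabelConnectivity       xvec yvec
Time taken is 21.116599 5.021114 EnforceLabelConnectivity iteration
Time taken is 21.237874 0.121275 EnforceLabelConnectivity klabelsComputing

Change dx4, dy4 for contiguous memory access

But this may make adjlabel wrong and therefore the result wrong.

flood fill

OpenMP thread pool, no locks

Over 4 minutes with all cores busy and no termination; the state was already corrupted.

OpenMP thread pool with locks (one or many)

Over 5 minutes with all cores busy and no termination. A total failure.

Possible reasons:

The work is not compute-bound to begin with; locking serializes it, and on top of that come sz lock acquisitions and releases.
When one thread modifies nlabels, the other running threads may have to synchronize the change when they read it.

Then I wondered whether the problem was having only one lock, and tried an implementation with many locks. It still timed out without finishing.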

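omp_set_lock(&lock[nindex]); // acquire the lock
if( 0 > nlabels[nindex] && labels[oindex] == labels[nindex] )
{
	xvec[count] = x;
	yvec[count] = y;
	nlabels[nindex] = label;
	count++;
}
omp_unset_lock(&lock[nindex]); // release the lock

Multiple locks handle the contention on nlabels, but the contention on count still needs a single lock. The only way around it would be to store into a queue instead of an array, since that removes the counter.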

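OpenMP thread pool + queue + (no) locks

Yay, segmentation fault (core dumped). It did read out of bounds.

Not so yay: even with locks around the parallel part, it still fails with

double free or corruption (out) // an out-of-bounds memory access or similar

Debugging without locks

It segfaults somewhere between rows 200 and 400.

Debugging with locks

Then I added timestamps.

They show that at least the early part runs normally.

Across several runs it sometimes segfaults and sometimes reports corruption. I give up.

But the location still seems to be in the loop above.

The failing location differs every time, but the points being iterated are still correct.

Atomic operations on a queue need your own locking

https://stackoverflow.com/questions/32227321/atomic-operation-on-queuet

openmp for + double queue + (no) locks

munmap_chunk(): invalid pointer

Huh? What?

No way around it: lock everything, both reads and writes. But that is extremely slow, over 4 minutes.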

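omp_set_lock(&lock); // acquire the lock
qindex = workq.front();
workq.pop();
omp_unset_lock(&lock); // release the lock

omp_set_lock(&lock2); // acquire the lock
if( 0 > nlabels[nindex] && labels[oindex] == labels[nindex] )
{
	nlabels[nindex] = label;
	workq2.push(nindex);
	saveq.push(nindex);
}
omp_unset_lock(&lock2); // release the lock

Reads and writes go to different queues, so I tried two locks. Still extremely slow; it does not finish in 5 minutes.

Replacing the queue with a stack makes no difference

q.front() becomes q.top()

Scanline implementation

The scanline algorithm is at least an order of magnitude faster than the per-pixel algorithm.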
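For reference, a serial scanline flood fill can be sketched as follows. This is a generic textbook-style version written for illustration, not the code timed below; the function name and signature are mine, while labels and nlabels follow the naming used in these notes.

```cpp
#include <stack>
#include <utility>
#include <vector>

// Fills the 4-connected region of labels that contains (sx, sy), writing
// newLabel into nlabels (entries < 0 mean "not visited yet").
// Returns the number of pixels filled.
int scanline_fill(const std::vector<int>& labels, std::vector<int>& nlabels,
                  int w, int h, int sx, int sy, int newLabel) {
    const int target = labels[sy * w + sx];
    auto open = [&](int x, int y) {
        return nlabels[y * w + x] < 0 && labels[y * w + x] == target;
    };

    int count = 0;
    std::stack<std::pair<int, int>> seeds;
    seeds.push(std::make_pair(sx, sy));
    while (!seeds.empty()) {
        const int x = seeds.top().first;
        const int y = seeds.top().second;
        seeds.pop();
        if (!open(x, y)) continue;  // already filled through another seed

        // Extend the run as far as possible to the left and right.
        int left = x;
        while (left > 0 && open(left - 1, y)) --left;
        int right = x;
        while (right + 1 < w && open(right + 1, y)) ++right;

        for (int i = left; i <= right; ++i) {  // fill the whole run at once
            nlabels[y * w + i] = newLabel;
            ++count;
        }

        // Push one seed per run of open pixels in the rows above and below.
        const int rows[2] = {y - 1, y + 1};
        for (int r = 0; r < 2; ++r) {
            const int ny = rows[r];
            if (ny < 0 || ny >= h) continue;
            bool inRun = false;
            for (int i = left; i <= right; ++i) {
                const bool o = open(i, ny);
                if (o && !inRun) seeds.push(std::make_pair(i, ny));
                inRun = o;
            }
        }
    }
    return count;
}
```

Compared with a per-pixel fill, each stack entry stands for a whole horizontal run, so far fewer pushes and pops are needed and the inner loops walk memory contiguously.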

Time taken is 16.144375 13.062605 PerformSuperpixelSegmentation_VariableSandM loop
Time taken is 16.144399 0.000025 EnforceLabelConnectivity       numlable
Time taken is 16.177300 0.032901 EnforceLabelConnectivity       xvec yvec
Time taken is 48.978709 32.801409 EnforceLabelConnectivity iteration
Time taken is 49.086252 0.107543 EnforceLabelConnectivity klabelsComputing time=49086 ms
There are 86475718 points' labels are different from original file.

Something is wrong somewhere and needs debugging. A quick debugging session found a small bug.

Time taken is 15.670141 0.000024 EnforceLabelConnectivity      numlable
Time taken is 15.718014 0.047873 EnforceLabelConnectivity      xvec yvec
Time taken is 22.103680 6.385666 EnforceLabelConnectivity iteration
Time taken is 22.219160 0.115480 EnforceLabelConnectivity klabelsComputing time=22219 ms
There are 0 points' labels are different from original file.

Embarrassingly, it is not any faster. (Sob.)

Tidying up the variables saved 3 seconds. A big win!

Time taken is 16.203514 0.000029 EnforceLabelConnectivity      numlable
Time taken is 16.234977 0.031463 EnforceLabelConnectivity      xvec yvec
Time taken is 18.428990 2.194013 EnforceLabelConnectivity iteration
Time taken is 18.527664 0.098674 EnforceLabelConnectivity klabelsComputing time=18527 ms
There are 0 points' labels are different from original file.

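Parallel scanline: spawn threads for the rows above and below, run left/right within the thread

Written with task

I wrote in the summary that tasks are hard to control. But I refused to believe it and tried anyway 😉

Got a segfault. Printing the task calls shows tasks being spawned top to bottom in a zigzag, and not one of them finishing. As designed, advancing x horizontally should be faster than spawning tasks, but the tasks now appear to be blocked.

It seems the following was missing, but with it added the result is still wrong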

#pragma omp parallel num_threads(64)
{
	#pragma omp single

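Let's analyze carefully how it deviated from expectations:

(0,2) spawns (0,3), and (0,3) spawns (0,4), which is normal. But (0,3) also spawned (2,4). This means that when (0,3) looped to (1,3), it found (1,4) already processed and (2,4) unprocessed. That in turn means (0,4), after being created by (0,3), reached (1,4) first and processed it.
(0,4) reached (1,4) first and then spawned (1,3) in turn. Then, because (0,3) spawned (2,4), (0,4) later stops when it reaches (2,4), assuming the segment ends there.
I cannot point to a specific bug, but this uncontrolled, chaotic spawning is certainly not what we want.

I tried removing the time-consuming print. The time did not drop (repeated calls), and the result was still wrong. (Only later did I find the bug: threadcount and threadq were not cleared after each iteration. Damn.)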

Time taken is 16.226124 0.000024 EnforceLabelConnectivity      numlable
Time taken is 16.258697 0.032573 EnforceLabelConnectivity      xvec yvec
Time taken is 26.320222 10.061525 EnforceLabelConnectivity iteration
Time taken is 26.401399 0.081177 EnforceLabelConnectivity klabelsComputing time=26401 ms
There are 86588716 points' labels are different from original file.

Time taken is 15.743455 0.000025 EnforceLabelConnectivity       numlable
Time taken is 15.773654 0.030198 EnforceLabelConnectivity       xvec yvec
Time taken is 26.348979 10.575326 EnforceLabelConnectivity iteration
Time taken is 26.442129 0.093150 EnforceLabelConnectivity klabelsComputing time=26442 ms
There are 0 points' labels are different from original file.

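The current idea is to impose an order: finish processing the whole row of (x,y) before launching tasks, or use delayed launch.

Delayed launch

Store the launch task (x+delay,y) in a queue and check it on every iteration; after the loop ends, launch everything that remains.
Alternatively, on reaching (x+delay,y), launch (x,y). But tasks left over after the loop ends are awkward to handle.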

Time taken is 17.344073 0.000027 EnforceLabelConnectivity      numlable
Time taken is 17.377535 0.033462 EnforceLabelConnectivity      xvec yvec
Time taken is 28.461901 11.084366 EnforceLabelConnectivity iteration
Time taken is 28.544698 0.082797 EnforceLabelConnectivity klabelsComputing time=28544 ms
There are 86588716 points' labels are different from original file.

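Strange, the result is wrong. Is delay too small?

Raising delay from 10 to 750, and even to 2600, which exceeds the width, still gives a wrong result. That should not happen, because at that point it is equivalent to finishing the whole row of (x,y) before launching tasks.

Only then did I realize the bug was elsewhere: threadcount and threadq were not cleared after each iteration. Damn.

With delay = 2600 the result is correct, but far too slow. It should at least beat the serial version.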

Time taken is 15.538704 0.000026 EnforceLabelConnectivity      numlable
Time taken is 15.577671 0.038968 EnforceLabelConnectivity      xvec yvec
Time taken is 28.233859 12.656188 EnforceLabelConnectivity iteration
Time taken is 28.332256 0.098396 EnforceLabelConnectivity klabelsComputing time=28332 ms

delay = 20 is slightly faster (sob)

Time taken is 15.631368 0.000025 EnforceLabelConnectivity       numlable
Time taken is 15.661496 0.030128 EnforceLabelConnectivity       xvec yvec
Time taken is 26.788105 11.126609 EnforceLabelConnectivity iteration
Time taken is 26.869487 0.081382 EnforceLabelConnectivity klabelsComputing time=26869 ms
There are 0 points' labels are different from original file.

Analyzing the slowdown

Add timestamps

end Time 84 32839 taken is 0.000000 dxy4
end Time 84 32839 taken is 0.000000 threadcount
end Time 84 32839 taken is 0.031285 core
end Time 84 32839 taken is 0.000023 count

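So the parallelization is still poorly written.

Checked whether 64 cores are used: htop shows 64 cores.
Suspected causes
- Many duplicate tasks are generated. It is still a partitioning problem: once up/down is constrained, how can left/right duplication be resolved?
  - One row per thread, with tasks assigned to thread y%64. But OpenMP tasks apparently cannot be pinned to a given thread.
  - Push tasks into queue y%64 and have threads take tasks from their queue.
  - E.g. the last two threads of row 3 have threadcount=0 and do nothing.
- Many tasks are too small: the granularity is too fine and the count too high, so spawning overhead dominates.
- The task thread pool is just unreliable.
Could split by rows or by columns, depending on the input.

Deduplicate and coarsen granularity: one row per thread, never two threads on the same row

Push tasks into queue y%64 and have threads take tasks from their queue.
- But reads and writes on the same queue conflict again. Use 64 double queues, one for writing and one for reading, and wait and synchronize at the swap.
- Different threads writing to the same queue also conflict, so give each thread its own 64 queues and merge them at synchronization.

A nice idea, but in practice there are never 64 tasks per round, mostly just 1-5. The result is near single-threaded execution plus spawning overhead. (Someone is using node6; node5 is slower.)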

Time taken is 36.212269 32.876626 PerformSuperpixelSegmentation_VariableSandM loop
Time taken is 36.212297 0.000028 EnforceLabelConnectivity       numlable
Time taken is 36.247536 0.035239 EnforceLabelConnectivity       xvec yvec
Time taken is 106.097341 69.849805 EnforceLabelConnectivity iteration
Time taken is 106.204154 0.106813 EnforceLabelConnectivity klabelsComputing time=106204 ms
There are 0 points' labels are different from original file.

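The reason seems to be that there is only 1 task at the start, and each task usually produces just 1 or 2 more. Starting with 64 initial tasks would fix it.

But how can 64 be started at the beginning when the tasks are not known in advance?

64 resident threads

Another segfault after writing it; debugging.

[64][64][10000] is too large; the queues never hold that many per round, so use [64][64][100].

The completion count needs synchronization, so a critical section is required. With that the result is correct.
But it is far too slow.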

Time taken is 28.219408 0.000017 EnforceLabelConnectivity       numlable
Time taken is 28.271994 0.052586 EnforceLabelConnectivity       xvec yvec
Time taken is 83.591540 55.319546 EnforceLabelConnectivity iteration
Time taken is 83.696990 0.105450 EnforceLabelConnectivity klabelsComputing time=83696 ms
There are 0 points' labels are different from original file.

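Controlled segmented tasks

No time to investigate

openmp for + special double array (1+4?)

No time to investigate

Further study needed

I probably need to write my own data structure:

Data may be unordered
Ideally all elements are distinct
Supports parallel reads of every element (an array?)
Supports parallel writes of a batch of elements while maintaining the size

Problems encountered

None yet

Parallelization summary

This round of parallelization taught me several things:

Task partitioning must never overlap or interfere. For example, overlapping 4-neighbor flood-fill tasks cause races and need locks. With scanlines the tasks do not overlap, which avoids inefficient locking altogether. Overlap also means duplicated computation that ties up threads.
If the results of parallel tasks do not have to live in a single variable, store them separately. That needs neither thread-private variables with a final reduction nor writes to a shared location that would race. For example, this job produces a bunch of unrelated indices, so give each thread its own array: no conflicts, and the next step can stay parallel. Or store the indices in a larger sz-sized array, which conflicts even less.
When the number of tasks grows unpredictably, OpenMP tasks are not recommended. Automatic scheduling is hard to control: you know neither how far the iteration has progressed nor whether hidden races will show up later. A double-queue style schedule is preferable: fix a batch of tasks, run the batch in parallel, synchronize its results, then go parallel again.
1. Caveat: while a batch runs in parallel, still store the results separately and merge them at synchronization.
2. A double queue can run short of tasks and degrade to a single thread.
3. I thought of a structure with 64 resident threads that both produce tasks and wait for tasks. The catch: task storage must serve both the writes of producers and the reads of consumers. Ignoring overflow (a ring buffer), maintaining read and write positions in an array is feasible. For termination, a thread that finds no task to read checks whether all the tasks it published have completed and marks them done. The program ends once all published tasks are complete.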
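The "store results separately, merge at synchronization" advice can be sketched like this. The function name and the chunking scheme are assumptions of mine: each chunk of the index range appends only to its own bucket, so the parallel loop needs neither a lock nor a shared counter. Without OpenMP enabled the pragma is ignored and the loop runs serially.

```cpp
#include <cstddef>
#include <vector>

// Collects the indices i with flags[i] != 0. The range is cut into `chunks`
// contiguous pieces and every piece writes only to its own bucket.
std::vector<int> collect_indices(const std::vector<char>& flags, int chunks) {
    const long long n = static_cast<long long>(flags.size());
    std::vector<std::vector<int>> bucket(chunks);

    #pragma omp parallel for schedule(static)
    for (int c = 0; c < chunks; ++c) {
        const int begin = static_cast<int>(n * c / chunks);
        const int end   = static_cast<int>(n * (c + 1) / chunks);
        for (int i = begin; i < end; ++i) {
            if (flags[i]) bucket[c].push_back(i);
        }
    }

    // Serial merge at the synchronization point. Concatenating the buckets in
    // chunk order keeps the indices sorted.
    std::vector<int> out;
    for (const auto& b : bucket) out.insert(out.end(), b.begin(), b.end());
    return out;
}
```

The merged vector can then serve as the next batch of tasks in a double-queue schedule, which keeps the shared counter out of the parallel region entirely.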

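Well, I feel all that analysis was hot air. Serial is still faster; the problem is simply hard to partition. It is not a parallel algorithm.

Programming summary

Most of the bugs this time came down to:

Clearing all variables and resetting their initial values at the start of every loop iteration
Clearing everything at the end.

References

None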

Posted 2021-08-06 | Updated 2025-01-30 | Tutorials | 6 minutes read (About 903 words)

IPCC Preliminary SLIC Optimization 3

node6

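The example was too small, so the earlier timings fluctuated too much. I wrote a larger example and added timing output to every function to make speedups easier to judge. (Orz, someone is using node5.)

Approach	Description	Total time	Speedup	Notes
Baseline	Serial program	207 s	1
simpleomp	OpenMP in two places	57s
more1omp	maxlab	48s
more2omp	sigma + delete maxxy	24.8s	8.35
more3omp	DetectLabEdges + EnforceLabelConnectivity (this algorithm cannot be parallelized)	21.2s
icpc		13.4s
+ -O3		13.2s
+ -xHost		13.09s
+ -Ofast -xHost	Based on icpc	12.97s
+ -ipo		12.73s	16.26
-no-prec-div -static -fp-model fast=2		14.2s		Slower, not faster; the other options need to be tried on the AMD machine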

Baseline 207s

DoRGBtoLABConversion 10.4s
PerformSuperpixelSegmentation_VariableSandM 187.3s
1. core 15.3s
2. maxlab 1s
3. sigma 2.3s

simpleomp 57s

DoRGBtoLABConversion 0.89s
PerformSuperpixelSegmentation_VariableSandM 46s
1. core 0.94-1.8s
2. maxlab 1s
3. sigma 2.3-2.6s

more1omp 48s

DoRGBtoLABConversion 0.82s
PerformSuperpixelSegmentation_VariableSandM 37s
1. core 1-2.3s
2. maxlab 0.04-0.1s
3. sigma 2.3s

more2omp 24.8s

DoRGBtoLABConversion 0.85s
PerformSuperpixelSegmentation_VariableSandM 13.5s
1. core 0.8-1.7s
2. maxlab 0.02-0.1s
3. sigma 0.1s
DetectLabEdges 3.7s
EnforceLabelConnectivity 5.2s

more3omp 21.2s

DoRGBtoLABConversion 0.74s
PerformSuperpixelSegmentation_VariableSandM 12.3s
1. core 1.1s
2. maxlab 0.02-0.1s
3. sigma 0.1s
DetectLabEdges 0.7s
EnforceLabelConnectivity 5.8s (needs a different algorithm)
PerformSuperpixelSegmentation_VariableSandM (vector declaration time; consider hoisting the declarations out) 1.6s

icpc 13.4s

DoRGBtoLABConversion 0.44s
PerformSuperpixelSegmentation_VariableSandM 8.49s
1. core 0.5-1.1s
2. maxlab 0.04s
3. sigma 0.05s
DetectLabEdges 0.54s
EnforceLabelConnectivity 2.79s (needs a different algorithm)
PerformSuperpixelSegmentation_VariableSandM (vector declaration time; consider hoisting the declarations out) 1.16s

12.7s

DoRGBtoLABConversion 0.42s
PerformSuperpixelSegmentation_VariableSandM 7.98s
1. core 0.5-1.1s
2. maxlab 0.04s
3. sigma 0.05s
DetectLabEdges 0.49s
EnforceLabelConnectivity 2.69s (needs a different algorithm)
PerformSuperpixelSegmentation_VariableSandM (vector declaration time; consider hoisting the declarations out) 1.13s

IPCC AMD

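Approach	Description	Total time	Speedup
Baseline	Serial program	161.7 s	1
more3omp	Everything up to here is a provably effective optimization; omp_num=32	14.08s
more3omp	Everything up to here is a provably effective optimization; omp_num=64	11.4s
deletevector	Move the three sz-sized vectors to global variables, which requires knowing sz in advance or declaring a very large array	10.64s	Shows that making them global does not hurt access time
enforce_Lscan	ipcc opt 4	8.49s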

Baseline 161.7s

DoRGBtoLABConversion 11.5s
PerformSuperpixelSegmentation_VariableSandM 143s
1. core 11.5s
2. maxlab 0.8s
3. sigma 1.7s
DetectLabEdges 2.74s
EnforceLabelConnectivity 3.34s
PerformSuperpixelSegmentation_VariableSandM 1.11s

more3omp 14.08s

DoRGBtoLABConversion 0.69s
PerformSuperpixelSegmentation_VariableSandM 8.08s
1. core 0.73s
2. maxlab 0.02s
3. sigma 0.05s
DetectLabEdges 0.37s
EnforceLabelConnectivity 3.8s
PerformSuperpixelSegmentation_VariableSandM 1.1s

more3omp 11.4s

DoRGBtoLABConversion 0.61s
PerformSuperpixelSegmentation_VariableSandM 5.86s
1. core 0.53s
2. maxlab 0.02s
3. sigma 0.03s
DetectLabEdges 0.33s
EnforceLabelConnectivity 3.5s
PerformSuperpixelSegmentation_VariableSandM 1.02s

deletevector 10.64s

DoRGBtoLABConversion 0.59s
PerformSuperpixelSegmentation_VariableSandM 5.75s
1. core 0.53s
2. maxlab 0.02s
3. sigma 0.03s
DetectLabEdges 0.41s
EnforceLabelConnectivity 3.84s
PerformSuperpixelSegmentation_VariableSandM 0s

enforce_Lscan 8.49s

DoRGBtoLABConversion 0.56s
PerformSuperpixelSegmentation_VariableSandM 5.52s
1. core 0.53s
2. maxlab 0.02s
3. sigma 0.03s
DetectLabEdges 0.31s
EnforceLabelConnectivity 1.19s
PerformSuperpixelSegmentation_VariableSandM 0.88s

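Further study needed

Declare the vectors outside
Replace EnforceLabelConnectivity with a parallel algorithm
1. Data structure requirements:
  1. Store the positions of the already-colored region, which may need to be restored later
    1. May be unordered; ordered is better since memory access is then contiguous
    2. Either x,y or index works. x,y makes boundary checks easier
  2. 4-way or 8-way? Since there is duplication, record the incoming direction/path and only move in certain directions. 4 matches the theory, 8 does not meet the requirement, and 2 cannot always cover everything.
  3. 3-way is possible but slightly awkward to implement
2. Combine flood fill with PBFS in a tailored way
3. OpenMP thread pool + locks (two sz-sized arrays for x and y, nlabels for the new labels) + time the declarations and the flood fill + hoist these sz-sized declarations out
4. OpenMP thread pool + queue (can the final processing be parallel, or must items be popped one by one?) + are locks needed? (this depends on whether the queue implementation relies on a counter)
5. openmp for + double queue *4/2? + are locks needed?
6. Scanline implementation: spawn threads for the rows above and below, run left/right within the thread
  1. Memory access contiguity across threads
7. How are queues and stacks implemented, and how fast are they (push, pop, and size)?
8. Does a stack have size?
Add MPI on the AMD machines for hybrid programming, running on 2 nodes

Problems encountered

None yet

Motivation, summary, reflections, and rants

References

None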

Posted 2021-08-03 | Updated 2025-01-30 | Tutorials | a few seconds read (About 74 words)

Training course - IPCC 5 Optimize common tools

objdump

Disassemble the executable and inspect the assembly to judge whether the code was optimized (auto-vectorization, inlining).
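A small loop makes a convenient test subject for this workflow. The function below is a generic example of mine, and the file name in the comment is hypothetical; the point is only what to look for in the disassembly.

```cpp
#include <cstddef>

// A loop that compilers usually auto-vectorize at -O3. One way to check
// (saxpy.cpp is a hypothetical file name):
//   g++ -O3 -march=native -c saxpy.cpp -o saxpy.o
//   objdump -d -C saxpy.o
// In the body of saxpy, packed instructions such as mulps/addps (or their
// v-prefixed forms on xmm/ymm/zmm registers) indicate vectorization, while
// scalar code uses the ss variants such as mulss/addss. A call to a small
// helper that remains in the listing means the helper was not inlined.
void saxpy(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Comparing the listing at -O2 and -O3, or with and without -march=native, shows quickly which flag actually changes the generated code.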

Further study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, and rants

References

None

Posted 2021-07-26 | Updated 2025-01-30 | Tutorials | 3 minutes read (About 507 words)

IPCC Preliminary SLIC Optimization 2

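chivier's advice on IPCC amd_256

Approach	Description	Time	Speedup
Baseline	Serial program	21872 ms	1
Core loop with OpenMP	Thread count unspecified	8079ms
Core loop with OpenMP	Single node, 64 cores	7690ms	2.84
Switch to Intel's icpc	Based on the previous step	3071 ms	7.12
-xHOST	The other options do not work; based on the previous step	4012ms
-O3	Based on the previous step	3593ms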

node5

Intel(R) Xeon(R) Platinum 8153 CPU @ 2.00GHz

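Approach	Description	Time	Speedup
Baseline	Serial program	29240 ms	1
Core loop with OpenMP	Thread count unspecified (htop shows 64 cores)	12244 ms
Remove useless computation + the two for loops over numk	080501	11953 ms 10054 ms
Computation fusion (remove inv)	080502	15702 ms 14923 ms 15438 ms 11987 ms
maxlab OpenMP	Based on row 3, 080503	13872 ms 11716 ms
	Loop unrolling??	14436 ms 14232 ms 15680 ms

-xCOMMON-AVX512 is not supported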

Please verify that both the operating system and the processor support Intel(R) X87, CMOV, MMX, FXSAVE, SSE, SSE2, SSE3, SSSE3, SSE4_1, SSE4_2, MOVBE, POPCNT, AVX, F16C, FMA, BMI, LZCNT, AVX2, AVX512F, ADX and AVX512CD instructions.

-xCORE-AVX2

Please verify that both the operating system and the processor support Intel(R) X87, CMOV, MMX, FXSAVE, SSE, SSE2, SSE3, SSSE3, SSE4_1, SSE4_2, MOVBE, POPCNT, AVX, F16C, FMA, BMI, LZCNT and AVX2 instructions

FXSAVE, BMI, and LZCNT are not available; BMI1 and BMI2 are

Use -xAVX, or -xHOST to select the most advanced instruction set available

Please verify that both the operating system and the processor support Intel(R) X87, CMOV, MMX, FXSAVE, SSE, SSE2, SSE3, SSSE3, SSE4_1, SSE4_2, POPCNT and AVX instructions.

-fast bugs

ld: cannot find -lstdc++
ld: cannot find -lstdc++
/public1/soft/intel/2020u4/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin/libiomp5.a(ompt-general.o): In function `ompt_pre_init':
(.text+0x2281): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/var/spool/slurm/d/job437118/slurm_script: line 23: ./SLIC_slurm_intel_o3: No such file or directory

AMD EPYC 7~~2

icpc -Ofast -march=core-avx2 -ipo -mdynamic-no-pic -unroll-aggressive -no-prec-div -fp-mode fast=2 -funroll-all-loops -falign-loops -fma -ftz -fomit-frame-pointer -std=c++11 -qopenmp SLIC_openmp.cpp -o SLIC_slurm_intel_o3

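Follow-up optimization

OpenMP parallelization of the core loop

Remove useless computation

delete all maxxy
if(maxxy[klabels[i]] < distxy[i]) maxxy[klabels[i]] = distxy[i];

Computation fusion (fewer memory accesses)

Remove inv (effect questionable)
maxlab OpenMP parallelization (since it is not compute-intensive, should the loop be unrolled?)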
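
One way the maxlab update could be parallelized is with an OpenMP max reduction over the whole array. This is a sketch under assumptions: the function name `updateMaxLab` and its signature are made up, the loop shape mirrors the maxxy line shown above, and the array-section reduction needs a compiler with OpenMP 4.5 support.

```cpp
#include <cassert>
#include <vector>

// For every pixel i, raise maxlab[klabels[i]] to at least distlab[i].
// Each thread works on a private copy of the maxlab array; OpenMP merges
// the copies with max when the loop ends, so no locks are needed.
void updateMaxLab(const std::vector<int>& klabels,
                  const std::vector<double>& distlab,
                  std::vector<double>& maxlab)
{
    assert(klabels.size() == distlab.size());
    const int sz = static_cast<int>(klabels.size());
    const int numk = static_cast<int>(maxlab.size());
    double* m = maxlab.data();

    #pragma omp parallel for reduction(max : m[0:numk])
    for (int i = 0; i < sz; ++i) {
        const int k = klabels[i];
        if (m[k] < distlab[i]) m[k] = distlab[i];
    }
}
```

The loop does one compare per two loads, so it is likely limited by memory bandwidth rather than arithmetic, which may explain why the timings in the table did not clearly improve. Without -fopenmp (or -qopenmp) the pragma is ignored and the function runs serially with the same result.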

Further study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants~~

References

None

Posted 2021-07-23 Updated 2025-01-30 Tutorials 6 minutes read (About 890 words)

IPCC Preliminary SLIC Optimization 1

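Part 1 optimization

Start from data reuse (no repeated computation, lower computation cost), computation fusion (fewer memory accesses), loop reorganization, and changing the data structures

Data reuse

Data dependencies of the main variables

Initially all RGB colors are in ubuff, and klabel stores the labeling result

First, a conversion turns the RGB values in ubuff into three double[sz] arrays: lvec, avec, bvec
They are kept as the private members m_lvec, m_avec, m_bvec for access inside the class

Optimization suggestion: store the three Lab channels together so that cache access is contiguous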

DoRGBtoLABConversion(ubuff, m_lvec, m_avec, m_bvec);

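Redundant computation 1:

edges is computed for every pixel, but only part of it is used later, in a single place: the 196 centers and the 8 neighbors around each of them.

Optimization suggestion: compute edges only when they are needed (this both removes unnecessary computation and fuses computation)

Optimization suggestion: storing kseedsl/a/b/x/y in 5 separate vectors is not good; the 5-tuple of each center should be stored together, because the values are always accessed and modified together.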
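
The 5-tuple layout could look like the sketch below. `Seed` and `packSeeds` are hypothetical names; only the five kseeds vectors come from the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One cluster center: the five values that are always read and written together.
struct Seed {
    double l, a, b;  // Lab color
    double x, y;     // position
};

// Pack the five parallel vectors into a single vector of Seed.
std::vector<Seed> packSeeds(const std::vector<double>& kseedsl,
                            const std::vector<double>& kseedsa,
                            const std::vector<double>& kseedsb,
                            const std::vector<double>& kseedsx,
                            const std::vector<double>& kseedsy)
{
    const std::size_t numk = kseedsl.size();
    assert(kseedsa.size() == numk && kseedsb.size() == numk &&
           kseedsx.size() == numk && kseedsy.size() == numk);

    std::vector<Seed> seeds(numk);
    for (std::size_t k = 0; k < numk; ++k) {
        seeds[k] = Seed{kseedsl[k], kseedsa[k], kseedsb[k], kseedsx[k], kseedsy[k]};
    }
    return seeds;
}
```

A Seed is 40 bytes, so one center fits in a single 64-byte cache line when it is suitably aligned, whereas the five-vector layout touches five different lines per center. With only 196 centers the whole set is small either way, so the gain should be measured rather than assumed.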

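Optimization suggestions:

Core computation: should it be split up?
Replace division by maxlab[n] with multiplication by 1/maxlab[n]
maxxy is unused, so its definition and array maintenance can be removed (line 429)
distxy[i] then no longer needs to be an array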
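
The reciprocal trick can be sketched as follows. `invertAll`, `combinedDist`, and the weight name `invxywt` are assumptions for illustration; the point is that the division happens once per center instead of once per pixel, and that distxy becomes a plain local value.

```cpp
#include <cstddef>
#include <vector>

// Precompute 1/maxlab[n] once per center, outside the per-pixel loop.
std::vector<double> invertAll(const std::vector<double>& maxlab)
{
    std::vector<double> inv(maxlab.size());
    for (std::size_t n = 0; n < maxlab.size(); ++n) {
        inv[n] = 1.0 / maxlab[n];
    }
    return inv;
}

// Per-pixel distance using multiplications only. distxy is passed as a
// scalar, so no distxy array has to be kept.
inline double combinedDist(double distlab, double distxy,
                           double invmaxlab, double invxywt)
{
    return distlab * invmaxlab + distxy * invxywt;
}
```

Multiplying by a reciprocal can differ from a true division in the last bit of the result, which is normally harmless for a nearest-center comparison.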

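Optimization suggestions:

Use a mask for the if check
I would like to keep all attributes of pixel i together, but distvec has to be fully initialized. Should I maintain a char* passcheck array to record whether a pixel has been visited? If it has not been visited, assign directly; if it has, compare first and then decide whether to assign.
For item 2 and the parallelization of this part: 1. Partition by center, store each pixel's distance to the different centers, and reduce to the minimum at the end. 2. Or partition by coordinate, determine which regions a pixel falls in, then take the minimum distance.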
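
Option 2 (partition by coordinate) can be sketched like this. It is a simplified illustration: `Center` and `assignNearest` are hypothetical names, only the spatial distance is used, and every pixel checks every center instead of only the centers whose search window covers it.

```cpp
#include <limits>
#include <vector>

struct Center { double x, y; };

// Each pixel picks its nearest center on its own. Rows are split across
// threads, and every pixel writes only its own klabels entry, so neither
// locks nor a final reduction are needed.
void assignNearest(const std::vector<Center>& centers, int width, int height,
                   std::vector<int>& klabels)
{
    const int numk = static_cast<int>(centers.size());

    #pragma omp parallel for
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            double best = std::numeric_limits<double>::max();
            int bestk = -1;
            for (int k = 0; k < numk; ++k) {
                const double dx = x - centers[k].x;
                const double dy = y - centers[k].y;
                const double d = dx * dx + dy * dy;
                if (d < best) { best = d; bestk = k; }
            }
            klabels[y * width + x] = bestk;
        }
    }
}
```

With this partitioning the minimum is kept in a local variable, so neither a fully initialized distvec nor a passcheck array is required. Option 1 would instead need per-thread distance buffers and a min reduction over them.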

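Optimization suggestions:

For the summation part, store labxy together with 1/clustersize??
When this part is parallelized by coordinate, the reduction is a minimum or a sum over 196 elements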
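
A sum reduction over the per-center accumulators might look like the sketch below. The function name `clusterMeans` and the output names are hypothetical, only x/y are accumulated for brevity, and the array-section reduction assumes OpenMP 4.5 support. The caller is expected to pass zero-initialized output vectors.

```cpp
#include <cassert>
#include <vector>

// Accumulate per-cluster coordinate sums in parallel by pixel, then turn
// them into means by multiplying with 1/clustersize.
void clusterMeans(const std::vector<int>& klabels, int width, int height,
                  std::vector<double>& meanx, std::vector<double>& meany,
                  std::vector<int>& clustersize)
{
    assert(meanx.size() == meany.size() && meanx.size() == clustersize.size());
    const int sz = width * height;
    const int numk = static_cast<int>(meanx.size());
    double* sx = meanx.data();
    double* sy = meany.data();
    int* cs = clustersize.data();

    // Each thread sums into private copies of the three arrays;
    // OpenMP adds the copies together after the loop.
    #pragma omp parallel for reduction(+ : sx[0:numk], sy[0:numk], cs[0:numk])
    for (int i = 0; i < sz; ++i) {
        const int k = klabels[i];
        sx[k] += i % width;
        sy[k] += i / width;
        cs[k] += 1;
    }

    for (int k = 0; k < numk; ++k) {
        if (cs[k] > 0) {
            const double inv = 1.0 / cs[k];  // one division per cluster
            sx[k] *= inv;
            sy[k] *= inv;
        }
    }
}
```

With 196 centers each thread's private copy is only a few kilobytes, so the merge at the end is cheap compared with the pass over all pixels.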

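vector contiguity

The elements of a vector are stored contiguously in memory.

A vector is implemented as a dynamic array. When it runs out of capacity, it reallocates space in a way similar to C's realloc. Because the elements are stored contiguously, vector supports random access to elements in constant time. A vector's iterator is a Random Access Iterator.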
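
The contiguity guarantee can be checked directly. A minimal sketch, with a hypothetical helper name:

```cpp
#include <cstddef>
#include <vector>

// True if element i lives exactly i slots after the first element,
// which is what the C++ standard guarantees for std::vector.
bool isContiguous(const std::vector<double>& v)
{
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (&v[i] != v.data() + i) return false;
    }
    return true;
}
```

Since growing past the capacity moves every element to a new buffer, calling reserve(sz) once up front avoids repeated reallocation when the final size is known, as it is for the sz-sized arrays in SLIC.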

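Questions about how caches work

Does each cache level simply hold the data at all addresses around the data that was read? Or is data read block by block?

If data is fetched block by block and the cache is large enough to hold it, then for m_lvec, m_avec, m_bvec, fetching one block from each has the same effect as storing them together: the elements that follow m_lvec[i], namely m_lvec[i+1], m_avec[i+1], m_bvec[i+1], are already in the cache as well.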
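
Caches do move data in fixed-size blocks (cache lines). The sketch below counts the lines touched by a sequential sweep for both layouts, assuming 64-byte lines (common on x86, but an assumption here), 8-byte doubles, and arrays that start on a line boundary. All names are hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t kLineBytes = 64;  // assumed cache line size

// Number of cache lines needed to cover `bytes` contiguous bytes.
constexpr std::size_t linesForBytes(std::size_t bytes)
{
    return (bytes + kLineBytes - 1) / kLineBytes;
}

// Sweep over n pixels stored as three separate double arrays (l, a, b).
constexpr std::size_t sweepLinesSeparate(std::size_t n)
{
    return 3 * linesForBytes(n * sizeof(double));
}

// Sweep over n pixels stored as one interleaved array of (l, a, b) triples.
constexpr std::size_t sweepLinesInterleaved(std::size_t n)
{
    return linesForBytes(3 * n * sizeof(double));
}
```

For 64 pixels both layouts touch 24 lines, which supports the reasoning above: for a sequential sweep, three separate arrays cost about the same as one interleaved array. The layouts differ for scattered access, where reading a single pixel touches three lines with separate arrays but only one or two with interleaved triples.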

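chivier's suggestions

#pragma omp parallel for collapse(2)
icpc -xCOMMON-AVX512 -O3 -std=c++11 -qopenmp SLIC.cpp -o SLIC
g++ -fopenmp
Optimize with OpenMP first, then split the work in two with MPI
No need to change the data structures; doing so would not make memory access contiguous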
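
For reference, this is what collapse(2) does on a doubly nested pixel loop. The function `fillSum` and its body are a made-up example; the pragma is the one from the list above.

```cpp
#include <vector>

// collapse(2) merges the y and x loops into a single iteration space of
// width*height iterations, which OpenMP then divides among the threads.
// The two loops must be perfectly nested: no statements between them.
void fillSum(std::vector<int>& out, int width, int height)
{
    #pragma omp parallel for collapse(2)
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            out[y * width + x] = x + y;
        }
    }
}
```

Collapsing mainly helps when the outer loop alone has too few iterations to keep all threads busy. Compile with g++ -fopenmp or icpc -qopenmp as listed above; without the flag the pragma is ignored and the loop runs serially.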

miniconda for tmux zsh htop gcc9

pip install gdbgui to localhost

gdb tui enable

Further study needed

None yet

Problems encountered

None yet

References

None
