### critical vs atomic

The fastest way is neither critical nor atomic. As a rough figure, an addition inside a critical section is about 200 times more expensive than a plain addition, and an atomic addition is about 25 times more expensive than a plain addition. (**These ratios may overstate the cost**: an atomic operation adds a few cycles of overhead, for synchronizing a cache line, on top of roughly one cycle for the addition itself. A critical section incurs **the cost of a lock**.)
The fastest option (not always applicable) is to give each thread its own counter and perform a reduction when you need the total sum.
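The three options above can be compared side by side. The following is a minimal sketch, not a benchmark: the function names are made up for illustration, and the code also compiles and gives the same results without OpenMP enabled, since the pragmas are then ignored.

```cpp
#include <cstdint>

// Each function adds 1 for every one of n iterations and returns n.
// They differ only in how the shared total is protected.

// Every increment takes the lock behind the critical section.
std::int64_t count_with_critical(std::int64_t n) {
    std::int64_t total = 0;
    #pragma omp parallel for
    for (std::int64_t i = 0; i < n; ++i) {
        #pragma omp critical
        total += 1;
    }
    return total;
}

// Every increment is a single atomic read-modify-write.
std::int64_t count_with_atomic(std::int64_t n) {
    std::int64_t total = 0;
    #pragma omp parallel for
    for (std::int64_t i = 0; i < n; ++i) {
        #pragma omp atomic
        total += 1;
    }
    return total;
}

// Each thread accumulates into its own private copy of total;
// the copies are combined once at the end of the loop.
std::int64_t count_with_reduction(std::int64_t n) {
    std::int64_t total = 0;
    #pragma omp parallel for reduction(+ : total)
    for (std::int64_t i = 0; i < n; ++i) {
        total += 1;
    }
    return total;
}
```

Timing the three functions for a large n on your own machine is the simplest way to check how far the rough ratios quoted above apply there.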
### critical vs ordered

omp critical is for mutual exclusion, while omp ordered refers to a specific loop and ensures that the region **executes sequentially in the order of the loop iterations**. omp ordered is therefore stronger than omp critical, but it only makes sense within a loop.
omp ordered has some other clauses, such as simd to enforce the use of a single SIMD lane only. You can also specify dependencies manually with the depend clause.
Note: Both omp critical and omp ordered regions have an implicit memory flush at the entry and the exit.
### ordered example
```cpp
vector<int> v;

#pragma omp parallel for ordered schedule(dynamic, anyChunkSizeGreaterThan1)
for (int i = 0; i < n; ++i) {
    … … …                  // work that runs in parallel
    #pragma omp ordered
    v.push_back(i);         // runs one iteration at a time, in loop order
}
```

```
tid  List of     Timeline
     iterations
0    0,1,2       ==o==o==o
1    3,4,5       ==.......o==o==o
2    6,7,8       ==..............o==o==o
```

= shows that the thread is executing code in parallel. o shows the thread executing the ordered region. . shows the thread being idle, waiting for its turn to execute the ordered region.
With schedule(static,1) the following would happen:

```
tid  List of     Timeline
     iterations
0    0,3,6       ==o==o==o
1    1,4,7       ==.o==o==o
2    2,5,8       ==..o==o==o
```
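A self-contained variant of the ordered example might look like the sketch below. The function name, the chunk size of 3 (matching the first timeline), and the squaring that stands in for the elided parallel work are all assumptions of this sketch.

```cpp
#include <vector>

// Returns 0, 1, ..., n-1 in that order, whatever the thread count.
std::vector<int> collect_in_order(int n) {
    std::vector<int> v;
    #pragma omp parallel for ordered schedule(dynamic, 3)
    for (int i = 0; i < n; ++i) {
        long work = static_cast<long>(i) * i;  // placeholder for the parallel part
        (void)work;
        #pragma omp ordered
        v.push_back(i);  // one thread at a time, in iteration order
    }
    return v;
}
```

Because the ordered region is entered strictly in iteration order, the shared vector needs no further locking here.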
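private variables are not initialised, i.e. they start with indeterminate values like any other local automatic variable.
firstprivate initialises each private copy with the value the variable had before the region.
lastprivate copies the value back to the original variable after the region. Here "last" does not mean the thread that actually finishes last in time, but the one that is last in the scheduling order, i.e. the one that executes the sequentially last iteration. Looked at another way, a value saved from an arbitrary thread would be meaningless.
firstprivate and lastprivate are just special cases of private.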
```cpp
#pragma omp parallel
{
    #pragma omp for lastprivate(i)
    for (i = 0; i < n - 1; i++)
        a[i] = b[i] + b[i + 1];
}
a[i] = b[i];  // i == n - 1 here
```
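To see firstprivate and lastprivate together, consider the following sketch. The function name and parameters are hypothetical, and it assumes a non-empty vector, because a lastprivate variable is left unspecified when the loop runs zero iterations.

```cpp
#include <vector>

// Fills out[i] = i + offset and returns the value written by the
// sequentially last iteration. Assumes out is non-empty.
int fill_with_offset(std::vector<int>& out, int offset) {
    const int n = static_cast<int>(out.size());
    int last = -1;
    // firstprivate: each thread's copy of offset starts with the caller's value.
    // lastprivate: after the loop, last holds the value from iteration n - 1.
    #pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (int i = 0; i < n; ++i) {
        last = i + offset;
        out[i] = last;
    }
    return last;
}
```

With a plain private(offset) instead, each thread would read an uninitialised copy of offset.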
### private vs threadprivate
A private variable is local to a region and will most of the time be placed on the stack. The variable stays private for the duration of the construct that carries the data-scoping clause. Every thread (including the master thread) makes a private copy of the original variable (the new variable is no longer storage-associated with the original variable).
A threadprivate variable, on the other hand, will most likely be placed on the heap or in thread-local storage (which can be seen as global memory local to a thread). A threadprivate variable persists across regions (subject to some restrictions). The master thread uses the original variable; all other threads make a private copy of it (the master thread's variable is still storage-associated with the original variable).
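One practical difference is that a threadprivate variable is a global, so functions called from inside the parallel region can reach each thread's copy. The sketch below illustrates this; `calls`, `record_call`, and `count_calls` are names made up for this example, and without OpenMP enabled the pragmas are ignored and `calls` is an ordinary global.

```cpp
// File-scope counter with one copy per thread.
static long calls = 0;
#pragma omp threadprivate(calls)

// Helper outside the parallel construct: it still updates the calling
// thread's own copy, which a private clause could not provide.
static void record_call() { ++calls; }

long count_calls(int n) {
    long total = 0;
    #pragma omp parallel reduction(+ : total)
    {
        calls = 0;              // each thread resets its own copy
        #pragma omp for
        for (int i = 0; i < n; ++i)
            record_call();      // no race: each thread touches only its copy
        total += calls;         // combine the per-thread counts
    }
    return total;
}
```

The sketch keeps everything in one parallel region so that it does not rely on the restrictions under which threadprivate values persist between regions.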
```
# shaojiemike @ node5 in ~ [11:26:56]
$ vncserver -list

TigerVNC server sessions:

X DISPLAY #     RFB PORT #      PROCESS ID
:1              5901            148718 (stale)

# shaojiemike @ node5 in ~ [11:29:39]
$ vncpasswd
Password:

# shaojiemike @ node5 in ~ [11:34:08]
$ vncserver -kill :1
Killing Xtigervnc process ID 148718... which was already dead
Cleaning stale pidfile '/home/shaojiemike/.vnc/node5:1.pid'!

# shaojiemike @ node5 in ~ [11:36:15]
$ vncserver

New 'node5:2 (shaojiemike)' desktop at :2 on machine node5

Starting applications specified in /etc/X11/Xvnc-session
Log file is /home/shaojiemike/.vnc/node5:2.log

Use xtigervncviewer -SecurityTypes VncAuth -passwd /home/shaojiemike/.vnc/passwd :2 to connect to the VNC server.
```
RV32E is a reduced version of RV32I designed for embedded systems. The only change is to reduce the number of integer registers to 16.
RV64I Base Integer Instruction Set
RV64I builds upon the RV32I variant. Note that the registers, and the addresses held in them, become 64 bits wide, while instructions are still 32 bits long.
register: RV64I widens the integer registers and supported user address space to 64 bits
To operate on 32-bit values in RV64I, append the suffix W to the instruction, for example ADDIW.
Additional instruction variants are provided to manipulate 32-bit values in RV64I, indicated by a ‘W’ suffix to the opcode. These “*W” instructions ignore the upper 32 bits of their inputs and always produce 32-bit signed values, which are sign-extended to 64 bits.
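The behaviour of a "*W" instruction can be modelled in a few lines. The sketch below models ADDW next to the plain 64-bit ADD; the function names are illustrative, and the conversion from unsigned to signed 32-bit relies on two's-complement wrapping, which mainstream compilers provide and C++20 guarantees.

```cpp
#include <cstdint>

// Model of ADDW: add the low 32 bits of both inputs, then sign-extend the
// 32-bit result to 64 bits. The upper 32 bits of the inputs are ignored.
std::int64_t addw(std::int64_t rs1, std::int64_t rs2) {
    const std::uint32_t low =
        static_cast<std::uint32_t>(rs1) + static_cast<std::uint32_t>(rs2);
    return static_cast<std::int32_t>(low);  // widening sign-extends
}

// Model of the plain 64-bit ADD, for comparison (wraps on overflow).
std::int64_t add(std::int64_t rs1, std::int64_t rs2) {
    return static_cast<std::int64_t>(
        static_cast<std::uint64_t>(rs1) + static_cast<std::uint64_t>(rs2));
}
```

For example, adding 1 to 0x7fffffff gives 2147483648 with `add` but -2147483648 with `addw`, because the 32-bit result has its sign bit set.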
Memory access
The LD instruction loads a 64-bit value from memory into register rd for RV64I.
The LW instruction loads a 32-bit value from memory and sign-extends this to 64 bits before storing it in register rd for RV64I. The LWU instruction, on the other hand, zero-extends the 32-bit value from memory for RV64I. LH and LHU are defined analogously for 16-bit values, as are LB and LBU for 8-bit values. The SD, SW, SH, and SB instructions store 64-bit, 32-bit, 16-bit, and 8-bit values from the low bits of register rs2 to memory respectively.
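The difference between LW and LWU only shows when bit 31 of the loaded word is set. A minimal model, with hypothetical function names and the destination register represented as a signed 64-bit integer:

```cpp
#include <cstdint>
#include <cstring>

// Model of LW: read 32 bits from memory and sign-extend to 64 bits.
std::int64_t load_lw(const void* addr) {
    std::int32_t word;
    std::memcpy(&word, addr, sizeof word);
    return word;  // signed widening sign-extends
}

// Model of LWU: read 32 bits from memory and zero-extend to 64 bits.
std::int64_t load_lwu(const void* addr) {
    std::uint32_t word;
    std::memcpy(&word, addr, sizeof word);
    return word;  // unsigned widening zero-extends
}
```

Loading the word 0x80000000 yields -2147483648 through `load_lw` and 2147483648 through `load_lwu`; for words below 0x80000000 the two agree.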
The x86 architecture has 8 General-Purpose Registers (GPR), 6 Segment Registers, 1 Flags Register and an Instruction Pointer. 64-bit x86 has additional registers.
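AMBA (Advanced Microcontroller Bus Architecture) is a bus architecture defined by ARM for connecting different functional blocks (such as CPU cores, memory controllers, and I/O ports). AMBA is an open standard for connecting and managing the components integrated on an SoC (System on Chip). It is designed for high-bandwidth, low-latency on-chip communication, ensuring efficient data transfer between components.
ARM's SCP and MCP firmware (System Control Processor & Management Control Processor firmware) is the firmware ARM provides for the system control processor and the management control processor. It typically handles system management tasks such as power management, system boot and monitoring, and security management. The SCP and MCP are dedicated processors or subsystems in the ARM architecture for system-level management and control.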
```
# 128-bit and
and %q3 %q7 -> %q3
and v3.16b, v3.16b, v7.16b
```

Both lines mean the same thing, but and v3.8h, v3.8h, v7.8h is not supported.
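A likely reason the .8h form is not needed is that a bitwise AND gives the same 128 bits whatever the lane width. The sketch below checks this in plain C++ (no NEON intrinsics); `Bytes`, `and_16b`, and `and_8h` are names invented for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

using Bytes = std::array<std::uint8_t, 16>;  // one 128-bit register as 16 bytes

// AND applied to sixteen 8-bit lanes, like the .16b arrangement.
Bytes and_16b(const Bytes& x, const Bytes& y) {
    Bytes r{};
    for (std::size_t i = 0; i < r.size(); ++i)
        r[i] = static_cast<std::uint8_t>(x[i] & y[i]);
    return r;
}

// AND applied to eight 16-bit lanes, what a hypothetical .8h form would do.
Bytes and_8h(const Bytes& x, const Bytes& y) {
    std::array<std::uint16_t, 8> a{}, b{};
    std::memcpy(a.data(), x.data(), 16);
    std::memcpy(b.data(), y.data(), 16);
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = static_cast<std::uint16_t>(a[i] & b[i]);
    Bytes r{};
    std::memcpy(r.data(), a.data(), 16);
    return r;
}
```

For any inputs, `and_16b(x, y) == and_8h(x, y)` holds, since each result bit depends only on the two input bits at the same position.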
```
DUP   // Duplicate general-purpose register to vector, or duplicate vector element to vector or scalar.
addp  // Add Pair of elements (scalar). This instruction adds two vector elements in the source SIMD&FP register and writes
      // the scalar result into the destination SIMD&FP register.
```
calculate
```
add
addp  // Add Pair of elements (scalar). This instruction adds two vector elements in the source SIMD&FP register and writes the scalar result into the destination SIMD&FP register.
adds  // Add, setting flags.
eor   // Bitwise Exclusive OR
orr   // Bitwise OR. MOV (register), which copies the value in a source register to the destination register, is an alias of ORR.
```
Address
```
ADRP  // Form PC-relative address to 4KB page.
```
Branch
```
b.cond  // Conditional branch, e.g. b.ne
bl      // Branch with Link branches to a PC-relative offset, setting the register X30 to PC+4.
        // It first saves the address of the next instruction in the link register (LR), then jumps to the label. Typically used to call a subroutine, which can return with mov pc, lr at its end (ret in AArch64).
blr     // Branch with Link to Register calls a subroutine at an address in a register, setting register X30 to PC+4.
cbnz    // Compare and Branch on Nonzero compares the value in a register with zero, and conditionally branches to a label at a PC-relative offset if the comparison is not equal. It provides a hint that this is not a subroutine call or return. This instruction does not affect the condition flags.
tbnz    // Test bit and Branch if Nonzero
ret     // Return from subroutine, branches unconditionally to an address in a register, with a hint that this is a subroutine return.
```
Load/Store
```
ldrb  // b means byte
ldar  // Load-Acquire Register (acquire semantics, e.g. when taking a lock)
STLR  // Store-Release Register (release semantics, e.g. when releasing a lock)
ldp   // load pair (two) registers
stp   // store pair (two) registers
ldr(b/h/sb/sh/sw)  // load register; sb/sh/sw mean signed byte/halfword/word
str   // store register
ldur  // load register (unscaled): in the machine code the offset is not encoded as a scaled offset the way ldr's is; also used when the offset is negative
prfm  // prefetch memory
```
Control/conditional
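```
ccmp   // Conditional compare
CMEQ   // Compare bitwise Equal (vector). This instruction compares each vector element from the first source SIMD&FP register with the corresponding vector element from the second source SIMD&FP register; equal elements produce all ones, others all zeros.
CSEL   // If the condition is true, Conditional Select writes the value of the first source register to the destination register. If the condition is false, it writes the value of the second source register to the destination register.
CSINC  // Conditional Select Increment returns the first source register if the condition is true, otherwise the second source register incremented by 1.
CSINV  // Conditional Select Invert returns the first source register if the condition is true, otherwise the bitwise inversion of the second source register.
CSNEG  // Conditional Select Negation returns the first source register if the condition is true, otherwise the negated value of the second source register.
```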
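The conditional-select family is easy to model with the ternary operator. This is a sketch of the 64-bit register forms; the condition is passed as a plain bool rather than being read from the flags, which is a simplification of this example.

```cpp
#include <cstdint>

// cond ? Rn : Rm
std::uint64_t csel(bool cond, std::uint64_t rn, std::uint64_t rm) {
    return cond ? rn : rm;
}

// cond ? Rn : Rm + 1
std::uint64_t csinc(bool cond, std::uint64_t rn, std::uint64_t rm) {
    return cond ? rn : rm + 1;
}

// cond ? Rn : ~Rm
std::uint64_t csinv(bool cond, std::uint64_t rn, std::uint64_t rm) {
    return cond ? rn : ~rm;
}

// cond ? Rn : -Rm (two's-complement negation)
std::uint64_t csneg(bool cond, std::uint64_t rn, std::uint64_t rm) {
    return cond ? rn : 0 - rm;
}
```

All four return the first source unchanged when the condition holds; they differ only in what is done to the second source otherwise.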
Logic&Move
```
ASRV      // Arithmetic Shift Right Variable
lsl       // logical shift left
orr       // bitwise OR
eor       // Bitwise Exclusive OR
TST/ANDS  // Test bits (immediate), setting the condition flags and discarding the result. Alias of ANDS.
MOVZ      // Move wide with zero moves an optionally-shifted 16-bit immediate value to a register
UBFM      // Unsigned Bitfield Move. This instruction is used by the aliases LSL (immediate), LSR (immediate), UBFIZ, UBFX, UXTB, and UXTH
BFM       // Bitfield Move
BIC (shifted register)  // Bitwise Bit Clear
CLZ       // Count Leading Zeros counts the number of binary zero bits before the first binary one bit in the value of the source register, and writes the result to the destination register.
REV, REV16, REVSH, and RBIT  // see below
REV       // Reverse byte order in a word.
REV16     // Reverse byte order in each halfword independently.
REVSH     // Reverse byte order in the bottom halfword, and sign extend to 32 bits.
RBIT      // Reverse the bit order in a 32-bit word.
```
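CLZ, REV, REV16, and RBIT are easiest to understand from a bit-level model. The functions below are a portable sketch for 32-bit values; the names are chosen for this example and are not intrinsics.

```cpp
#include <cstdint>

// CLZ: number of zero bits before the first one bit (32 when x is 0).
int clz32(std::uint32_t x) {
    int n = 0;
    for (std::uint32_t mask = 0x80000000u; mask != 0 && (x & mask) == 0; mask >>= 1)
        ++n;
    return n;
}

// REV: reverse the byte order of the word.
std::uint32_t rev32(std::uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

// REV16: reverse the byte order in each halfword independently.
std::uint32_t rev16(std::uint32_t x) {
    return ((x >> 8) & 0x00FF00FFu) | ((x << 8) & 0xFF00FF00u);
}

// RBIT: reverse the bit order of the word.
std::uint32_t rbit32(std::uint32_t x) {
    std::uint32_t r = 0;
    for (int i = 0; i < 32; ++i) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}
```

For instance, `rev32(0x12345678)` gives 0x78563412, while `rev16(0x12345678)` gives 0x34127856.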
Modifier
```
uxtb  // Unsigned extend byte: zero-extends a byte to 32 bits
```
system
```
dmb  // data memory barrier
SVC  // The SVC instruction causes an exception. This means that the processor mode changes to Supervisor.
```
The argument of the _mm_sin_ps intrinsic is a packed 128-bit vector of four 32-bit floating-point numbers. The intrinsic computes the sine of each of these four numbers and returns the four results in a packed 128-bit vector.
ISA
AVX2 & AVX
AVX2 builds on AVX and fills in some of the operations on the 256-bit registers.
FMA
Floating-point fused multiply-add/subtract.
Operates on 128-bit and 256-bit registers.
AVX_VNNI
AVX-VNNI is a VEX-coded variant of the AVX512-VNNI instruction set extension. It provides the same set of operations, but is limited to 256-bit vectors and does not support any additional features of EVEX encoding, such as broadcasting, opmask registers or accessing more than 16 vector registers. This extension makes it possible to support VNNI operations even when full AVX-512 support is not implemented by the processor.
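```
dpbusd   // _mm_dpbusd_avx_epi32
dpwssd   // b and w are the source element widths, byte and word; us and ss say whether the two operands a and b are signed (us: unsigned a, signed b; ss: both signed)
dpwssds  // the trailing s means signed saturation: the result is clamped instead of overflowing
```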
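What one 32-bit lane of these instructions computes can be written out in scalar code. The following is a model for illustration only, with made-up function names; it does not use the intrinsics, and the wrapping conversion to a signed 32-bit value assumes two's-complement behaviour (guaranteed from C++20).

```cpp
#include <cstdint>
#include <limits>

// dpbusd, one lane: four unsigned bytes of a times four signed bytes of b,
// summed and added to the accumulator, wrapping on overflow.
std::int32_t dpbusd_lane(std::int32_t acc, const std::uint8_t a[4], const std::int8_t b[4]) {
    std::int64_t sum = acc;
    for (int j = 0; j < 4; ++j)
        sum += static_cast<std::int32_t>(a[j]) * static_cast<std::int32_t>(b[j]);
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(sum));
}

// dpwssd, one lane: two signed words of a times two signed words of b,
// summed and added to the accumulator, wrapping on overflow.
std::int32_t dpwssd_lane(std::int32_t acc, const std::int16_t a[2], const std::int16_t b[2]) {
    std::int64_t sum = acc;
    for (int j = 0; j < 2; ++j)
        sum += static_cast<std::int32_t>(a[j]) * static_cast<std::int32_t>(b[j]);
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(sum));
}

// dpwssds, one lane: as dpwssd, but the result saturates to the int32 range.
std::int32_t dpwssds_lane(std::int32_t acc, const std::int16_t a[2], const std::int16_t b[2]) {
    std::int64_t sum = acc;
    for (int j = 0; j < 2; ++j)
        sum += static_cast<std::int32_t>(a[j]) * static_cast<std::int32_t>(b[j]);
    const std::int64_t hi = std::numeric_limits<std::int32_t>::max();
    const std::int64_t lo = std::numeric_limits<std::int32_t>::min();
    if (sum > hi) sum = hi;
    if (sum < lo) sum = lo;
    return static_cast<std::int32_t>(sum);
}
```

The vector instructions apply the same computation independently to every 32-bit lane of the register.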
AVX-512
To be covered when time permits.
KNC
The current generation of Intel Xeon Phi co-processors (codename “Knights Corner”, abbreviated KNC) supports a 512-bit SIMD instruction set called “Intel® Initial Many Core Instructions” (abbreviated Intel® IMCI).
Intel® Advanced Matrix Extensions (Intel® AMX) is a new 64-bit programming paradigm consisting of two components:
- A set of 2-dimensional registers (tiles) representing sub-arrays from a larger 2-dimensional memory image
- An accelerator that is able to operate on tiles; the first implementation of this accelerator is called TMUL (tile matrix multiply unit).
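This does not suit special-structure or sparse matrices; those are usually transformed and simplified first and then processed with SIMD.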
SVML
Short Vector Math Library Operations (SVML)
The Intel® oneAPI DPC++/C++ Compiler provides short vector math library (SVML) intrinsics to compute vector math functions. These intrinsics are available for IA-32 and Intel® 64 architectures running on supported operating systems. The prototypes for the SVML intrinsics are available in the immintrin.h file.
Using SVML intrinsics is faster than repeatedly calling the scalar math functions. However, the intrinsics differ from the scalar functions in accuracy.
```
_mm256_hadd_epi16   // Horizontally add, e.g. dst[15:0] := a[31:16] + a[15:0]
_mm256_mulhi_epi16  // Multiply the packed signed 16-bit integers in a and b, producing intermediate 32-bit integers, and store the high 16 bits of the intermediate integers in dst.
_mm256_sign_epi16   // Depending on the sign of b, store -a, 0, or a in dst
// Combined multiply-add and multiply-subtract operations also exist
```
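The per-element behaviour of these three intrinsics can be modelled in scalar code. The functions below are a sketch of what happens to a single 16-bit element (or adjacent pair, for the horizontal add); the names are invented here, and the real intrinsics apply the same operation across the whole 256-bit vector.

```cpp
#include <cstdint>

// One output element of a horizontal add: the sum of two adjacent
// 16-bit elements, wrapping on overflow.
std::int16_t hadd_epi16_pair(std::int16_t low, std::int16_t high) {
    const std::uint32_t sum = static_cast<std::uint16_t>(low) +
                              static_cast<std::uint16_t>(high);
    return static_cast<std::int16_t>(static_cast<std::uint16_t>(sum));
}

// mulhi: high 16 bits of the signed 32-bit product.
std::int16_t mulhi_epi16_lane(std::int16_t a, std::int16_t b) {
    const std::int32_t prod = static_cast<std::int32_t>(a) * static_cast<std::int32_t>(b);
    return static_cast<std::int16_t>(
        static_cast<std::uint16_t>(static_cast<std::uint32_t>(prod) >> 16));
}

// sign: -a if b is negative, 0 if b is zero, a otherwise.
std::int16_t sign_epi16_lane(std::int16_t a, std::int16_t b) {
    if (b < 0) return static_cast<std::int16_t>(-static_cast<std::int32_t>(a));
    if (b == 0) return 0;
    return a;
}
```

As a check, `mulhi_epi16_lane(16384, 4)` is 1, because the 32-bit product 65536 has 1 in its upper half.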