SHAOJIE'S BOOK

Posted 2023-07-31 | Updated 2025-01-30 | Architecture | 3 minutes read (About 502 words)

Double 2 int8

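Introduction

The CPU-side double2int8 conversion that seemed so magical in the earlier IPCC competition is really just the low-bit quantization idea that is common in AI inference.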
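As a rough illustration of the low-bit quantization idea, here is a minimal sketch of symmetric double-to-int8 quantization. It assumes a single scale for the whole array and a clamp to [-127, 127]; the names `quantize_int8`, `dequantize_int8`, and `round_half_away` are hypothetical, and the competition code may well have used a different scheme.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round to nearest, halves away from zero, without needing libm. */
static double round_half_away(double v) {
    return (v >= 0.0) ? (double)(long long)(v + 0.5)
                      : (double)(long long)(v - 0.5);
}

/* Symmetric quantization (assumed scheme): the largest magnitude maps to 127.
 * Writes n int8 values to q and returns the scale used. */
static double quantize_int8(const double *x, int8_t *q, size_t n) {
    assert(x != NULL && q != NULL);
    double max_abs = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double a = (x[i] < 0.0) ? -x[i] : x[i];
        if (a > max_abs) max_abs = a;
    }
    double scale = (max_abs > 0.0) ? max_abs / 127.0 : 1.0;
    for (size_t i = 0; i < n; ++i) {
        double r = round_half_away(x[i] / scale);
        if (r > 127.0) r = 127.0;    /* clamp to the int8 range */
        if (r < -127.0) r = -127.0;
        q[i] = (int8_t)r;
    }
    return scale;
}

/* Approximate reconstruction of the original value. */
static double dequantize_int8(int8_t q, double scale) {
    return (double)q * scale;
}
```

The int8 values can then go through integer dot-product instructions, with the scale applied once at the end.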

Posted 2021-12-08 | Updated 2025-01-30 | Architecture | 8 minutes read (About 1237 words)

Assembly Arm

The differences between x86 and Arm registers are covered in the Arm post.

arm

https://developer.arm.com/documentation/dui0068/b/CIHEDHIF

The four Arm addressing modes

ldr & str

Aarch64

Arm A64 Instruction Set Architecture
https://modexp.wordpress.com/2018/10/30/arm64-assembly/

Reading the documentation directly is the most effective approach: Arm® A64 Instruction Set Architecture, Armv8, for Armv8-A architecture profile.

Instruction suffix notes

From ARMv8 Instruction Set Overview, section 4.2 Instruction Mnemonics:

The container is one of:

The subtype is one of:

combine

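<name>{<subtype>} <container>

Pay attention to which operand the suffix applies to.

Instruction quick reference

Look up instructions on the official site: https://developer.arm.com/architectures/instruction-sets/intrinsics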

https://armconverter.com/?disasm&code=0786b04e

SIMD/vector

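Almost every instruction can operate on different register types, vector or scalar. Take add: unlike x86, Arm has no separately named instruction such as vadd or addps, so the only way to tell the forms apart is to check whether the operands are vector registers.

According to the figure, there is indeed an add that performs vector operations. FADD means floating-point add, and ADDP adds adjacent pairs of elements and writes the results to the destination register; neither name determines whether the operation is scalar or vector. ADDV reduces all elements of one vector register to a single sum, so it really is a vector-only instruction.

Because the total width must be 64 or 128 bits, only the following arrangements are possible: 8B, 16B, 4H, 8H, 2S, 4S, 1D, 2D.

One more notation needs attention: bitwise instructions do not care about the register shape.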

# 128-bit AND
and %q3 %q7 -> %q3
and v3.16b, v3.16b, v7.16b

Both lines mean the same thing, but and v3.8h, v3.8h, v7.8h is not supported: the vector AND only accepts the 8B and 16B arrangements.

DUP // Duplicate general-purpose register to vector, or duplicate vector element to vector or scalar.
addp // Add Pair of elements (scalar). This instruction adds two vector elements in the source SIMD&FP register and writes the scalar result into the destination SIMD&FP register.

calculate

add
addp //Add Pair of elements (scalar). This instruction adds two vector elements in the source SIMD&FP register and writes the scalar result into the destination SIMD&FP register.
adds // Add, setting flags.
eor // Bitwise Exclusive OR
mov // Move (register) copies the value in a source register to the destination register. Alias of ORR.

Address

ADRP // Form PC-relative address to 4KB page.

Branch

b.cond // conditional branch, e.g. b.ne
bl // Branch with Link branches to a PC-relative offset, setting the register X30 to PC+4.
// Branch with link: first saves the address of the next instruction in the LR register (X30), then jumps to the label. Typically used to call a subroutine. The subroutine returns with ret; mov pc, lr is the 32-bit Arm idiom, and AArch64 cannot write PC directly.
blr // Branch with Link to Register calls a subroutine at an address in a register, setting register X30 to PC+4.
cbnz // Compare and Branch on Nonzero compares the value in a register with zero, and conditionally branches to a label at a PC-relative offset if the comparison is not equal. It provides a hint that this is not a subroutine call or return. This instruction does not affect the condition flags.
tbnz // Test bit and Branch if Nonzero
ret //Return from subroutine, branches unconditionally to an address in a register, with a hint that this is a subroutine return.

Load/Store

ldrb // b means byte
ldar // LDAR Load-Acquire Register (acquire semantics, e.g. when taking a lock)
STLR // Store-Release Register (release semantics, e.g. when releasing a lock)
ldp // load pair (two) of registers
stp // store pair (two) of registers
ldr(b/h/sb/sh/sw) // load register; sb/sh/sw means signed byte/halfword/word
str // store register
ldur // load register (unscaled): in the machine code the offset is not encoded as a scaled offset the way ldr encodes it; also used when the offset is negative.
prfm // prefetch memory

Control/conditional

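ccmp // conditional compare
CMEQ // Compare bitwise Equal (vector). This instruction compares each vector element from the first source SIMD&FP register with the corresponding vector element from the second source SIMD&FP register.
CSEL // If the condition is true, Conditional Select writes the value of the first source register to the destination register. If the condition is false, it writes the value of the second source register to the destination register.
CSINC // Conditional Select Increment returns the first source register if the condition is true, otherwise the second source register incremented by 1.
CSINV // Conditional Select Invert returns the first source register if the condition is true, otherwise the bitwise inversion of the second source register.
CSNEG // Conditional Select Negation returns the first source register if the condition is true, otherwise the negated value of the second source register.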
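The conditional-select family can be modelled in plain C. This is only a sketch of the 64-bit register forms; the function names mirror the mnemonics but are hypothetical helpers, not compiler intrinsics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Each helper returns a when the condition holds, otherwise a function of b. */
static uint64_t csel(bool cond, uint64_t a, uint64_t b)  { return cond ? a : b; }
static uint64_t csinc(bool cond, uint64_t a, uint64_t b) { return cond ? a : b + 1; }
static uint64_t csinv(bool cond, uint64_t a, uint64_t b) { return cond ? a : ~b; }
static uint64_t csneg(bool cond, uint64_t a, uint64_t b) { return cond ? a : (uint64_t)0 - b; }
```

Compilers typically use these instructions to turn short branches such as `x = c ? a : b + 1` into branch-free code.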

Logic&Move

ASRV //Arithmetic Shift Right Variable
lsl // logical shift left
orr // bitwise OR
eor // Bitwise Exclusive OR
TST/ANDS // Test bits (immediate), setting the condition flags and discarding the result. Alias of ANDS.
MOVZ // Move wide with zero moves an optionally-shifted 16-bit immediate value to a register
UBFM // Unsigned Bitfield Move. This instruction is used by the aliases LSL (immediate), LSR (immediate), UBFIZ, UBFX, UXTB, and UXTH
BFM // Bitfield Move
BIC (shifted register) // Bitwise Bit Clear
CLZ // Count Leading Zeros counts the number of binary zero bits before the first binary one bit in the value of the source register, and writes the result to the destination register.
REV, REV16, REVSH, and RBIT // below
REV //Reverse byte order in a word.
REV16 //Reverse byte order in each halfword independently.
REVSH //Reverse byte order in the bottom halfword, and sign extend to 32 bits.
RBIT //Reverse the bit order in a 32-bit word.
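To make the bit-manipulation descriptions concrete, here is a portable C sketch of what REV, REV16, RBIT, and CLZ compute on a 32-bit value. The helper names are hypothetical; real code would let the compiler emit the instructions, for example through builtins.

```c
#include <assert.h>
#include <stdint.h>

/* REV: reverse the byte order of a 32-bit word. */
static uint32_t rev32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* REV16: reverse the byte order inside each halfword independently. */
static uint32_t rev16_32(uint32_t x) {
    return ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
}

/* RBIT: reverse the bit order of a 32-bit word. */
static uint32_t rbit32(uint32_t x) {
    uint32_t r = 0;
    for (int i = 0; i < 32; ++i) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* CLZ: count zero bits above the first set bit; yields 32 for x == 0. */
static unsigned clz32(uint32_t x) {
    unsigned n = 0;
    for (uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1)
        ++n;
    return n;
}
```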

Modifier

uxtb // Unsigned extend byte: zero-extends one byte to 32 bits

system

dmb // data memory barrier
SVC // The SVC instruction causes an exception. This means that the processor mode changes to Supervisor.

ARM no push/pop

PUSH {r3}
POP {r3}

are aliases for

str r3, [sp, #-4]!
ldr r3, [sp], #4

Further study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants

References

https://www.cs.virginia.edu/~evans/cs216/guides/x86.html

https://blog.csdn.net/gaojinshan/article/details/11534569

Posted 2021-11-03 | Updated 2025-01-30 | Architecture | 3 minutes read (About 449 words)

Intel® Intrinsics Guide

Notation

The _mm_sin_ps intrinsic takes a packed 128-bit vector of four 32-bit single-precision floating-point numbers. It computes the sine of each of these four numbers and returns the four results in a packed 128-bit vector.

ISA

AVX2 & AVX

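AVX2 builds on AVX by completing the operations available on 256-bit registers, notably 256-bit integer instructions.

FMA

Floating-point fused multiply-add/subtract

Covers both 128-bit and 256-bit registers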

AVX_VNNI

AVX-VNNI is a VEX-coded variant of the AVX512-VNNI instruction set extension. It provides the same set of operations, but is limited to 256-bit vectors and does not support any additional features of EVEX encoding, such as broadcasting, opmask registers or accessing more than 16 vector registers. This extension makes it possible to support VNNI operations even when full AVX-512 support is not implemented by the processor.

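dpbusd // _mm_dpbusd_avx_epi32
dpwssd // b and w stand for byte and word (16-bit); us and ss say whether the two operands a and b are unsigned or signed
dpwssds // the trailing s means signed saturation: the result is clamped to the representable range instead of wrapping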
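A scalar model may help with reading the mnemonics. The sketch below computes one 32-bit lane of dpbusd (unsigned bytes times signed bytes, wrapping accumulate) and one lane of dpwssds (signed words times signed words, saturating accumulate). The function names are hypothetical; the hardware instructions process every lane of the vector at once.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of dpbusd: acc + sum of four (unsigned byte * signed byte) products.
 * The accumulation wraps on overflow. */
static int32_t dpbusd_lane(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
    uint32_t sum = (uint32_t)acc;
    for (int i = 0; i < 4; ++i)
        sum += (uint32_t)((int32_t)a[i] * (int32_t)b[i]);
    return (int32_t)sum;
}

/* One lane of dpwssds: acc + two (signed word * signed word) products,
 * clamped to the int32 range instead of wrapping. */
static int32_t dpwssds_lane(int32_t acc, const int16_t a[2], const int16_t b[2]) {
    int64_t sum = (int64_t)acc + (int64_t)a[0] * b[0] + (int64_t)a[1] * b[1];
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}
```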

AVX-512

To be covered when time permits.

KNC

The Intel Xeon Phi co-processors code-named "Knights Corner" (abbreviated KNC) support a 512-bit SIMD instruction set called "Intel® Initial Many Core Instructions" (abbreviated Intel® IMCI).

https://stackoverflow.com/questions/22670205/are-there-simdsse-avx-instructions-in-the-x86-compatible-accelerators-intel

AMX

Intel® Advanced Matrix Extensions (Intel® AMX) is a new 64-bit programming paradigm consisting of two components:

A set of 2-dimensional registers (tiles) representing sub-arrays from a larger 2-dimensional memory image
An accelerator that is able to operate on tiles; the first implementation of this accelerator is called TMUL (tile matrix multiply unit).

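This does not suit special-structured or sparse matrices; those are usually transformed and simplified first and then handled with SIMD.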

SVML

Short Vector Math Library Operations (SVML)

The Intel® oneAPI DPC++/C++ Compiler provides short vector math library (SVML) intrinsics to compute vector math functions. These intrinsics are available for IA-32 and Intel® 64 architectures running on supported operating systems. The prototypes for the SVML intrinsics are available in the immintrin.h file.

Using SVML intrinsics is faster than repeatedly calling the scalar math functions. However, the intrinsics differ from the scalar functions in accuracy.

Further study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants

References

Posted 2021-11-03 | Updated 2025-01-30 | Architecture | 5 minutes read (About 747 words)

Manual AVX256 SIMD

Type differences

The __m256 data type can hold eight 32-bit floating-point values.

The __m256d data type can hold four 64-bit double precision floating-point values.

The __m256i data type can hold thirty-two 8-bit, sixteen 16-bit, eight 32-bit, or four 64-bit integer values

Vector prefetch

_mm512_mask_prefetch_i32extgather_ps

Load & Store

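__m256i _mm256_loadu_epi32 (void const* mem_addr) // load 256 contiguous bits as 32-bit ints; no alignment required
_mm256_lddqu_si256 // consider this one if the compiler does not recognize the intrinsic above
__m256d _mm256_loadu_pd (double const * mem_addr) // load 4 contiguous doubles
__m256d _mm256_broadcast_sd (double const * mem_addr) // load one double and replicate it 4 times
__m256d _mm256_i64gather_pd (double const* base_addr, __m256i vindex, const int scale) // gather: load from strided or indexed addresses
scatter // the store counterpart of gather
_mm512_mask_prefetch_i32extgather_ps // selective (masked) prefetch
mask // masked variants skip a lane, zero it, and so on, according to the mask

_mm256_stream_pd // writes straight to memory, bypassing the cache, but requires alignment
_mm_storeu_si128 // stores integer data to memory; no alignment required

Non-contiguous loads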

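long long vindexList[4] = {0, 2, 4, 6};
__m256i vindex = _mm256_loadu_si256((const __m256i *)vindexList); // AVX load; _mm256_loadu_epi64 needs AVX-512VL
__m256d vj1 = _mm256_i64gather_pd(&rebuiltCoord[jj*k], vindex, 8); // scale is in bytes: 8 = sizeof(double), so this reads elements 0, 2, 4, 6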
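The scale argument of the gather is easy to get wrong, so here is a scalar model of what a 4-lane double gather does with it. It assumes the documented addressing rule, base address plus index times scale, measured in bytes; `gather_pd_model` is a hypothetical helper, not an intrinsic.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a 4-lane double gather:
 * dst[i] = the double stored at byte address base + vindex[i] * scale. */
static void gather_pd_model(double dst[4], const double *base,
                            const int64_t vindex[4], int scale) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    for (int i = 0; i < 4; ++i) {
        const unsigned char *addr =
            (const unsigned char *)base + vindex[i] * (int64_t)scale;
        memcpy(&dst[i], addr, sizeof(double)); /* byte copy: addr may be unaligned */
    }
}
```

With indices {0, 2, 4, 6}, a scale of 8 picks every other double, while a scale of 1 would read overlapping bytes at offsets 0, 2, 4, and 6.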

Setting each element

__m256d _mm256_set_pd (double e3, double e2, double e1, double e0) // set four individual elements
__m256d _mm256_set1_pd (double a) // set every element to the same value

Arithmetic

_mm256_hadd_epi16 // Horizontally add eg.dst[15:0] := a[31:16] + a[15:0]
_mm256_mulhi_epi16 // Multiply the packed signed 16-bit integers in a and b, producing intermediate 32-bit integers, and store the high 16 bits of the intermediate integers in dst.
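_mm256_sign_epi16 // stores -a, 0, or a into dst depending on whether b is negative, zero, or positive
// fused multiply-add and multiply-subtract combinations are also available

Horizontal reduction

_mm256_reduce_add_ph // sum of all elements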
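For plain AVX doubles, a horizontal sum can be assembled from a few shuffles. The sketch below assumes GCC or Clang on x86-64 (it relies on the `target` attribute and `__builtin_cpu_supports`) and falls back to scalar code elsewhere; `hsum4` and `hsum4_avx` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

/* Sum the four doubles at p using AVX: fold 256 -> 128 -> 64 bits. */
__attribute__((target("avx")))
static double hsum4_avx(const double *p) {
    __m256d v  = _mm256_loadu_pd(p);            /* {p0, p1, p2, p3} */
    __m128d lo = _mm256_castpd256_pd128(v);     /* {p0, p1} */
    __m128d hi = _mm256_extractf128_pd(v, 1);   /* {p2, p3} */
    __m128d s  = _mm_add_pd(lo, hi);            /* {p0+p2, p1+p3} */
    __m128d sh = _mm_unpackhi_pd(s, s);         /* {p1+p3, p1+p3} */
    return _mm_cvtsd_f64(_mm_add_sd(s, sh));    /* (p0+p2) + (p1+p3) */
}
#endif

/* Dispatch at run time; the scalar path adds in the same order. */
static double hsum4(const double *p) {
    assert(p != NULL);
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx"))
        return hsum4_avx(p);
#endif
    return (p[0] + p[2]) + (p[1] + p[3]);
}
```

In a real loop the reduction would be done once after accumulating in a vector register, not once per iteration.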

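Hand-written vector floating-point abs

// Assigning 0x7fffffffffffffff to a double converts the integer value instead of copying its bits,
// so build the mask from the integer bit pattern (every bit set except the sign bit).
__m256d vDP_SIGN_Mask = _mm256_castsi256_pd(_mm256_set1_epi64x(0x7fffffffffffffffLL));
vj1 = _mm256_and_pd(vj1, vDP_SIGN_Mask);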
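The mask trick depends on the bit pattern 0x7fffffffffffffff, not on the numeric value of that constant. A scalar sketch, with the hypothetical helpers `abs_by_mask` and `bits_of`, shows both the technique and the pitfall of storing the constant in a double:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the raw IEEE-754 bits of a double. */
static uint64_t bits_of(double x) {
    uint64_t b;
    memcpy(&b, &x, sizeof b);
    return b;
}

/* abs(x) by clearing the sign bit (bit 63) of the representation. */
static double abs_by_mask(double x) {
    uint64_t bits = bits_of(x) & UINT64_C(0x7fffffffffffffff);
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Converting the integer 0x7fffffffffffffff to double yields the value 2^63, whose bit pattern is 0x43e0000000000000, so using that double as an AND mask would clear most of the mantissa as well as the sign.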

Shift

_mm_bsrli_si128 // byte shift right
_mm_slli_epi16 // shift left

logic

_mm_test_all_zeros
_mm_test_all_ones // test whether all bits are 0 or all bits are 1

Elementary Math Functions

Vectorized negation and sqrt

Convert

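_mm256_cvtepi32_pd // Convert_Int32_To_FP64

Compare

_mm256_cmp_pd // compares packed 64-bit doubles using the predicate given as an immediate

Swizzle (shuffle/blend)

_mm256_blendv_pd // selects each element from a or b according to the mask and writes it to dst
_mm_blend_epi32 // selects 32-bit elements from a or b according to an immediate mask
_mm256_permute4x64_epi64 // permutes 64-bit elements across the register, e.g. copies the high half to the low half
VEXTRACTF128 __m128d _mm256_extractf128_pd (__m256d a, int offset); // extracts one 128-bit half of the register
VUNPCKHPD __m512d _mm512_unpackhi_pd( __m512d a, __m512d b); // interleaves the high double of each 128-bit lane of a and b

Type conversion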

__m256d _mm256_undefined_pd (void)
__m128i low = _mm256_castsi256_si128(v); // casts __m256i to __m128i: the lower 128 bits of the source pass through to the result unchanged. This intrinsic adds no extra operations to the generated code.

Select4(SRC, control) {
CASE (control[1:0]) OF
    0: TMP ←SRC[31:0];
    1: TMP ←SRC[63:32];
    2: TMP ←SRC[95:64];
    3: TMP ←SRC[127:96];
ESAC;
RETURN TMP
}

VSHUFPS (VEX.128 encoded version)
DEST[31:0]  ←Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] ←Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] ←Select4(SRC2[127:0], imm8[5:4]);
DEST[127:96]←Select4(SRC2[127:0], imm8[7:6]);
DEST[MAXVL-1:128] ←0
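The pseudocode above translates almost line by line into C. This is a scalar model of the 128-bit form only, with hypothetical helper names; the matching intrinsic on real hardware is `_mm_shuffle_ps`.

```c
#include <assert.h>
#include <string.h>

/* Select4: pick one of four floats using the low two bits of control. */
static float select4(const float src[4], unsigned control) {
    return src[control & 3u];
}

/* VSHUFPS, 128-bit form: the low two results come from src1 and the
 * high two from src2, each chosen by a 2-bit field of imm8. */
static void vshufps128(float dest[4], const float src1[4],
                       const float src2[4], unsigned imm8) {
    float tmp[4];
    tmp[0] = select4(src1, imm8);
    tmp[1] = select4(src1, imm8 >> 2);
    tmp[2] = select4(src2, imm8 >> 4);
    tmp[3] = select4(src2, imm8 >> 6);
    memcpy(dest, tmp, sizeof tmp); /* temporary lets dest alias a source */
}
```

For example, imm8 = 0x1B has the fields 3, 2, 1, 0 from low to high, so the result is {src1[3], src1[2], src2[1], src2[0]}.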

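After that, convert the float values to double and then sum them.

Further study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants

References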

https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_loadu_pd&ig_expand=4317

Posted 2021-10-24 | Updated 2025-01-30 | Architecture | a few seconds read (About 104 words)

Arm - Neon

https://community.arm.com/arm-community-blogs/b/operating-systems-blog/posts/arm-neon-programming-quick-reference

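Checking Arm CPU vectorization support

Vector instructions

Further study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants

References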

https://blog.csdn.net/heliangbin87/article/details/79581113?spm=1001.2101.3001.6650.1&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7Edefault-1.no_search_link&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7Edefault-1.no_search_link

Posted 2021-10-20 | Updated 2025-01-30 | Architecture | 8 minutes read (About 1258 words)

SIMD+SSE+AVX

SIMD

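SIMD stands for Single Instruction, Multiple Data: a family of instructions that packs multiple operands into wide registers and applies one operation to all of them at once.

With vector registers, once an instruction is decoded, several execution units work in parallel, fetching all the operands at once and operating on them. This makes SIMD particularly well suited to data-intensive workloads such as multimedia applications. AMD's 3DNow! technology is one example.

MMX

MMX is a SIMD multimedia instruction set made up of 57 instructions. MMX treats a 64-bit register as two 32-bit or eight 8-bit values and handles integer arithmetic only. There are eight such 64-bit registers, named MM0 to MM7. They are not dedicated MMX registers: they are borrowed from the FPU (the 64-bit MMX registers are really aliases of the floating-point registers), so MMX instructions and floating-point operations cannot be in use at the same time. To cut the time lost switching between MMX and floating-point modes, programmers keep the number of mode switches as low as possible; in practice the two kinds of operation are mutually exclusive.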

SSE

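SSE stands for Streaming SIMD Extensions. Intel SSE instructions use dedicated 128-bit registers and operate on 128 bits of data at a time. A float is a single-precision floating-point number occupying 32 bits, so one SSE instruction can compute 4 floats at once. Note that SSE instructions taking a memory operand generally require the address to be aligned to a 16-byte boundary; the explicitly unaligned moves such as movups are the exception.

How many dedicated SSE vector registers are there? Is there one set per core?

SSE has eight 128-bit registers, XMM0 to XMM7 (x86-64 extends this to sixteen, XMM0 to XMM15). SSE also adds a new control/status register, MXCSR. Answering the question requires looking at the CPU architecture: each core has its own private set of registers.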

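SSE assembly instructions

addps xmm0, xmm1 ; reg-reg
addps xmm0, [ebx] ; reg-mem
SSE provides two versions of each instruction. The first ends in the suffix ps and operates on packed single-precision floating-point values, much like the packed MMX operations; the second ends in ss and operates only on the lowest (scalar) single-precision element.

SSE intrinsics

load family, e.g. __m128 _mm_load_ss (float *p)
set family, e.g. __m128 _mm_set_ss (float w)
other operations, e.g. __m128 _mm_add_ss (__m128 a, __m128 b), covering floating-point add, subtract, multiply, divide, square root, maximum, minimum, approximate reciprocal, reciprocal square root, and so on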
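As a small illustration of the packed (ps) intrinsics, the sketch below adds two float arrays four elements at a time. It assumes an x86-64 target, where SSE is always available, and uses a scalar loop for the tail and for other architectures; `add_floats` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__)
#include <xmmintrin.h>
#endif

/* out[i] = a[i] + b[i] for i in [0, n). */
static void add_floats(const float *a, const float *b, float *out, size_t n) {
    assert(a != NULL && b != NULL && out != NULL);
    size_t i = 0;
#if defined(__x86_64__)
    /* Four floats per iteration; the unaligned load/store variants
     * do not require 16-byte alignment. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop on non-x86-64 targets. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

If the arrays are known to be 16-byte aligned, `_mm_load_ps` and `_mm_store_ps` can replace the unaligned variants.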

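Evolution of the SSE instruction sets

SSE2 adds double-precision support. Because the registers did not get wider, it handles either two double-precision or four single-precision operations at a time. It also implements integer arithmetic on these registers, replacing MMX.
SSE3 supports some more complex arithmetic operations.
SSE4 adds more instructions and puts a lot of work into data movement: it supports unaligned data movement and adds the super shuffle engine.
In August 2007 AMD was the first to announce its SSE5 instruction set. Intel then named its new instruction set AVX. Because SSE5 and AVX offer similar functionality and AVX includes more attractive features, AMD decided to support AVX.

AVX

Advanced Vector Extensions. Recent Intel CPUs all support AVX. It operates on 256 bits of data at a time, twice the width of SSE, so a single AVX instruction can compute 8 floats. Only the explicitly aligned AVX loads and stores (such as vmovaps) require the memory address to be aligned to a 32-byte boundary.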

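Evolution of SSE and AVX

Performance comparison

According to the reference article, compiling the AVX version of the code with gcc requires the -mavx option.

With -O3 enabled, there is generally no need to rewrite the code by hand to unroll the computation or align memory.

Checking whether code was vectorized: read the assembly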

GNU

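gcc -march=native -c -Q --help=target # list the enabled target options, including instruction sets
g++ -O2 -ftree-vectorize -fopt-info-vec -S foo.cpp -o /dev/stdout | c++filt # print the assembly; -fopt-info-vec replaces the obsolete -ftree-vectorizer-verbose=9
objdump -d # disassemble an object file

After compilation on Linux, a C++ function name looks like this:

_ZNK4Json5ValueixEPKc

Run c++filt on the Linux command line:

$ c++filt _ZNK4Json5ValueixEPKc
Json::Value::operator[](char const*) const

This recovers the original function name so that tracing can continue.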

intel icpc

clang

-Rpass=loop-vectorize 
identifies loops that were successfully vectorized.

-Rpass-missed=loop-vectorize 
identifies loops that failed vectorization and indicates if vectorization was specified.

-Rpass-analysis=loop-vectorize 
identifies the statements that caused vectorization to fail.
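A loop like the following is a convenient test case for these remarks. It is only a sketch: `saxpy` is a hypothetical example function, and whether a given compiler version vectorizes it depends on the optimization level and target.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = alpha * x[i] + y[i]; restrict tells the compiler that
 * x and y do not overlap, which removes one obstacle to vectorization. */
static void saxpy(size_t n, float alpha,
                  const float *restrict x, float *restrict y) {
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}
```

Compiling a file that contains this function with clang at -O2 plus -Rpass=loop-vectorize should report the loop if it was vectorized; -Rpass-missed=loop-vectorize reports it otherwise.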

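Common assembly

xmm registers
movsd

MMX instructions

Manual vectorization

Unroll the loop 8 times

Example 1

SIMD registers

Further study needed

None yet

Problems encountered

None yet

References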

https://www.dazhuanlan.com/2020/02/01/5e3475c89d5bd/

https://software.intel.com/sites/landingpage/IntrinsicsGuide/
