SHAOJIE'S BOOK

Posted 2023-06-26Updated 2025-01-30Architecture13 minutes read (About 1908 words)

Assembly X86

关于X86 与 arm的寄存器的区别写在了arm那篇下

IDA analysis

word/ dword/ qword

In x86 terminology/documentation, a “word” is 16 bits

x86 word = 2 bytes

x86 dword = 4 bytes (double word)

x86 qword = 8 bytes (quad word)

x86 double-quad or xmmword = 16 bytes, e.g. movdqa xmm0, [rdi].

常见X86汇编

https://en.wikipedia.org/wiki/X86_instruction_listings

https://www.felixcloutier.com/x86/

https://officedaytime.com/simd512e/

官方手册第一个4800页

SHR	    # Shift right (unsigned shift right)
SAL       # Shift Arithmetically left (signed shift left)
lea       # Load Effective Address, like mov but not change Flags, can store in any register, three opts
imul      # Signed multiply
movslq    # Move doubleword to quadword with sign-extension.
movl $0x46dd0bfe, 0x804a1dc #将数值0x46dd0bfe放入0x804a1dc的地址中
movl 0x46dd0bfe, 0x804a1dc #将0x46dd0bfe地址里的内容放入0x804a1dc地址中

lea & leaq

1 2	lea -0xc(%ebp),%eax mov %eax,0x8(%esp) #常见于scanf第三个参数，lea传结果写入地址

1
2
3

// x is %rdi, result is %rax 就是计算地址，没有寻址操作
lea    0x0(,%rdi,8),%rax //result = x * 8;
lea    0x4b(,%rdi),%rax //result = x + 0x4b;

call & ret

Call 地址：返回地址入栈（等价于“Push %eip，mov 地址，%eip”；注意eip指向下一条尚未执行的指令）
ret：从栈中弹出地址，并跳到那个地址（pop %eip）

leave

leave：使栈做好返回准备，等价于

1 2	mov %ebp，%esp pop %ebp

compare order

1 2	cmpl $0x5,$0x1 jle 8048bc5 # Jump if Less or Equal 会触发，前面的 1<=5

X86 load store

X86 不像 ARM有专门的ldr， str指令。是通过mov实现的

movswl (%rdi), %eax sign-extending load from word (w) to dword (l). Intel movsx eax, word [rdi]

AVX

https://docs.oracle.com/cd/E36784_01/html/E36859/gntbd.html

vxorpd   XORPD
Bitwise Logical XOR for Double-Precision Floating-Point Values

vxorps   XORPS
Bitwise Logical XOR for Single-Precision Floating-Point Values

vmovaps  MOVAPS
Move Aligned Packed Single-Precision Floating-Point Values

test & jump

1 2	test al, al jne 0x1000bffcc

The test instruction performs a logical and of the two operands and sets the CPU flags register according to the result (which is not stored anywhere). If al is zero, the anded result is zero and that sets the Z flag. If al is nonzero, it clears the Z flag. (Other flags, such as Carry, oVerflow, Sign, Parity, etc. are affected too, but this code has no instruction testing them.)

The jne instruction alters EIP if the Z flag is not set. There is another mnemonic for the same operation called jnz.

1 2	test %eax,%eax jg <phase_4+0x35> # eax & eax > 0 jump

注意 cmp不等于 test

The TEST operation sets the flags CF and OF to zero.

The SF is set to the MSB(most significant bit) of the result of the AND.

If the result of the AND is 0, the ZF is set to 1, otherwise set to 0.

kinds of jump

AT&T syntax jmpq *0x402390(,%rax,8) into INTEL-syntax: jmp [RAX*8 + 0x402390].

ja VS jg

JUMP IF ABOVE AND JUMP IF GREATER

ja jumps if CF = 0 and ZF = 0 (unsigned Above: no carry and not equal)

jg jumps if SF = OF and ZF = 0 (signed Greater, excluding equal)

FLAGS

cmp performs a sub (but does not keep the result).

cmp eax, ebx

Let’s do the same by hand:

reg     hex value   binary value  

eax = 0xdeadc0de    ‭11011110101011011100000011011110‬
ebx = 0x1337ca5e    ‭00010011001101111100101001011110‬
 -    ----------
res   0xCB75F680    11001011011101011111011010000000

The flags are set as follows:

OF (overflow) : did bit 31 change      -> no
SF (sign)     : is bit 31 set          -> yes
CF (carry)    : is abs(ebx) < abs(eax) -> no  
ZF (zero)     : is result zero         -> no
PF (parity)   : is parity of LSB even  -> no (archaic)
AF (Adjust)   : overflow in bits 0123  -> archaic, for BCD only.

Carry Flag

Carry Flag is a flag set when:

a) two unsigned numbers were added and the result is larger than “capacity” of register where it is saved.

Ex: we wanna add two 8 bit numbers and save result in 8 bit register. In your example: 255 + 9 = 264 which is more that 8 bit register can store. So the value “8” will be saved there (264 & 255 = 8) and CF flag will be set.

b) two unsigned numbers were subtracted and we subtracted the bigger one from the smaller one.

Ex: 1-2 will give you 255 in result and CF flag will be set.

Auxiliary Flag is used as CF but when working with BCD. So AF will be set when we have overflow or underflow on in BCD calculations. For example: considering 8 bit ALU unit, Auxiliary flag is set when there is carry from 3rd bit to 4th bit i.e. carry from lower nibble to higher nibble. (Wiki link)

Overflow Flag is used as CF but when we work on signed numbers.

Ex we wanna add two 8 bit signed numbers: 127 + 2. the result is 129 but it is too much for 8bit signed number, so OF will be set.

Similar when the result is too small like -128 - 1 = -129 which is out of scope for 8 bit signed numbers.

register signed & unsigned

Positive or negative
The CPU does not know (or care) whether a number is positive or negative. The only person who knows is you. If you test SF and OF, then you treat the number as signed. If you only test CF then you treat the number as unsigned.
In order to help you the processor keeps track of all flags at once. You decide which flags to test and by doing so, you decide how to interpret the numbers.

register multiply

The computer makes use of binary multiplication(AND), followed by bit shift (in the direction in which the multiplication proceeds), followed by binary addition(OR).

1100100
0110111
=======
0000000
-1100100
--1100100
---0000000
----1100100
-----1100100
------1100100
==============
1010101111100

100 = 1.1001 * 2^6
55  = 1.10111* 2^5
100 * 55 -> 1.1001 * 1.10111 * 2^(6+5)

for more:

How computer multiplies 2 numbers?
And:
Binary multiplier - Wikipedia

Memory and Addressing Modes

声明静态代码区域

DB, DW, and DD can be used to declare one, two, and four byte data locations,

# 基本例子
.DATA		
var	DB 64  	; Declare a byte, referred to as location var, containing the value 64.
var2	DB ?	; Declare an uninitialized byte, referred to as location var2.
DB 10	; Declare a byte with no label, containing the value 10. Its location is var2 + 1.
X	DW ?	; Declare a 2-byte uninitialized value, referred to as location X.
Y	DD 30000    	; Declare a 4-byte value, referred to as location Y, initialized to 30000.

数组的声明，The DUP directive tells the assembler to duplicate an expression a given number of times. For example, 4 DUP(2) is equivalent to 2, 2, 2, 2.


Z	DD 1, 2, 3	; Declare three 4-byte values, initialized to 1, 2, and 3. The value of location Z + 8 will be 3.
bytes  	DB 10 DUP(?)	; Declare 10 uninitialized bytes starting at location bytes.
arr	DD 100 DUP(0)    	; Declare 100 4-byte words starting at location arr, all initialized to 0
str	DB 'hello',0	; Declare 6 bytes starting at the address str, initialized to the ASCII character values for hello and the null (0) byte.

寻址

32位X86机器寻址支持

最多支持32位寄存器和32位有符号常数相加
其中一个寄存器可以再乘上 2，4，8

# right
mov eax, [ebx]	; Move the 4 bytes in memory at the address contained in EBX into EAX
mov [var], ebx	; Move the contents of EBX into the 4 bytes at memory address var. (Note, var is a 32-bit constant).
mov eax, [esi-4]	; Move 4 bytes at memory address ESI + (-4) into EAX
mov [esi+eax], cl	; Move the contents of CL into the byte at address ESI+EAX
mov edx, [esi+4*ebx]    	; Move the 4 bytes of data at address ESI+4*EBX into EDX

# wrong and reason
mov eax, [ebx-ecx]	; Can only add register values
mov [eax+esi+edi], ebx    	; At most 2 registers in address computation

指定存储在地址的数据大小

1
2
3

mov BYTE PTR [ebx], 2	; Move 2 into the single byte at the address stored in EBX.
mov WORD PTR [ebx], 2	; Move the 16-bit integer representation of 2 into the 2 bytes starting at the address in EBX.
mov DWORD PTR [ebx], 2    	; Move the 32-bit integer representation of 2 into the 4 bytes starting at the address in EBX.

汇编寄存器顺序，作用方向

这和汇编器语法有关：

X86 instructions

For instructions with two operands, the first (lefthand) operand is the source operand, and the second (righthand) operand is the destination operand (that is, source->destination).

1 2	mov eax, ebx — copy the value in ebx into eax add eax, 10 — EAX ← EAX + 10

AT&T syntax

AT&T Syntax is an assembly syntax used in UNIX environments, that originates from AT&T Bell Labs. It is descended from the MIPS assembly syntax. (AT&T, American Telephone & Telegraph)

AT&T Syntax is an assembly syntax used mostly in UNIX environments or by tools like gcc that originated in that environment.

语法特点：https://stackoverflow.com/tags/att/info

需要注意的：

Operands are in destination-last order
Register names are prefixed with %, and immediates are prefixed with $
1. sub $24, %rsp reserves 24 bytes on the stack.
Operand-size is indicated with a b/w/l/q suffix on the mnemonic
1. addb $1, byte_table(%rdi) increment a byte in a static table.
2. The mov suffix (b, w, l, or q) indicates how many bytes are being copied (1, 2, 4, or 8 respectively)
imul $13, 16(%rdi, %rcx, 4), %eax 32-bit load from rdi + rcx<<2 + 16, multiply that by 13, put the result in %eax. Intel imul eax, [16 + rdi + rcx*4], 13.
movswl (%rdi), %eax sign-extending load from word (w) to dword (l). Intel movsx eax, word [rdi].

Intel syntax (used in Intel/AMD manuals).

The Intel assembler(icc,icpc我猜) uses the opposite order (destination<-source) for operands.

语法特点： https://stackoverflow.com/tags/intel-syntax/info

RISC-V

1
2
3

beq rs1, rs2, Label #RISC-V
SW rs2, imm(rs1)  # Mem[rs1+imm]=rs2 ,汇编将访存放在最后
add rd, rs1, rs2  # rd = rs1 + rs2

反汇编器

但是这个语法不是很重要，因为decompiler有选项控制语法

objdump has -Mintel flag, gdb has set disassembly-flavor intel option.

gcc -masm=intel -S or objdump -drwC -Mintel.

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

https://www.cs.virginia.edu/~evans/cs216/guides/x86.html

Posted 2022-08-15Updated 2025-01-30Architecture17 minutes read (About 2592 words)

Inline Assembly

GCC内联汇编

1	__asm__　__volatile__("Instruction List" : Output : Input : Clobber/Modify);

__asm__或asm 用来声明一个内联汇编表达式，所以任何一个内联汇编表达式都是以它开头的，是必不可少的。
__volatile__或volatile 是可选的。如果用了它，则是向GCC 声明不允许对该内联汇编优化，否则当使用了优化选项(-O)进行编译时，GCC 将会根据自己的判断决定是否将这个内联汇编表达式中的指令优化掉。
Instruction List 是汇编指令序列。它可以是空的，比如：__asm__ __volatile__(""); 或 __asm__ ("");都是完全合法的内联汇编表达式，只不过这两条语句没有什么意义。
1. 但并非所有Instruction List 为空的内联汇编表达式都是没有意义的，比如：__asm__ ("":::"memory");就非常有意义，它向GCC 声明：“内存作了改动”，GCC 在编译的时候，会将此因素考虑进去。
2. 当在”Instruction List”中有多条指令的时候，需要用分号（；）或换行符（\n）将它们分开。
3. 指令中的操作数可以使用占位符引用C语言变量,操作数占位符最多10个,名称如下:%0,%1,…,%9。指令中使用占位符表示的操作数,总被视为long型(4个字节),
  1. 但对其施加的操作根据指令可以是字或者字节,当把操作数当作字或者字节使用时,默认为低字或者低字节。
  2. 对字节操作可以显式的指明是低字节还是次字节。方法是在%和序号之间插入一个字母,”b”代表低字节,”h”代表高字节,例如:%h1。
Output/Input
1. 格式为形如"constraint"(variable)的列表（逗号分隔)。按照出现的顺序分别与指令操作数”%0”，”%1”对应
2. 每个输出操作数的限定字符串必须包含”=”表示他是一个输出操作数。例子"=r" (value)
Clobber/Modify(由逗号格开的字符串组成)
1. 在Input/Output操作表达式所指定的寄存器，或当你为一些Input/Output操作表达式使用”r”约束，让GCC为你选择一个寄存器时，GCC知道这些寄存器是被修改的，你根本不需要在Clobber/Modify域再声明它们。
2. 但是对于”Instruction List”中的临时寄存器，需要在Clobber/Modify域声明这些寄存器或内存，让GCC知道修改了他们
  1. 例子:__asm__ ("mov R0, #0x34" : : : "R0");寄存器R0出现在”Instruction List中”，并且被mov指令修改，但却未被任何Input/Output操作表达式指定，所以你需要在Clobber/Modify域指定”R0”，以让GCC知道这一点。
3. Clobber/Modify域存在”memory”，那么GCC会保证在此内联汇编之前，如果某个内存的内容被装入了寄存器，那么在这个内联汇编之后，如果需要使用这个内存处的内容，就会直接到这个内存处重新读取，而不是使用被存放在寄存器中的拷贝。因为这个时候寄存器中的拷贝已经很可能和内存处的内容不一致了。

输入输出与指令的对应关系

寄存器约束符Operation Constraint

每一个Input和Output表达式都必须指定自己的操作约束Operation Constraint，这里将讨论在80386平台上所可能使用的操作约束。

当前的输入或输出需要借助一个寄存器时，需要为其指定一个寄存器约束，可以直接指定一个寄存器的名字。

常用的寄存器约束的缩写

约束	意义
r	表示使用一个通用寄存器，由 GCC 在%eax/%ax/%al,%ebx/%bx/%bl,%ecx/%cx/%cl,%edx/%dx/%dl中选取一个GCC认为合适的。
g	表示使用任意一个寄存器，由GCC在所有的可以使用的寄存器中选取一个GCC认为合适的。
q	表示使用一个通用寄存器，和约束r的意义相同。
a	表示使用%eax/%ax/%al
b	表示使用%ebx/%bx/%bl
c	表示使用%ecx/%cx/%cl
d	表示使用%edx/%dx/%dl
D	表示使用%edi/%di
S	表示使用%esi/%si
f	表示使用浮点寄存器
t	表示使用第一个浮点寄存器
u	表示使用第二个浮点寄存器

分类	限定符	描述
通用寄存器	“a”	将输入变量放入eax 这里有一个问题:假设eax已经被使用,那怎么办?其实很简单:因为GCC 知道eax 已经被使用,它在这段汇编代码的起始处插入一条语句pushl %eax,将eax 内容保存到堆栈,然后在这段代码结束处再增加一条语句popl %eax,恢复eax的内容
	“b”	将输入变量放入ebx
	“c”	将输入变量放入ecx
	“d”	将输入变量放入edx
	“s”	将输入变量放入esi
	“d”	将输入变量放入edi
	“q”	将输入变量放入eax,ebx,ecx,edx中的一个
	“r”	将输入变量放入通用寄存器,也就是eax,ebx,ecx,edx,esi,edi中的一个
	“A”	把eax和edx合成一个64 位的寄存器(use long longs)
内存	“m”	内存变量
	“o”	操作数为内存变量,但是其寻址方式是偏移量类型, 也即是基址寻址,或者是基址加变址寻址
	“V”	操作数为内存变量,但寻址方式不是偏移量类型
	“ “	操作数为内存变量,但寻址方式为自动增量
	“p”	操作数是一个合法的内存地址(指针)
寄存器或内存	“g”	将输入变量放入eax,ebx,ecx,edx中的一个或者作为内存变量
	“X”	操作数可以是任何类型
立即数	“I”	0-31之间的立即数(用于32位移位指令)
	“J”	0-63之间的立即数(用于64位移位指令)
	“N”	0-255之间的立即数(用于out指令)
	“i”	立即数
	“n”	立即数,有些系统不支持除字以外的立即数, 这些系统应该使用”n”而不是”i”
匹配	“ 0 “,“1” …“9”	, 表示用它限制的操作数与某个指定的操作数匹配,也即该操作数就是指定的那个操作数,例如”0”去描述”%1”操作数,那么”%1”引用的其实就是”%0”操作数,注意作为限定符字母的0-9 与指令中的”%0”-“%9”的区别,前者描述操作数,后者代表操作数。
	&;	该输出操作数不能使用过和输入操作数相同的寄存器
操作数类型	“=”	操作数在指令中是只写的(输出操作数)
	“+”	操作数在指令中是读写类型的(输入输出操作数)
浮点数	“f”	浮点寄存器
	“t”	第一个浮点寄存器
	“u”	第二个浮点寄存器
	“G”	标准的80387浮点常数
	%	该操作数可以和下一个操作数交换位置.例如addl的两个操作数可以交换顺序 (当然两个操作数都不能是立即数)
	#	部分注释,从该字符到其后的逗号之间所有字母被忽略
	*	表示如果选用寄存器,则其后的字母被忽略

内存约束

如果一个Input/Output 操作表达式的C/C++表达式表现为一个内存地址，不想借助于任何寄存器，则可以使用内存约束。比如：

1 2	__asm__("lidt%0":"=m"(__idt_addr)); __asm__("lidt%0"::"m"(__idt_addr));

修饰符	输入/输出	意义
=	O	表示此Output操作表达式是Write-Only的。

           | O               |表示此Output操作表达式是Read-Write的。

& | O |表示此Output操作表达式独占为其指定的寄存器。
% | I |表示此Input 操作表达式中的C/C++表达式可以和下一个Input操作表达式中的C/C++表达式互换

例子

Static __inline__ void __set_bit(int nr, volatile void * addr)
{
         __asm__(
                         "btsl %1,%0"
                         :"=m" (ADDR)
                         :"Ir" (nr));
}

第一个占位符%0与C 语言变量ADDR对应,第二个占位符%1与C语言变量nr对应。因此上面的汇编语句代码与下面的伪代码等价:btsl nr, ADDR

Clobber/Modify域存在”memory”的其他影响

使用”memory”是向GCC声明内存发生了变化，而内存发生变化带来的影响并不止这一点。

例如：

int main(int __argc, char* __argv[]) 
{ 
    int* __p = (int*)__argc; 
    (*__p) = 9999; 
    __asm__("":::"memory"); 
    if((*__p) == 9999) 
        return 5; 
    return (*__p); 
}

本例中，如果没有那条内联汇编语句，那个if语句的判断条件就完全是一句废话。GCC在优化时会意识到这一点，而直接只生成return 5的汇编代码，而不会再生成if语句的相关代码，而不会生成return (*__p)的相关代码。

但你加上了这条内联汇编语句，它除了声明内存变化之外，什么都没有做。

但GCC此时就不能简单的认为它不需要判断都知道 (*__p)一定与9999相等，它只有老老实实生成这条if语句的汇编代码，一起相关的两个return语句相关代码。

另外在linux内核中内存屏障也是基于它实现的include/asm/system.h中

1	# define barrier() _asm__volatile_("": : :"memory")

主要是保证程序的执行遵循顺序一致性。呵呵，有的时候你写代码的顺序，不一定是终执行的顺序，这个是处理器有关的。

Linux 源码例子

static inline char * strcpy(char * dest, const char *src)
{
	char *xdest = dest;
	__asm__ __volatile__
	("1: \tmoeb %1@+, %0@+\n\t"   "jne 1b"  //这个冒号不是分隔符
	: "=a" (dest) , "=a" (stc)
	: "0"(dest), "1" (src)
	 : "memory");
	return xdest;
}

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

https://blog.csdn.net/yi412/article/details/80846083

https://www.cnblogs.com/elnino/p/4313340.html

Posted 2022-08-13Updated 2025-01-30Architecture15 minutes read (About 2203 words)

Vtune Assembly Analysis

超算机器用vtune的命令行文件分析

首先找到vtune程序

> module load intel/2022.1                                    
> which icc                                                        
/public1/soft/oneAPI/2022.1/compiler/latest/linux/bin/intel64/icc                          
> cd /public1/soft/oneAPI/2022.1  
> find . -executable -type f -name "*vtune*"
./vtune/2022.0.0/bin64/vtune-worker-crash-reporter
./vtune/2022.0.0/bin64/vtune-gui.desktop
./vtune/2022.0.0/bin64/vtune-gui
./vtune/2022.0.0/bin64/vtune-agent
./vtune/2022.0.0/bin64/vtune-self-checker.sh
./vtune/2022.0.0/bin64/vtune-backend
./vtune/2022.0.0/bin64/vtune-worker
./vtune/2022.0.0/bin64/vtune
./vtune/2022.0.0/bin64/vtune-set-perf-caps.sh

vtune-gui获取可执行命令

/opt/intel/oneapi/vtune/2021.1.1/bin64/vtune -collect hotspots -knob enable-stack-collection=true -knob stack-size=4096 -data-limit=1024000 -app-working-dir /home/shaojiemike/github/IPCC2022first/build/bin -- /home/shaojiemike/github/IPCC2022first/build/bin/pivot /home/shaojiemike/github/IPCC2022first/src/uniformvector-2dim-5h.txt

编写sbatch_vtune.sh

#!/bin/bash
#SBATCH -o ./slurmlog/job_%j_rank%t_%N_%n.out
#SBATCH -p IPCC
#SBATCH -t 15:00
#SBATCH --nodes=1
#SBATCH --exclude=
#SBATCH --cpus-per-task=64
#SBATCH --mail-type=FAIL
#SBATCH [email protected]

source /public1/soft/modules/module.sh
module purge

module load intel/2022.1

logname=vtune
export OMP_PROC_BIND=close; export OMP_PLACES=cores
# ./pivot |tee ./log/$logname
/public1/soft/oneAPI/2022.1/vtune/2022.0.0/bin64/vtune -collect hotspots -knob enable-stack-collection=true -knob stack-size=4096 -data-limit=1024000 -app-working-dir /public1/home/ipcc22_0029/shaojiemike/slurm -- /public1/home/ipcc22_0029/shaojiemike/slurm/pivot /public1/home/ipcc22_0029/shaojiemike/slurm/uniformvector-2dim-5h.txt |tee ./log/$logname

log文件如下，但是将生成的trace文件r000hs导入识别不了AMD

> cat log/vtune
dim = 2, n = 500, k = 2
Using time : 452.232000 ms
max : 143 351 58880.823709
min : 83 226 21884.924801
Elapsed Time: 0.486s
   CPU Time: 3.540s
      Effective Time: 3.540s
      Spin Time: 0s
      Overhead Time: 0s
   Total Thread Count: 8
   Paused Time: 0s

Top Hotspots
Function         Module  CPU Time  % of CPU Time(%)
---------------  ------  --------  ----------------
SumDistance      pivot     0.940s             26.6%
_mm256_add_pd    pivot     0.540s             15.3%
_mm256_and_pd    pivot     0.320s              9.0%
_mm256_loadu_pd  pivot     0.300s              8.5%
Combination      pivot     0.250s              7.1%
[Others]         N/A       1.190s             33.6%

汇编

1 2	objdump -Sd ../build/bin/pivot > pivot1.s gcc -S -O3 -fverbose-asm ../src/pivot.c -o pivot_O1.s

汇编分析技巧

https://blog.csdn.net/thisinnocence/article/details/80767776

如何设置GNU和Intel汇编语法

vtune汇编实例

(没有开O3，默认值)

偏移 -64 是k

-50 是ki

CDQE复制EAX寄存器双字的符号位(bit 31)到RAX的高32位。

这里的movsdq的q在intel里的64位，相当于使用了128位的寄存器，做了64位的事情，并没有自动向量化。

生成带代码注释的O3汇编代码

如果想把 C 语言变量的名称作为汇编语言语句中的注释，可以加上 -fverbose-asm 选项：

1	gcc -S -O3 -fverbose-asm ../src/pivot.c -o pivot_O1.s

.L15:
# ../src/pivot.c:38:                 double dis = fabs(rebuiltCoordFirst - rebuiltCoordSecond);
   movsd (%rax), %xmm0 # MEM[base: _15, offset: 0B], MEM[base: _15, offset: 0B]
   subsd (%rax,%rdx,8), %xmm0 # MEM[base: _15, index: _21, step: 8, offset: 0B], tmp226
   addq $8, %rax #, ivtmp.66
# ../src/pivot.c:38:                 double dis = fabs(rebuiltCoordFirst - rebuiltCoordSecond);
   andpd %xmm2, %xmm0 # tmp235, dis
   maxsd %xmm1, %xmm0 # chebyshev, dis
   movapd %xmm0, %xmm1 # dis, chebyshev
# ../src/pivot.c:35:             for(ki=0; ki<k; ki++){
   cmpq %rax, %rcx # ivtmp.66, _115
   jne .L15 #,
.L19:
# ../src/pivot.c:32:         for(j=i+1; j<n; j++){
   addl $1, %esi #, j
# ../src/pivot.c:41:             chebyshevSum += chebyshev;
   addsd %xmm1, %xmm4 # chebyshev, <retval>
   addl %r14d, %edi # k, ivtmp.75
# ../src/pivot.c:32:         for(j=i+1; j<n; j++){
   cmpl %esi, %r15d # j, n
   jg .L13 #,
# ../src/pivot.c:32:         for(j=i+1; j<n; j++){
   addl $1, %r10d #, j
# ../src/pivot.c:32:         for(j=i+1; j<n; j++){
   cmpl %r10d, %r15d # j, n
   jne .L16 #,

vtune O3汇编分析

原本以为O3是看不了原代码与汇编的对应关系的，但实际可以-g -O3 是不冲突的。

指令的精简合并

访存指令的合并
1. 将r9 mov到 rax里，
  1. 又leaq (%r12,%r8,8), %r9。其中r12是rebuiltCoord,所以r8原本存储的是[i*k]的值
  2. rax是rebuiltCoord+[i*k]的地址，由于和i有关，index的计算在外层就计算好了。
2. rdx的值减去r8存储在rdx里
  1. rdx原本存储的是[j*k]的地址
  2. r8原本存储的是[i*k]的值
  3. rdx之后存储的是[(j-i)*k]的地址
3. data16 nop是为了对齐插入的nop
1. 值得注意的是取最大值操作，这里变成了maxsd
2. xmm0是缓存值
3. xmm1是chebyshev
4. xmm2是fabs的掩码
5. xmm4是chebyshevSum

自动循环展开形成流水

1. r14d存储k的值，所以edi存储j*k值
2. Block22后的指令验证了rdx原本存储的是[j*k]的地址
1. 最外层循环
2. 因为r14d存储k的值，r8和r11d存储了i*k的值

从汇编看不出有该操作，需要开启编译选项

自动向量化

从汇编看不出有该操作，需要开启编译选项

自动数据预取

从汇编看不出有该操作，需要开启编译选项

问题

为什么求和耗时这么多

添加向量化选项

gcc

Baseline

-mavx2 -march=core-avx2

阅读文档, 虽然全部变成了vmov，vadd的操作，但是实际还是64位的工作。
1. 这点add rax, 0x8没有变成add rax, 0x16可以体现
2. 但是avx2不是256位的向量化吗？用的还是xmm0这类的寄存器。

VADDSD (VEX.128 encoded version)
DEST[63:0] := SRC1[63:0] + SRC2[63:0]
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0

ADDSD (128-bit Legacy SSE version)
DEST[63:0] := DEST[63:0] + SRC[63:0]
DEST[MAXVL-1:64] (Unmodified)

-march=skylake-avx512

汇编代码表面没变，但是快了10s(49s - 39s)

下图是avx2的

下图是avx512的

猜测注意原因是

nop指令导致代码没对齐
不太可能和红框里的代码顺序有关

添加数据预取选项

判断机器是否支持

1 2	lscpu\|grep pref 3dnowprefetch //3DNow prefetch instructions

应该是支持的

汇编分析

虽然时间基本没变，主要是对主体循环没有进行预取操作，对其余循环(热点占比少的)有重新调整。如下图增加了预取指令

添加循环展开选项

变慢很多(39s -> 55s)

-funroll-loops

汇编实现，在最内层循环根据k的值直接跳转到对应的展开块，这里k是2。

默认是展开了8层，这应该和xmm寄存器总数有关

分析原因

循环展开的核心是形成计算和访存的流水
1. 不是简单的少几个跳转指令
2. 这种简单堆叠循环核心的循环展开，并不能形成流水。所以时间不会减少
但是完全无法解释循环控制的时间增加
2. 比如图中cmp的次数应该减半了，时间反而翻倍了

手动分块

由于数据L1能全部存储下，没有提升

手动数据预取

并没有形成想象中预取的流水。每512位取，还有重复。

每次预取一个Cache Line，后面两条指令预取的数据还有重复部分(导致时间增加 39s->61s)

想预取全部，循环每次预取了512位=64字节

手动向量化

avx2

（能便于编译器自动展开来使用所有的向量寄存器,avx2

39s -> 10s -> 8.4s 编译器

for(i=0; i<n-blockSize; i+=blockSize){
   for(j=i+blockSize; j<n-blockSize; j+=blockSize){
      for(ii=i; ii<i+blockSize; ii++){
            __m256d vi1 = _mm256_broadcast_sd(&rebuiltCoord[0*n+ii]);
            __m256d vi2 = _mm256_broadcast_sd(&rebuiltCoord[1*n+ii]);
               
            __m256d vj11 = _mm256_loadu_pd(&rebuiltCoord[0*n+j]); //读取4个点
            __m256d vj12 = _mm256_loadu_pd(&rebuiltCoord[1*n+j]);

            __m256d vj21 = _mm256_loadu_pd(&rebuiltCoord[0*n+j+4]); //读取4个点
            __m256d vj22 = _mm256_loadu_pd(&rebuiltCoord[1*n+j+4]);

            vj11 = _mm256_and_pd(_mm256_sub_pd(vi1,vj11), vDP_SIGN_Mask);
            vj12 = _mm256_and_pd(_mm256_sub_pd(vi2,vj12), vDP_SIGN_Mask);

            vj21 = _mm256_and_pd(_mm256_sub_pd(vi1,vj21), vDP_SIGN_Mask);
            vj22 = _mm256_and_pd(_mm256_sub_pd(vi2,vj22), vDP_SIGN_Mask);

            __m256d tmp = _mm256_add_pd(_mm256_max_pd(vj11,vj12), _mm256_max_pd(vj21,vj22));
            _mm256_storeu_pd(vchebyshev1, tmp);

            chebyshevSum += vchebyshev1[0] + vchebyshev1[1] + vchebyshev1[2] + vchebyshev1[3];

            // for(jj=j; jj<j+blockSize; jj++){
            //     double chebyshev = 0;
            //     int ki;
            //     for(ki=0; ki<k; ki++){
            //         double dis = fabs(rebuiltCoord[ki*n + ii] - rebuiltCoord[ki*n + jj]);
            //         chebyshev = dis>chebyshev ? dis : chebyshev;
            //     }
            //     chebyshevSum += chebyshev;
            // }
      }
   }
}

明明展开了一次，但是编译器继续展开了，总共8次。用满了YMM 16个向量寄存器。

下图是avx512，都出现寄存器ymm26了。

vhaddpd是水平的向量内加法指令

avx512

当在avx512的情况下展开4次，形成了相当工整的代码。

向量用到了寄存器ymm18，估计只能展开到6次了。
1. avx2 应该寄存器不够

最后求和的处理，编译器首先识别出了，不需要实际store。还是在寄存器层面完成了计算。并且通过三次add和两次数据移动指令自动实现了二叉树型求和。

avx2 寄存器不够会出现下面的情况。

avx求和的更快速归约

假如硬件存在四个一起归约的就好了，但是对于底层元件可能过于复杂了。

1 2	__m256d _mm256_hadd_pd (__m256d a, __m256d b); VEXTRACTF128 __m128d _mm256_extractf128_pd (__m256d a, int offset);

如果可以实现会节约一次数据移动和一次数据add。没有分析两种情况的寄存器依赖。可能依赖长度是一样的，导致优化后时间反而增加一点。

对于int还有这种实现

将横向归约全部提取到外面

并且将j的循环展开变成i的循环展开

手动向量化+手动循环展开？

支持的理由：打破了循环间的壁垒，编译器会识别出无效中间变量，在for的jump指令划出的基本块内指令会乱序执行，并通过寄存器重命名来形成最密集的计算访存流水。

不支持的理由：如果编译器为了形成某一指令的流水，占用了太多资源。导致需要缓存其他结果（比如，向量寄存器不够，反而需要额外的指令来写回，和产生延迟。

理想的平衡: 在不会达到资源瓶颈的情况下展开。

支持的分析例子

手动展开后，识别出来了连续的访存应该在一起进行，并自动调度。将+1的偏移编译器提前计算了。

如果写成macro define,可以发现编译器自动重排了汇编。

不支持的分析例子

avx2可以看出有写回的操作，把值从内存读出来压入栈中。

寄存器足够时没有这种问题

寻找理想的展开次数

由于不同代码对向量寄存器的使用次数不同，不同机器的向量寄存器个数和其他资源数不同。汇编也难以分析。在写好单次循环之后，最佳的展开次数需要手动测量。如下图，6次应该是在不会达到资源瓶颈的情况下展开来获得最大流水。

for(j=beginJ; j<n-jBlockSize; j+=jBlockSize){  /
//展开jBlockSize次
}
for(jj=j; jj<n; jj++){  //j初始值继承自上面的循环
//正常单次
}

由于基本块内乱序执行，代码的顺序也不重要。
加上寄存器重命名来形成流水的存在，寄存器名也不重要。当然数据依赖还是要正确。

对于两层循环的双层手动展开

思路：外层多load数据到寄存器，但是运行的任何时候也不要超过寄存器数量的上限（特别注意在内层循环运行一遍到末尾时）。

左图外层load了8个寄存器，但是右边只有2个。

特别注意在内层循环运行一遍到末尾时：

如图，黄框就有16个了。

注意load的速度也有区别

所以内层调用次数多，尽量用快的

1 2	_mm256_loadu_ps >> _mm256_broadcast_ss > _mm256_set_epi16 0.04 >> 0.5

vsub  vmax    ps 0.02      Latency 4
vand                       Latency 1

vadd              ps 0.80              Throughput 0.5
vhadd                      Latency 7
vcvtps2pd            2.00  Latency 7
vextractf128         0.50  Latency 3

|指令|精度|时间(吞吐延迟和实际依赖导致)|Latency|Throughput
|-|-|-|-|-|-|
|_mm256_loadu_ps /_mm256_broadcast_ss|||7|0.5
|vsub vmax | ps| 0.02 | 4|0.5
vand ||0.02| 1|0.33
vadd |ps |0.80 |4| 0.5
vhadd ||0.8| 7|2
vcvtps2pd || 2.00 | 7|1
vextractf128 || 0.50 | 3|1

向量化double变单精度没有提升

17条avx计算 5load 2cvt 2extract

单位时间	avx计算	load	cvt	extract
	2.33	3.68	12.875	4.1

可见类型转换相当耗费时间，最好在循环外，精度不够，每几次循环做一次转换。

GCC编译器优化

-march=skylake-avx512是一条指令

-mavx2 是两条指令

1 2	vmovupd xmm7, xmmword ptr [rdx+rsi8] vinsertf128 ymm1, ymm7, xmmword ptr [rdx+rsi8+0x10], 0x1

原因是不对齐的访存在老架构上可能更快

O3对于核心已经向量化的代码还有加速吗？

将IPCC初赛的代码去掉O3发现还是慢了10倍。

为什么连汇编函数调用也慢这么多呢？

这个不开O3的编译器所属有点弱智了，一条指令的两个操作数竟然在rbp的栈里存来存去的。

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

无

Posted 2021-12-08Updated 2025-01-30Architecture8 minutes read (About 1237 words)

Assembly Arm

关于X86 与 arm的寄存器的区别写在了arm那篇下

arm

https://developer.arm.com/documentation/dui0068/b/CIHEDHIF

Arm 的四种寻址方式

ldr & str

Aarch64

Arm A64 Instruction Set Architecture
https://modexp.wordpress.com/2018/10/30/arm64-assembly/

直接阅读文档 Arm® A64 Instruction Set Architecture
Armv8, for Armv8-A architecture profile最有效

指令后缀说明

read from ARMv8 Instruction Set Overview 4.2 Instruction Mnemonics

The container is one of:

The subtype is one of:

combine

1	<name>{<subtype>} <container>

注意后缀的作用主体

指令速查

官网查找指令: https://developer.arm.com/architectures/instruction-sets/intrinsics

https://armconverter.com/?disasm&code=0786b04e

SIMD/vector

几乎每个指令都可以同时作用在不同寄存器和vector或者scalar上。比如add指令，并没有像X86一样设计vadd或者addps等单独
的指令，如果一定要区分，只能从寄存器是不是vector下手。

根据这个图，确实是有做向量操作的add，FADD是float-add的意思，ADDP是将相邻的寄存器相加放入目的寄存器的意思。不影响是标量scalar还是向量vector的操作。addv是将一个向量寄存器里的每个分量归约求和的意思，确实只能用在向量指令。

由于需要满足64或者128位只有下面几种情况

需要额外注意的是另外一种写法，位操作指令，不在乎寄存器形状shape

1
2
3

# 128位and
and %q3 %q7 -> %q3
and v3.16b, v3.16b, v7.16b

是同一个意思，但是不支持and v3.8h, v3.8h, v7.8h

1
2
3

DUP //Duplicate general-purpose register to vector.or Duplicate vector element to vector or scalar.
addp //Add Pair of elements (scalar). This instruction adds two vector elements in the source SIMD&FP register and writes 
//the scalar result into the destination SIMD&FP register.

calculate

add
addp //Add Pair of elements (scalar). This instruction adds two vector elements in the source SIMD&FP register and writes the scalar result into the destination SIMD&FP register.
adds // Add , setting ﬂags.
eor // Bitwise Exclusive OR
orr // Move (register) copies the value in a source register to the destination register. Alias of ORR.

Address

1	ADRP // Form PC-relative address to 4KB page.

Branch

b.cond // branch condition eg. b.ne
bl //Branch with Link branches to a PC-relative oﬀset, setting the register X30 to PC+4 
//带链接的跳转。 首先将当前指令的下一条指令地址保存在LR寄存器，然后跳转的lable。通常用于调用子程序，可通过在子程序的尾部添加mov  pc, lr 返回。
blr //Branch with Link to Register calls a subroutine at an address in a register, setting register X30 to PC+4.
cbnz //Compare and Branch on Nonzero compares the value in a register with zero, and conditionally branches to a label at a PC-relative offset if the comparison is not equal. It provides a hint that this is not a subroutine call or return. This instruction does not affect the condition flags.
tbnz // test and branch not zero
ret //Return from subroutine, branches unconditionally to an address in a register, with a hint that this is a subroutine return.

Load/Store

ldrb // b是byte的意思
ldar // LDAR Load-Acquire(申请锁) Register 
STLR //Store-Release(释放锁) Register 
ldp // load pair(two) register
stp // store pair(two) register
ldr(b/h/sb/sh/sw) // load register , sb/sh/sw is signed byte/half/word
str // store register
ldur // load register (unscaled) unscaled means that in the machine-code, the offset will not be encoded with a scaled offset like ldr uses. or offset is minus.
prfm // prefetch memory

Control/conditional

ccmp // comdition compare
CMEQ // Compare bitwise Equal (vector). This instruction compares each vector element from the frst source SIMD&FP register with the corresponding vector element from the second source SIMD&FP register
CSEL // If the condition is true, Conditional Select writes the value of the frst source register to the destination register. If the condition is false, it writes the value of the second source register to the destination register.
CSINC //Conditional Select Increment returns
CSINV //Conditional Select Invert returns
CSNEG //Conditional Select Negation returns

Logic&Move

ASRV //Arithmetic Shift Right Variable
lsl //logic shift left
orr //bitwise(逐位) or
eor //Bitwise Exclusive OR
TST/ANDS //Test bits (immediate), setting the condition flags and discarding the result. Alias of ANDS.
MOVZ //Move wide with zero moves an optionally-shifted 16-bit immediate value to a register
UBFM // Unigned Bitfield Move. This instruction is used by the aliases LSL (immediate), LSR (immediate), UBFIZ, UBFX, UXTB, and UXTH
BFM //Bitfield Move
BIC (shifted register) //Bitwise Bit Clear
CLZ // Count Leading Zeros counts the number of binary zero bits before the frst binary one bit in the value of the source register, and writes the result to the destination register.
REV, REV16, REVSH, and RBIT // below
REV //Reverse byte order in a word.
REV16 //Reverse byte order in each halfword independently.
REVSH //Reverse byte order in the bottom halfword, and sign extend to 32 bits.
RBIT //Reverse the bit order in a 32-bit word.

Modifier

1	uxtb // zero extend byte 无符号（Unsigned）扩展一个字节（Byte）到 32位

system

1 2	dmb //data memory barrier SVC //The SVC instruction causes an exception. This means that the processor mode changes to Supervisor,

ARM no push/pop

1 2	PUSH {r3} POP {r3}

are aliases for

1 2	str r3, [sp, #-4]! ldr r3, [sp], #4

需要进一步的研究学习

暂无

遇到的问题

暂无

开题缘由、总结、反思、吐槽~~

参考文献

https://www.cs.virginia.edu/~evans/cs216/guides/x86.html

https://blog.csdn.net/gaojinshan/article/details/11534569