I found the process using ps aux | grep process_name; it is in the Sl+ state, but its CPU usage is not zero.
watch "ps aux | grep 3496617" always shows the same CPU usage percentage, which is confusing because htop always shows a fluctuating value, and pidstat -p 3516617 shows CPU% below 100%.
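The discrepancy is expected: per the ps(1) man page, %CPU is the cputime/realtime ratio over the whole process lifetime, while htop and pidstat sample a short interval. A minimal sketch reading the same counters from /proc (using the current shell's PID purely for illustration):

```shell
# ps computes %CPU cumulatively: (utime+stime since process start) divided by
# elapsed wall time -- so `watch ps` shows a near-constant value, while htop,
# which samples a short interval, fluctuates.
pid=$$
clk=$(getconf CLK_TCK)                                   # kernel clock ticks per second
cpu_ticks=$(awk '{print $14 + $15}' /proc/"$pid"/stat)   # utime + stime, in ticks
echo "cumulative CPU seconds: $(( cpu_ticks / clk ))"
ps -o pid,%cpu,cputime,etime -p "$pid"                   # %cpu here is the lifetime average
```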
NUMA: NUMA (non-uniform memory access) is a method of configuring a cluster of microprocessors in a multiprocessing system so that they can share memory locally, improving performance and the system's ability to be expanded. NUMA is used in symmetric multiprocessing (SMP) systems.
Remote Direct Memory Access (RDMA) is an extension of the Direct Memory Access (DMA) technology, which is the ability to access host memory directly without CPU intervention. RDMA allows for accessing memory data from one host to another.
Remote Direct Memory Access (RDMA) is a direct memory access from the memory of one computer into that of another without involving either computer's operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters.
Most high performance computing clusters are nowadays composed of large multicore machines that expose Non-Uniform Memory Access (NUMA), and they are interconnected using modern communication paradigms, such as Remote Direct Memory Access (RDMA).
Architecture types
SISD: single instruction stream, single data stream (the von Neumann machine)
SIMD: single instruction stream, multiple data streams
MISD: multiple instruction streams, single data stream (does not exist in practice)
MIMD: multiple instruction streams, multiple data streams
SIMD-SM
The PRAM (Parallel Random Access Machine) model is a shared-memory model of SIMD (single instruction stream, multiple data streams) parallel machines.
The error message “userauth_pubkey: key type ssh-rsa not in PubkeyAcceptedAlgorithms [preauth]” indicates that the SSH server is configured to accept specific public key algorithms, and the client attempted to use the “ssh-rsa” algorithm, which is not included in the accepted algorithms list.
To resolve this issue, you have a few options:
Update SSH Key Algorithm: If you are generating a new key pair, consider using a more secure algorithm such as Ed25519 instead of the older RSA algorithm.
Update Server Configuration: If you don’t have control over the client’s key type, you may need to update the server’s SSH configuration to include support for the “ssh-rsa” algorithm. Open the SSH server configuration file (usually located at /etc/ssh/sshd_config), and add or modify the following line:
PubkeyAcceptedAlgorithms +ssh-rsa
After making the change, restart the SSH server.
sudo service ssh restart
Note: Adding “ssh-rsa” might reduce the security of your SSH server, as RSA is considered less secure than some newer algorithms.
Check Key Types: Ensure that you are using the correct key type when attempting to authenticate. If you are using an existing key, make sure it’s the right type (e.g., Ed25519) and not RSA.
Choose the option that best fits your security requirements and constraints. If possible, it’s generally recommended to use more modern and secure key algorithms like Ed25519 over older ones like RSA.
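A sketch of the first option, plus a client-side per-connection override that avoids touching the server (user@host is a placeholder; PubkeyAcceptedAlgorithms requires OpenSSH 8.5+, older clients use PubkeyAcceptedKeyTypes):

```shell
# Generate a modern Ed25519 key pair (recommended over legacy RSA):
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_example -N "" -C "example key"

# Or re-enable ssh-rsa from the client side for a single connection
# (user@host is a placeholder):
ssh -o PubkeyAcceptedAlgorithms=+ssh-rsa user@host
```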
Unpacking libc6:i386 (2.35-0ubuntu3) over (2.31-0ubuntu9.9) ...
Preparing to unpack .../6-libselinux1_3.3-1build2_amd64.deb ...
De-configuring libselinux1:i386 (3.0-1build2) ...
Unpacking libselinux1:amd64 (3.3-1build2) over (3.0-1build2) ...
tar: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /lib/x86_64-linux-gnu/libselinux.so.1)
tar: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /lib/x86_64-linux-gnu/libselinux.so.1)
dpkg-deb: error: tar subprocess returned error exit status 1
dpkg: error processing archive /tmp/apt-dpkg-install-b0waQ0/7-libselinux1_3.3-1build2_i386.deb (--unpack):
 dpkg-deb --control subprocess returned error exit status 2
tar: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /lib/x86_64-linux-gnu/libselinux.so.1)
tar: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /lib/x86_64-linux-gnu/libselinux.so.1)
dpkg-deb: error: tar subprocess returned error exit status 1
dpkg: error processing archive /tmp/apt-dpkg-install-b0waQ0/8-libc-bin_2.35-0ubuntu3_amd64.deb (--unpack):
 dpkg-deb --control subprocess returned error exit status 2
Errors were encountered while processing:
 /tmp/apt-dpkg-install-b0waQ0/4-libc6_2.35-0ubuntu3_amd64.deb
 /tmp/apt-dpkg-install-b0waQ0/7-libselinux1_3.3-1build2_i386.deb
 /tmp/apt-dpkg-install-b0waQ0/8-libc-bin_2.35-0ubuntu3_amd64.deb
/usr/bin/dpkg: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /lib/x86_64-linux-gnu/libselinux.so.1)
/usr/bin/dpkg: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /lib/x86_64-linux-gnu/libselinux.so.1)
/usr/bin/gdbus: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /lib/x86_64-linux-gnu/libselinux.so.1)
/usr/bin/gdbus: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /lib/x86_64-linux-gnu/libselinux.so.1)
E: Sub-process /usr/bin/dpkg returned an error code (1)
E: Sub-process dpkg --set-selections returned an error code (1)
E: Couldn't revert dpkg selection for approved remove/purge after an error was encountered!
This was why you could reach the proxy but couldn't get past it: the request carried no username/password information. So just put that info into it.
TIP: Better to add these lines in a separate file, /etc/apt/apt.conf.d/80proxy. This ensures the changes won't be lost after a version upgrade.
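A sketch of what /etc/apt/apt.conf.d/80proxy could contain — the host, port, and credentials below are placeholders:

```shell
# Placeholders: replace user, pass, proxyhost, and 3128 with your proxy details.
sudo tee /etc/apt/apt.conf.d/80proxy >/dev/null <<'EOF'
Acquire::http::Proxy  "http://user:pass@proxyhost:3128/";
Acquire::https::Proxy "http://user:pass@proxyhost:3128/";
EOF
```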
Could not handshake
Err:3 https://swupdate.openvpn.net/community/openvpn3/repos focal Release
  Could not handshake: The TLS connection was non-properly terminated. [IP: 127.0.0.1 7233]
See "systemctl status etcd.service" and "journalctl -xe" for details.
invoke-rc.d: initscript etcd, action "start" failed.
● etcd.service - etcd - highly-available key value store
   Loaded: loaded (/lib/systemd/system/etcd.service; disabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2024-09-05 22:21:44 EDT; 10ms ago
     Docs: https://github.com/coreos/etcd
           man:etcd
  Process: 71732 ExecStart=/usr/bin/etcd $DAEMON_ARGS (code=exited, status=1/FAILURE)
 Main PID: 71732 (code=exited, status=1/FAILURE)

Sep 05 22:21:43 huawei systemd[1]: Starting etcd - highly-available key value store...
Sep 05 22:21:44 huawei etcd[71732]: etcd on unsupported platform without ETCD_UNSUPPORTED_ARCH=arm64 set.
Sep 05 22:21:44 huawei systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Sep 05 22:21:44 huawei systemd[1]: etcd.service: Failed with result 'exit-code'.
Sep 05 22:21:44 huawei systemd[1]: Failed to start etcd - highly-available key value store.
dpkg: error processing package etcd-server (--configure):
 installed etcd-server package post-installation script subprocess returned error exit status 1
dpkg: dependency problems prevent configuration of etcd:
 etcd depends on etcd-server; however:
  Package etcd-server is not configured yet.
dpkg: error processing package etcd (--configure):
 dependency problems - leaving unconfigured
Setting up zsh (5.4.2-3ubuntu3.2) ...
No apport report written because the error message indicates its a followup error from a previous failure.
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Errors were encountered while processing:
 etcd-server
 etcd
E: Sub-process /usr/bin/dpkg returned an error code (1)
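The key line in the log is "etcd on unsupported platform without ETCD_UNSUPPORTED_ARCH=arm64 set". One way to proceed (a sketch, assuming the stock systemd unit shipped by the package) is a drop-in that sets the variable, then retrying the interrupted configure step:

```shell
# etcd refuses to start on arm64 unless ETCD_UNSUPPORTED_ARCH is set.
sudo mkdir -p /etc/systemd/system/etcd.service.d
sudo tee /etc/systemd/system/etcd.service.d/arch.conf >/dev/null <<'EOF'
[Service]
Environment=ETCD_UNSUPPORTED_ARCH=arm64
EOF
sudo systemctl daemon-reload
sudo dpkg --configure -a        # retry the failed package configuration
```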
In the output of the lsblk command, nvme2n1 refers to a Non-Volatile Memory Express (NVMe) device. Here’s a breakdown of the naming convention:
nvme: This prefix indicates that the device is an NVMe storage device, a type of high-performance solid-state drive (SSD) that connects directly to the motherboard via the PCIe bus. The digit that follows (2) is the controller index, and nN (n1) is the namespace number on that controller — so nvme2n1 is namespace 1 on the third NVMe controller (numbering starts at 0). A partition would append pM, e.g. nvme2n1p1.
# /home was on /dev/md0p1 during curtin installation
/dev/disk/by-id/md-uuid-2d900913:d0a44a15:7bd846dd:a015c95e-part1 /home ext4 defaults 0 0
# /usr was on /dev/md0p2 during curtin installation
/dev/disk/by-id/md-uuid-2d900913:d0a44a15:7bd846dd:a015c95e-part2 /usr ext4 defaults 0 0
# /boot was on /dev/md0p3 during curtin installation
/dev/disk/by-id/md-uuid-2d900913:d0a44a15:7bd846dd:a015c95e-part3 /boot ext4 defaults 0 0
# /var was on /dev/md0p4 during curtin installation
/dev/disk/by-id/md-uuid-2d900913:d0a44a15:7bd846dd:a015c95e-part4 /var ext4 defaults 0 0
# /srv was on /dev/md0p5 during curtin installation
/dev/disk/by-id/md-uuid-2d900913:d0a44a15:7bd846dd:a015c95e-part5 /srv ext4 defaults 0 0
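The md-uuid-… in those by-id paths is the RAID array's own UUID; it can be cross-checked (assuming the array is /dev/md0 and mdadm is installed):

```shell
# The UUID embedded in /dev/disk/by-id/md-uuid-* comes from the md array itself:
mdadm --detail /dev/md0 | grep -i 'UUID'     # array UUID (root required)
ls -l /dev/disk/by-id/ | grep 'md-uuid'      # the matching stable symlinks
```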
# fdisk /dev/sdb
p        # print the partition table
d        # this disk has only one partition (sdb1), so d deletes it without asking; with multiple partitions fdisk would prompt for which one to delete
p        # print the table again: /dev/sdb1 is gone
Command (m for help): n        # new partition
Partition type:
   p   primary (0 primary, 0 extended, 4 free)        # primary partition
   e   extended                                       # extended partition
Select (default p): p          # create a primary partition
Partition number (1-4, default 1):        # partition number; this becomes /dev/sdb1
First sector (2048-2097151, default 2048):        # first sector; press Enter to start at 2048
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-2097151, default 2097151): +50M   # size of the primary partition, here 50M
Partition 1 of type Linux and of size 50 MiB is set
Command (m for help): n        # new partition
Partition type:
   p   primary (1 primary, 0 extended, 3 free)
   e   extended
Select (default p): e          # create an extended partition
Partition number (2-4, default 2):        # press Enter; this becomes /dev/sdb2
First sector (104448-2097151, default 104448):        # press Enter to start at the next free sector
Using default value 104448
Last sector, +sectors or +size{K,M,G} (104448-2097151, default 2097151): +500M   # size of the extended partition, here 500M
Partition 2 of type Extended and of size 500 MiB is set
Command (m for help): n
Partition type:
   p   primary (1 primary, 1 extended, 2 free)
   l   logical (numbered from 5)
Select (default p): l          # create a logical partition
Adding logical partition 5     # logical partitions are numbered from 5
First sector (106496-1128447, default 106496):        # start of the logical partition
Using default value 106496
Last sector, +sectors or +size{K,M,G} (106496-1128447, default 1128447): +200M   # size of the logical partition, here 200M
Partition 5 of type Linux and of size 200 MiB is set
Command (m for help): n
Partition type:
   p   primary (1 primary, 1 extended, 2 free)
   l   logical (numbered from 5)
Select (default p): l          # create a second logical partition
Adding logical partition 6
First sector (518144-1128447, default 518144):
Using default value 518144
Last sector, +sectors or +size{K,M,G} (518144-1128447, default 1128447):   # press Enter to allocate all remaining space
Using default value 1128447
Partition 6 of type Linux and of size 298 MiB is set
Command (m for help): p
...
Disk label type: dos

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048      104447       51200   83  Linux
/dev/sdb2          104448     1128447      512000    5  Extended
/dev/sdb5          106496      516095      204800   83  Linux
/dev/sdb6          518144     1128447      305152   83  Linux
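After writing the table, the new partitions still need filesystems before they can be mounted — a sketch reusing /dev/sdb5 from the transcript above (device names are examples; run against the wrong disk this destroys data):

```shell
partprobe /dev/sdb               # ask the kernel to re-read the partition table
mkfs.ext4 /dev/sdb5              # put a filesystem on the new logical partition
mkdir -p /mnt/data
mount /dev/sdb5 /mnt/data        # mount it (add an /etc/fstab entry to persist)
```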
# /addDisk/DiskNo1 was on /dev/sdi1 during curtin installation
/dev/disk/by-uuid/0ae289c5-51f7-4ef2-a07c-6ec8d123e065 /addDisk/DiskNo1 ext4 defaults 0 0
# /addDisk/DiskNo2 was on /dev/sdl1 during curtin installation
/dev/disk/by-uuid/5c2e1324-ecc5-40dd-a668-4ec682065d9f /addDisk/DiskNo2 ext4 defaults 0 0
# /addDisk/DiskNo3 was on /dev/sdj1 during curtin installation
/dev/disk/by-uuid/8258b393-2e8e-41d1-9b84-0a7f88986443 /addDisk/DiskNo3 ext4 defaults 0 0
# /addDisk/DiskNo4 was on /dev/sdk1 during curtin installation
/dev/disk/by-uuid/ac862e68-9c6f-424a-b4ec-e44e62f7a330 /addDisk/DiskNo4 ext4 defaults 0 0
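To build entries like these for a new disk, the UUIDs can be listed and the edited fstab checked without rebooting (findmnt --verify is from util-linux 2.29+):

```shell
lsblk -o NAME,UUID,FSTYPE,MOUNTPOINT     # UUIDs to paste into /etc/fstab
findmnt --verify                         # sanity-check /etc/fstab before rebooting
```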
A significant portion of pipeline slots are remaining empty. (Does this mean 23.8% of the slots are empty, or that 23.8% are being used?)
When operations take too long in the back-end, they introduce bubbles in the pipeline that ultimately cause fewer pipeline slots containing useful work to be retired per cycle than the machine is capable of supporting.
This opportunity cost results in slower execution.
Long-latency operations like divides and memory operations can cause this,
as can too many operations being directed to a single execution port (for example, more multiply operations arriving in the back-end per cycle than the execution unit can support).
The advice for Bad Speculation: 21.5% is as follows:
A significant proportion (21.5%) of pipeline slots containing useful work are being cancelled.
This can be caused by mispredicting branches or by machine clears. Note that this metric value may be highlighted due to Branch Resteers issue.
Retiring metric
The Retiring metric represents the fraction of Pipeline Slots utilized by useful work, meaning the issued uOps that eventually get retired.
Ideally, all Pipeline Slots would be attributed to the Retiring category.
A Retiring value of 100% would indicate that the maximum possible number of uOps retired per cycle has been achieved.
Maximizing Retiring typically increases the Instructions-Per-Cycle (IPC) metric.
Note that a high Retiring value does not necessarily mean there is no more room for performance improvement. For example, Microcode assists are categorized under Retiring; they hurt performance and can often be avoided.
The Front-End Bound metric represents the fraction of slots where the processor's Front-End undersupplies its Back-End — i.e., whether the Front-End delivers enough uOps to keep the Back-End busy.
The Front-End is the first part of the processor core, responsible for fetching the operations that are later executed by the Back-End; it decodes instructions into uOps for the Back-End to process.
Within the Front-End, a branch predictor predicts the next address to fetch, cache lines are fetched from the memory subsystem, parsed into instructions, and finally decoded into micro-ops (uOps).
The Front-End Bound metric denotes unutilized issue slots when there is no Back-End stall (bubbles where the Front-End delivered no uOps while the Back-End could have accepted them). For example, stalls due to instruction-cache misses are categorized as Front-End Bound.
The Back-End Bound metric represents the fraction of Pipeline Slots where no uOps are being delivered due to a lack of resources for accepting new uOps in the Back-End — i.e., uOps back up because hardware resources are under pressure.
The Back-End is the portion of the processor core where an out-of-order scheduler dispatches ready uOps to their respective execution units; once completed, these uOps are retired according to program order (out-of-order execution, in-order retirement).
For example, stalls due to data-cache misses, or stalls due to the divider unit being overloaded, are both categorized as Back-End Bound. Back-End Bound is further divided into two main categories: Memory Bound and Core Bound.
Memory Bound
This metric shows how memory subsystem issues affect the performance. Memory Bound measures a fraction of slots where pipeline could be stalled due to demand load or store instructions. This accounts mainly for incomplete in-flight memory demand loads that coincide with execution starvation in addition to less common cases where stores could imply back-pressure on the pipeline.
Core Bound
This metric represents how much Core (non-memory) issues were the bottleneck.
A shortage of hardware compute resources,
and dependencies between the software's instructions, are both categorized under Core Bound.
Hence it may indicate that
the machine ran out of out-of-order (OOO) resources,
certain execution units are overloaded,
or dependencies in the program's data or instruction flow are limiting performance (e.g., chained long-latency floating-point arithmetic operations).
Bad Speculation (e.g., branch mispredictions)
This category represents the fraction of Pipeline Slots wasted due to incorrect speculation.
This includes slots used to issue uOps that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from an earlier incorrect speculation.
For example, wasted work due to mispredicted branches is categorized as a Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.
My guess about the "Nukes" here: data speculation goes wrong, and the resulting impact on memory accesses is as big as a nuclear blast.
Memory Bound
Memory Bound: 11.9% of Pipeline Slots
    L1 Bound: 7.9%
    L2 Bound: 0.2%
    L3 Bound: 2.5%
    DRAM Bound: 2.0%
    Store Bound: 0.3%
    NUMA: % of Remote Accesses: 13.2%
This metric shows how memory subsystem issues affect performance. Memory Bound measures the fraction of slots where the pipeline could be stalled due to demand load or store instructions — i.e., how many pipeline slots were forced to wait because of loads or stores.
This mainly accounts for incomplete in-flight demand loads that coincide with execution starvation (does this mean non-contiguous memory accesses?),
in addition to less common cases where stores can imply back-pressure on the pipeline.
L1 Bound
This metric shows how often the machine was stalled without missing the L1 data cache — i.e., how often instructions stall even though L1 did not miss (a stall from some other cause).
The L1 cache typically has the shortest latency. However, in certain cases, such as loads blocked on older stores, a load might suffer high latency even though it is satisfied by L1; for example, a load of a value that was just stored can still see a large delay.
L2 Bound
This metric shows how often the machine was stalled on the L2 cache. Avoiding cache misses (L1 misses/L2 hits) will improve the latency and increase performance.
L3 Bound
This metric shows how often the CPU was stalled on the L3 cache, or contended with a sibling core. Avoiding cache misses (L2 misses/L3 hits) improves the latency and increases performance.
DRAM Bound
This metric shows how often the CPU was stalled on main memory (DRAM). Caching typically improves the latency and increases performance.
DRAM Bandwidth Bound
This metric represents percentage of elapsed time the system spent with high DRAM bandwidth utilization. Since this metric relies on the accurate peak system DRAM bandwidth measurement, explore the Bandwidth Utilization Histogram and make sure the Low/Medium/High utilization thresholds are correct for your system. You can manually adjust them, if required.
Store Bound
This metric shows how often the CPU was stalled on store operations. Even though memory store accesses do not typically stall out-of-order CPUs, there are a few cases where stores can lead to actual stalls.
NUMA: % of Remote Accesses
In NUMA (non-uniform memory architecture) machines, memory requests missing LLC may be serviced either by local or remote DRAM. Memory requests to remote DRAM incur much greater latencies than those to local DRAM. It is recommended to keep as much frequently accessed data local as possible. This metric shows percent of remote accesses, the lower the better.
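On Linux, locality can be inspected and enforced with numactl (assuming it is installed; ./my_app is a placeholder workload):

```shell
numactl --hardware                               # list NUMA nodes, sizes, and distances
numactl --cpunodebind=0 --membind=0 ./my_app     # pin the app's CPUs and memory to node 0
```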
(The approaches from earlier can be used here.)
Vectorization
This metric represents the percentage of packed (vectorized) floating point operations. 0% means that the code is fully scalar. The metric does not take into account the actual vector length that was used by the code for vector instructions. So if the code is fully vectorized and uses a legacy instruction set that loaded only half a vector length, the Vectorization metric shows 100%.
A significant fraction of floating point arithmetic instructions are scalar. Use Intel Advisor to see possible reasons why the code was not vectorized.
SP FLOPs
The metric represents the percentage of single precision floating point operations from all operations executed by the applications. Use the metric for rough estimation of a SP FLOP fraction. If FMA vector instructions are used the metric may overcount.
X87 FLOPs
The metric represents the percentage of x87 floating point operations from all operations executed by the applications. Use the metric for rough estimation of an x87 fraction. If FMA vector instructions are used the metric may overcount.
This metric represents the ratio between arithmetic floating point instructions and memory write instructions. A value less than 0.5 indicates unaligned data access for vector operations, which can negatively impact the performance of vector instruction execution.
-mtriple=<target triple>   e.g. -mtriple=x86_64-unknown-unknown
-march=<arch>              Specify the architecture for which to analyze the code. It defaults to the host default target.
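A minimal invocation, in the spirit of the llvm-mca documentation's example (the assembly snippet and the -mcpu value are illustrative):

```shell
# A tiny x86-64 assembly block to analyze:
cat > dot-product.s <<'EOF'
vmulps  %xmm0, %xmm1, %xmm2
vhaddps %xmm2, %xmm2, %xmm3
vhaddps %xmm3, %xmm3, %xmm4
EOF
# Prints Dispatch Width, Block RThroughput, resource pressure, etc.:
llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 dot-product.s
```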
A delta between Dispatch Width and this field is an indicator of a performance issue.
The delta between the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is an indicator of a performance bottleneck caused by the lack of hardware resources, and the Resource pressure view can help to identify the problematic resource usage.
Dispatch Width
The maximum number of micro-ops (uOps) that can be dispatched to the out-of-order back-end per cycle.
Block RThroughput (Block Reciprocal Throughput)
The reciprocal of the block throughput: a theoretical lower bound on the number of cycles per iteration of the block, ignoring loop-carried dependencies.
It is limited by the dispatch rate and by the availability of hardware resources.
Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage
* The size of the **dispatch group** is smaller than the processor's dispatch width.
* There are enough entries in the **reorder buffer**.
* There are enough **physical registers** to do register renaming.
* The schedulers are **not full**.
llvm-mca’s scheduler internally groups instructions into three sets:
* WaitSet: a set of instructions whose operands are not ready.
* ReadySet: a set of instructions ready to execute.
* IssuedSet: a set of instructions executing.
### Write-Back and Retire Stage
retire control unit
1. When an instruction is executed, the retire control unit flags it as "ready to retire."
2. Instructions are retired in program order.
3. Retirement frees the instruction's physical registers.
### Load/Store Unit and Memory Consistency Model
The load/store unit (LSUnit) is used to simulate out-of-order memory operations.
The rules are:
1. A younger load is allowed to pass an older load only if there are no intervening stores or barriers between the two loads.
2. A younger load is allowed to pass an older store provided that the load does not alias with the store.
3. A younger store is not allowed to pass an older store (stores cannot be reordered with each other).
4. A younger store is not allowed to pass an older load.
By default, llvm-mca assumes that loads do not alias store operations (-noalias=true). Under this assumption, younger loads are always allowed to pass older stores. ???
in the case of write-combining memory, rule 3 could be relaxed to allow reordering of non-aliasing store operations.???
Four things the LSUnit does not model:
1. The LSUnit does not know when store-to-load forwarding may occur.
2. The LSUnit does not know anything about cache hierarchy and memory types.
3. The LSUnit does not know how to identify serializing operations and memory fences.
4. The LSUnit does not attempt to predict whether a load or store hits or misses the L1 cache (cache behavior is not modeled; loads are optimistically assumed to hit L1 and incur only the load-to-use latency).
An in-order processor is modeled as a single InOrderIssueStage stage, which bypasses the Dispatch, Scheduler, and Load/Store units. An instruction is issued as soon as its operand registers are available and the resource requirements are met. Multiple instructions can be issued in one cycle, depending on the value of the IssueWidth parameter in LLVM's scheduling model. Once issued, an instruction is moved to the IssuedInst set until it is ready to retire. llvm-mca ensures that writes are committed in order; however, an instruction is allowed to commit writes and retire out of order if the RetireOOO property is true for at least one of its writes. ???
(llvm-mca detects Intel syntax by the presence of an .intel_syntax directive at the beginning of the input. By default its output syntax matches that of its input.)