TLB: real pagewalk overhead
Introduction
For an introduction to the TLB itself, see the related page-table notes.
Theory
Broadly, the more random an application's memory accesses and the larger its data set, the higher the page-walk (pgw) overhead.
An ISCA 2013 paper demonstrates the page-walk overhead in big-memory servers.
See also Guvenilir and Patt, "Tailored Page Sizes," ISCA 2020.
Machine configuration
```shell
# shaojiemike @ snode6 in ~/github/hugoMinos on git:main x [11:17:05]
```
OS config
By default there are no huge pages (typically 2 MB, or 1 GB, on x86-64) available to use.
```shell
$ cat /proc/meminfo | grep huge -i
```
The meaning of these fields is explained here.
Setting the page size
Other ways require changing source code or system configuration:
- Way 1: Linux transparent huge page (THP) support lets the kernel automatically promote regular memory pages into huge pages. Check its status with `cat /sys/kernel/mm/transparent_hugepage/enabled`, but getting THP to actually apply takes some care.
- Way 2: huge pages are allocated from a reserved pool, which requires changing system configuration, for example `echo 20 > /proc/sys/vm/nr_hugepages`. You then need to write special C++ code to use the huge pages:
```c
// using mmap system call to request huge page
```
Without recompiling
Alternatively, a blog post uses the unmaintained tool hugeadm together with the iodlr library to enable huge pages without recompiling.
```shell
sudo apt install libhugetlbfs-bin
```
After that, /proc/meminfo changes:
```shell
$ cat /proc/meminfo | grep huge -i
```
Using the iodlr library:
```shell
git clone
```
Application measurements
Measurement tools from code:
```shell
# shaojiemike @ snode6 in ~/github/PIA_huawei on git:main x [17:40:50]
```
Average cost per miss (from program start until steady state): a dTLB miss read takes about 2450 cycles, and an iTLB miss read about 4027 cycles.
Time breakdown in example cases:
- Reading data is a small share of the time, around 2.5%.
- For graph applications such as PageRank running in parallel, it spikes to 22%.
- BFS tops out at about 5%; its accesses are not as random.
- But gemv at sizes 65000 and 100000, before exceeding memory, sits at only 0.24% even when fully compute-bound.
  - Reason: access pattern. Graph applications typically access memory randomly and irregularly. Unlike matrix-vector multiplication (gemv), which usually accesses memory sequentially, they cannot exploit spatial locality; sequential access lets prefetching and cache-block reuse cut the number of TLB misses.
- GitHub: GUOPS can reach 90%.
- DAMOV's Ligra PageRank can reach 90% with the 20M input case.
gemm
- Naive gemm can reach 100% in some situations.
- When the matrices are too big to fit in cache, the second matrix is accessed across rows, so it misses the cache on almost every access.
- The -O3 flag seems to bring no time reduction, because the generated code contains no SIMD instructions.
- Memory access time = page-walk time + TLB access time + time to load data into cache.
gemm's core loop is:
```c
for(int i=0; i<N; i++){
```
The real time breakdown is as follows. TODO:
- First, use perf to obtain the detailed timing.
bigJump
Hand-written code to test whether the TLB entries run out:
```shell
$ ./tlbstat -c '../../test/manual/bigJump.exe 1 10 100'
```
In this case, the TLB miss rate reaches 47/53 ≈ 88.7%.
Big bucket hash table
Using a big hash table.
Other apps
Any algorithm that does random accesses into a large memory region will likely suffer from TLB misses. Examples are plenty: binary search in a big array, large hash tables, histogram-like algorithms, etc.
Further study needed
None for now.
Problems encountered
None for now.
Motivation, summary, reflections, and rants~~
References
Parts of the answers above came from ChatGPT-3.5 and have not been cross-checked for correctness.
None.