SHAOJIE'S BOOK

Posted 2023-07-28 | Updated 2025-01-30 | Architecture | 10 minutes read (About 1511 words)

objdump file

Disassembly of section .plt:

0000000000402020 <.plt>:
  402020: ff 35 e2 bf 02 00     pushq  0x2bfe2(%rip)        # 42e008 <_GLOBAL_OFFSET_TABLE_+0x8>
  402026: ff 25 e4 bf 02 00     jmpq   *0x2bfe4(%rip)        # 42e010 <_GLOBAL_OFFSET_TABLE_+0x10>
  40202c: 0f 1f 40 00           nopl   0x0(%rax)

0000000000402030 <_Znam@plt>:
  402030: ff 25 e2 bf 02 00     jmpq   *0x2bfe2(%rip)        # 42e018 <_Znam@GLIBCXX_3.4>
  402036: 68 00 00 00 00        pushq  $0x0
  40203b: e9 e0 ff ff ff        jmpq   402020 <.plt>

0000000000402040 <_ZNSo3putEc@plt>:
  402040: ff 25 da bf 02 00     jmpq   *0x2bfda(%rip)        # 42e020 <_ZNSo3putEc@GLIBCXX_3.4>
  402046: 68 01 00 00 00        pushq  $0x1
  40204b: e9 d0 ff ff ff        jmpq   402020 <.plt>

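The .plt section implements lazy binding by means of the Procedure Linkage Table (PLT). In the listing above, each stub first jumps through its entry in the global offset table; before the symbol has been resolved, that entry points back at the stub's own pushq, which pushes the relocation index and jumps to the common entry at 402020 so that the dynamic linker can resolve the symbol and patch the entry.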

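How OpenMP code is executed

omp_outlined functions

Question: the objdump output contains many functions with names like <.omp_outlined..16>:, yet main never calls them. How does OpenMP actually execute this code?

When a C/C++ program that uses OpenMP directives is compiled, the compiler automatically generates functions named .omp_outlined. (this naming is Clang's). Each one holds the body of an OpenMP construct that the compiler has moved ("outlined") out of the enclosing function. They are not called directly from main; they come into play as follows:

Key runtime services such as thread creation and synchronization live in the OpenMP runtime library, which initializes them itself, typically at program startup or on first use.
For a parallel region, the compiler emits a runtime call at the point where the master thread enters the region and passes it the address of the corresponding .omp_outlined. function; the runtime sets up the thread team and has every thread run that function.
Work distribution and synchronization are likewise carried out implicitly through runtime support functions.
For a parallel loop, the compiler moves the loop body into an .omp_outlined. function, and each thread runs it over the iterations assigned to it.
Reductions, changes in data scoping, and similar clauses also show up in these functions and in additional runtime calls.

So the .omp_outlined. functions are triggered and scheduled implicitly by the runtime library; user code never calls them directly. They are an essential part of the OpenMP implementation, produced by the compiler and driven by the runtime. The user only writes OpenMP directives and need not care about the calling details.

Overall, this mechanism makes parallel execution transparent and reduces the user's workload.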

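OpenMP in assembly

The generated code differs between compilers and runtimes: some binaries call functions whose names start with GOMP_parallel_start (GCC's libgomp), while the x86 code below calls __kmpc_fork_call (the LLVM/Intel runtime).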

405854:	48 c7 84 24 a0 00 00 	movq   $0x4293b9,0xa0(%rsp)
40585b:	00 b9 93 42 00 
405860:	48 8d bc 24 90 00 00 	lea    0x90(%rsp),%rdi
405867:	00 
405868:	ba 10 5f 40 00       	mov    $0x405f10,%edx
40586d:	be 02 00 00 00       	mov    $0x2,%esi
405872:	4c 89 f9             	mov    %r15,%rcx
405875:	4c 8b 44 24 20       	mov    0x20(%rsp),%r8
40587a:	31 c0                	xor    %eax,%eax
40587c:	e8 ff cb ff ff       	callq  402480 <__kmpc_fork_call@plt>
405881:	48 8b 7c 24 60       	mov    0x60(%rsp),%rdi

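This assembly implements an OpenMP parallel construct. Step by step:

movq writes the constant 0x4293b9 to 0xa0(%rsp). The slot lies 0x10 bytes into the structure whose address is passed in %rdi, and the constant falls inside .rodata (0x429000 to 0x42a180 in the readelf listing of this binary), so it is probably the source-location string of the location argument rather than a team parameter.
lea puts the address rsp+0x90 into %rdi as the 1st argument (the location structure).
%edx is set to 0x405f10, the 3rd argument: the address of the outlined function (the microtask) that the threads will run.
%esi is set to 2, the 2nd argument: the number of extra arguments that follow the function pointer.
%r15 is copied to %rcx, the 4th argument: the first of the two extra arguments forwarded to the outlined function (most likely a pointer to shared data, not a thread number).
%r8 is loaded from 0x20(%rsp), the 5th argument: the second forwarded argument, probably another pointer to shared data.
%eax is cleared because __kmpc_fork_call is variadic: the x86-64 System V calling convention expects %al to hold the number of vector registers used for arguments, here 0.
callq invokes __kmpc_fork_call, the OpenMP runtime function that runs a function in parallel on a team of threads.
1. Does "kmpc fork call" stand for something like "multiple parallel call"?
The final mov does not store a return value (__kmpc_fork_call returns void); in AT&T syntax it loads 0x60(%rsp) into %rdi for the code that follows.

So this code calls the OpenMP runtime to execute a function in parallel: it prepares the arguments and then calls the runtime API.

With the runtime library's support functions, OpenMP parallelism can be followed, and implemented, at the assembly level.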
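To make the division of labor concrete, here is a minimal sketch of outlining. It is not the real libomp interface: `Microtask`, `Shared`, `outlined_region`, `fake_fork_call`, and `sum_with_fork` are hypothetical names, and the stand-in runtime simply calls the outlined function once per thread ID in turn, where a real runtime would hand it to a team of worker threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the signature of a compiler-outlined region body.
// (Real Clang-generated functions receive pointers to thread IDs first; this
// sketch only mirrors the idea.)
using Microtask = void (*)(int thread_id, int num_threads, void* shared);

// Data the parallel region shares, passed by pointer much like the extra
// arguments in the disassembly.
struct Shared {
    const std::vector<int>* input;
    std::vector<long>* partial;  // one slot per thread
};

// What the compiler would produce from the region body: each thread sums a
// contiguous slice of the input and stores its partial result.
void outlined_region(int thread_id, int num_threads, void* shared) {
    auto* s = static_cast<Shared*>(shared);
    const std::size_t n = s->input->size();
    const std::size_t t = static_cast<std::size_t>(thread_id);
    const std::size_t nt = static_cast<std::size_t>(num_threads);
    long sum = 0;
    for (std::size_t i = n * t / nt; i < n * (t + 1) / nt; ++i) {
        sum += (*s->input)[i];
    }
    (*s->partial)[t] = sum;
}

// Hypothetical stand-in for the runtime entry point. A real runtime runs
// `task` on several threads at once; here each "thread" is run in turn.
void fake_fork_call(Microtask task, int num_threads, void* shared) {
    assert(task != nullptr && num_threads > 0);
    for (int t = 0; t < num_threads; ++t) {
        task(t, num_threads, shared);
    }
}

// The enclosing function only packs the arguments and calls the runtime.
long sum_with_fork(const std::vector<int>& input, int num_threads) {
    std::vector<long> partial(static_cast<std::size_t>(num_threads), 0L);
    Shared shared{&input, &partial};
    fake_fork_call(outlined_region, num_threads, &shared);
    long total = 0;
    for (long p : partial) total += p;
    return total;
}
```

Calling `sum_with_fork(data, 4)` never names `outlined_region` at the call site, which matches what objdump shows: main contains only the call into the runtime.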

readelf

Location and meaning of each section (see the reference document)

$ readelf -S bfs.inj
There are 37 section headers, starting at offset 0xbe8e8: 
(That is, the section header table starts at byte offset 0xbe8e8 in the file.)

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  No.  Section name      Section type     Virtual address   Offset in file
       Size              Entry size (if fixed)  Flags  Link info  Extra info  Alignment
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .interp           PROGBITS         00000000004002a8  000002a8
       000000000000001c  0000000000000000   A       0     0     1
  [ 2] .note.gnu.build-i NOTE             00000000004002c4  000002c4
       0000000000000024  0000000000000000   A       0     0     4
  [ 3] .note.ABI-tag     NOTE             00000000004002e8  000002e8
       0000000000000020  0000000000000000   A       0     0     4
  [ 4] .gnu.hash         GNU_HASH         0000000000400308  00000308
       000000000000005c  0000000000000000   A       5     0     8
  [ 5] .dynsym           DYNSYM           0000000000400368  00000368
       00000000000007e0  0000000000000018   A       6     1     8
  [ 6] .dynstr           STRTAB           0000000000400b48  00000b48
       0000000000000b1d  0000000000000000   A       0     0     1
  [ 7] .gnu.version      VERSYM           0000000000401666  00001666
       00000000000000a8  0000000000000002   A       5     0     2
  [ 8] .gnu.version_r    VERNEED          0000000000401710  00001710
       0000000000000110  0000000000000000   A       6     5     8
  [ 9] .rela.dyn         RELA             0000000000401820  00001820
       00000000000000f0  0000000000000018   A       5     0     8
  [10] .rela.plt         RELA             0000000000401910  00001910
       00000000000006c0  0000000000000018  AI       5    24     8
  [11] .init             PROGBITS         0000000000402000  00002000
       000000000000001b  0000000000000000  AX       0     0     4
  [12] .plt              PROGBITS         0000000000402020  00002020
       0000000000000490  0000000000000010  AX       0     0     16
  [13] .text             PROGBITS         00000000004024b0  000024b0
       0000000000026475  0000000000000000  AX       0     0     16
  [14] .fini             PROGBITS         0000000000428928  00028928
       000000000000000d  0000000000000000  AX       0     0     4
  [15] .rodata           PROGBITS         0000000000429000  00029000
       0000000000001180  0000000000000000   A       0     0     16
  [16] .eh_frame_hdr     PROGBITS         000000000042a180  0002a180
       00000000000002ac  0000000000000000   A       0     0     4
  [17] .eh_frame         PROGBITS         000000000042a430  0002a430
       0000000000001780  0000000000000000   A       0     0     8
  [18] .gcc_except_table PROGBITS         000000000042bbb0  0002bbb0
       00000000000005d0  0000000000000000   A       0     0     4
  [19] .init_array       INIT_ARRAY       000000000042dbc8  0002cbc8
       0000000000000010  0000000000000008  WA       0     0     8
  [20] .fini_array       FINI_ARRAY       000000000042dbd8  0002cbd8
       0000000000000008  0000000000000008  WA       0     0     8
  [21] .data.rel.ro      PROGBITS         000000000042dbe0  0002cbe0
       00000000000001f0  0000000000000000  WA       0     0     8
  [22] .dynamic          DYNAMIC          000000000042ddd0  0002cdd0
       0000000000000220  0000000000000010  WA       6     0     8
  [23] .got              PROGBITS         000000000042dff0  0002cff0
       0000000000000010  0000000000000008  WA       0     0     8
  [24] .got.plt          PROGBITS         000000000042e000  0002d000
       0000000000000258  0000000000000008  WA       0     0     8
  [25] .data             PROGBITS         000000000042e258  0002d258
       0000000000000010  0000000000000000  WA       0     0     8
  [26] .bss              NOBITS           000000000042e280  0002d268
       0000000000000180  0000000000000000  WA       0     0     64
  [27] .comment          PROGBITS         0000000000000000  0002d268
       000000000000004a  0000000000000001  MS       0     0     1
  [28] .debug_info       PROGBITS         0000000000000000  0002d2b2
       000000000002a06e  0000000000000000           0     0     1
  [29] .debug_abbrev     PROGBITS         0000000000000000  00057320
       0000000000000a57  0000000000000000           0     0     1
  [30] .debug_line       PROGBITS         0000000000000000  00057d77
       000000000000af9a  0000000000000000           0     0     1
  [31] .debug_str        PROGBITS         0000000000000000  00062d11
       0000000000010328  0000000000000001  MS       0     0     1
  [32] .debug_loc        PROGBITS         0000000000000000  00073039
       0000000000042846  0000000000000000           0     0     1
  [33] .debug_ranges     PROGBITS         0000000000000000  000b587f
       00000000000054c0  0000000000000000           0     0     1
  [34] .symtab           SYMTAB           0000000000000000  000bad40
       00000000000018c0  0000000000000018          35   106     8
  [35] .strtab           STRTAB           0000000000000000  000bc600
       0000000000002177  0000000000000000           0     0     1
  [36] .shstrtab         STRTAB           0000000000000000  000be777
       000000000000016c  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  l (large), p (processor specific)

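Field meanings

Type field: for the meaning of each value, see the reference document, 1-10.
Link field: the value is an index into the section header table. Indices start at 0, which denotes the first section header entry, and so on. For example, 5 means the section is related to [ 5] .dynsym.

Worth noting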

One section type, SHT_NOBITS described below, occupies no
space in the file, and its sh_offset member locates the conceptual placement in the
file.

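So the file offset does not advance across .bss: .data ends at 0x2d258 + 0x10 = 0x2d268, .bss is nominally placed there, and the next section (.comment) also starts at 0x2d268.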

[25] .data             PROGBITS         000000000042e258  0002d258
     0000000000000010  0000000000000000  WA       0     0     8
[26] .bss              NOBITS           000000000042e280  0002d268
     0000000000000180  0000000000000000  WA       0     0     64

.got

global offset table

.plt

This section holds the procedure linkage table. See ‘‘Special Sections’’ in Part 1 and ‘‘Procedure Linkage Table’’ in Part 2 for more information.

Function symbols (those with type STT_FUNC) in shared object files have special significance. When
another object file references a function from a shared object, the link editor automatically creates a procedure linkage table entry for the referenced symbol.

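Reference document, 2-17 (page 48)

Further study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants

References

Some of the answers above come from ChatGPT-3.5 and have not been cross-checked for correctness.

None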

Posted 2022-01-07 | Updated 2025-01-30 | Tutorials | 11 minutes read (About 1698 words)

OpenMP

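Thread binding

OpenMP 4.0 provides the OMP_PLACES and OMP_PROC_BIND environment variables to specify how the OpenMP threads of a program are bound to processors. The two are usually used together. OMP_PLACES specifies the places on the machine (hardware threads, cores, or sockets) that threads are bound to. OMP_PROC_BIND specifies the binding policy (thread affinity policy), which determines how threads are assigned to places.

Besides these two environment variables, OpenMP 4.0 also provides the proc_bind clause, which can be used on the parallel directive. It specifies how the team of threads executing the parallel region is bound to processors.

For Slurm, MPI, and OpenMP binding methods, see the Tsinghua documentation:

OMP_NUM_THREADS=28 OMP_PROC_BIND=true OMP_PLACES=cores: each thread is bound to one core, using the default distribution (thread n is bound to core n);
OMP_NUM_THREADS=2 OMP_PROC_BIND=true OMP_PLACES=sockets: each thread is bound to one socket;
OMP_NUM_THREADS=4 OMP_PROC_BIND=close OMP_PLACES=cores: each thread is bound to one core, and threads are placed contiguously on a socket (bound to cores 0, 1, 2, 3);
OMP_NUM_THREADS=4 OMP_PROC_BIND=spread OMP_PLACES=cores: each thread is bound to one core, and threads are spread as far apart as possible (bound to cores 0, 7, 14, 21).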


Observe the result with lscpu together with htop:
NUMA node0 CPU(s):               0-15,32-47
NUMA node1 CPU(s):               16-31,48-63

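Directive format

Static extent

The code that textually follows a directive and is enclosed in its structured block.

Orphaned directive

An OpenMP directive that appears outside the static extent of the construct it binds to, for example a for directive inside a function that is called from a parallel region.

parallel

The code in a parallel region is executed by all threads.

for

The for directive specifies that the iterations of the loop immediately following it are executed in parallel by the thread team.

sections

The sections directive specifies that the enclosed code is divided among the threads of the team.

Each section is executed once, by one of the threads.

single

The single directive specifies that the enclosed code is executed by only one thread of the team.

Threads in the team that do not execute the single block wait at the end of the block, unless a nowait clause is given.

From https://ppc.cs.aalto.fi/ch3/nowait/

Combined parallel for / parallel sections directives

The parallel for directive declares a parallel region that contains a single for directive.
The parallel sections directive declares a parallel region that contains a single sections directive.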
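As a small illustration of the combined form, the sketch below writes the same loop twice: once with parallel for and once as a parallel region containing a separate for directive. The function names `saxpy` and `saxpy_split` are made up for this example. Because the iterations are independent, the result is the same whether or not the compiler has OpenMP enabled.

```cpp
#include <cassert>
#include <vector>

// Combined construct: one parallel region containing exactly one loop.
std::vector<double> saxpy(double a, const std::vector<double>& x,
                          const std::vector<double>& y) {
    assert(x.size() == y.size());
    std::vector<double> out(x.size());
    const long n = static_cast<long>(x.size());
#pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        out[i] = a * x[i] + y[i];  // iterations are independent
    }
    return out;
}

// Equivalent spelled-out form: a parallel region with a nested for directive.
std::vector<double> saxpy_split(double a, const std::vector<double>& x,
                                const std::vector<double>& y) {
    assert(x.size() == y.size());
    std::vector<double> out(x.size());
    const long n = static_cast<long>(x.size());
#pragma omp parallel
    {
#pragma omp for
        for (long i = 0; i < n; ++i) {
            out[i] = a * x[i] + y[i];
        }
    }
    return out;
}
```

The split form is useful when the region needs per-thread setup before the loop; otherwise the combined form is shorter.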

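Synchronization constructs

1. master directive
   1. The enclosed code is executed only by the master thread.
2. critical directive
   1. The enclosed code is executed by one thread at a time; other threads block at the entry to the critical region.
   2. Syntax: #pragma omp critical [(name)] newline
3. barrier directive
   1. Synchronizes all threads in a team: threads that arrive early block here and wait for the others.
4. atomic directive
   1. Specifies that a particular memory location is updated atomically.

#pragma omp atomic
x++;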
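5. flush directive
   1. Marks a synchronization point at which all threads are guaranteed to see a consistent view of memory.
   2. ![](https://pic.shaojiemike.top/img/20220108093740.png)
6. ordered directive
   1. Compared with critical, it additionally enforces an order.
   2. It may appear only in the dynamic extent of a for or parallel for directive.
7. threadprivate directive: makes a global, file-scope variable private to each thread inside parallel regions.
   1. Each thread keeps its own private copy of the variable.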


### critical vs atomic
The fastest way is neither critical nor atomic. As a rough figure, an addition inside a critical section is about 200 times more expensive than a plain addition, and an atomic addition is about 25 times more expensive. (**It may not be that expensive** in practice: an atomic operation adds a few cycles of overhead for synchronizing a cache line on top of roughly one cycle, whereas a critical section incurs **the cost of a lock**.)

The fastest option (not always applicable) is to give each thread its own counter and perform a reduction when you need the total sum.
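The two options can be put side by side in a short sketch. The function names `count_even_atomic` and `count_even_reduction` are invented for this example; the first updates one shared counter atomically, the second lets the reduction clause give each thread its own counter and add them up at the end. No timing is claimed here, only that both return the same count.

```cpp
#include <cassert>
#include <vector>

// Shared counter, protected by an atomic update on every hit.
long count_even_atomic(const std::vector<int>& v) {
    long count = 0;
    const long n = static_cast<long>(v.size());
#pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        if (v[i] % 2 == 0) {
#pragma omp atomic
            count++;
        }
    }
    return count;
}

// Per-thread counters, summed once when the loop ends.
long count_even_reduction(const std::vector<int>& v) {
    long count = 0;
    const long n = static_cast<long>(v.size());
#pragma omp parallel for reduction(+ : count)
    for (long i = 0; i < n; ++i) {
        if (v[i] % 2 == 0) {
            count++;  // updates this thread's private copy
        }
    }
    return count;
}
```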

### critical vs ordered
omp critical is for mutual exclusion, whereas omp ordered refers to a specific loop and ensures that the region **executes sequentially in the order of loop iterations**. Therefore omp ordered is stronger than omp critical, but it also only makes sense within a loop.

omp ordered has some other clauses, such as simd to enforce the use of a single SIMD lane only. You can also specify dependencies manually with the depend clause.

Note: Both omp critical and omp ordered regions have an implicit memory flush at the entry and the exit.

### ordered example

vector<int> v;

#pragma omp parallel for ordered schedule(dynamic, anyChunkSizeGreaterThan1)
for (int i = 0; i < n; ++i){
    …
    …
    …
    #pragma omp ordered
    v.push_back(i);
}

```
tid  List of     Timeline
     iterations
0    0,1,2       ==o==o==o
1    3,4,5       ==.......o==o==o
2    6,7,8       ==..............o==o==o
```

Here = shows that the thread is executing code in parallel, o is when the thread is executing the ordered region, and . is the thread being idle, waiting for its turn to execute the ordered region.

With schedule(static,1) the following would happen:

tid  List of     Timeline
     iterations
0    0,3,6       ==o==o==o
1    1,4,7       ==.o==o==o
2    2,5,8       ==..o==o==o

Directive binding and nesting rules

Clauses

See https://docs.microsoft.com/en-us/cpp/parallel/openmp/reference/openmp-clauses?view=msvc-160

#pragma omp parallel for collapse(2)
for( int y = y1; y < y2; y++ )
{
	for( int x = x1; x < x2; x++ )
	{
		// loop body: collapse(2) fuses the y and x loops into one iteration space
	}
}

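schedule

In the table below, each row is one loop iteration (20 iterations, 4 threads) and each entry is the ID of the thread that executed that iteration under the given schedule kind and chunk size.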

------------------------------------------------
| static | static | dynamic | dynamic | guided |
|    1   |    5   |    1    |    5    |        |
------------------------------------------------
|    0   |    0   |    0    |    2    |    1   |
|    1   |    0   |    3    |    2    |    1   |
|    2   |    0   |    3    |    2    |    1   |
|    3   |    0   |    3    |    2    |    1   |
|    0   |    0   |    2    |    2    |    1   |
|    1   |    1   |    2    |    3    |    3   |
|    2   |    1   |    2    |    3    |    3   |
|    3   |    1   |    0    |    3    |    3   |
|    0   |    1   |    0    |    3    |    3   |
|    1   |    1   |    0    |    3    |    2   |
|    2   |    2   |    1    |    0    |    2   |
|    3   |    2   |    1    |    0    |    2   |
|    0   |    2   |    1    |    0    |    3   |
|    1   |    2   |    2    |    0    |    3   |
|    2   |    2   |    2    |    0    |    0   |
|    3   |    3   |    2    |    1    |    0   |
|    0   |    3   |    3    |    1    |    1   |
|    1   |    3   |    3    |    1    |    1   |
|    2   |    3   |    3    |    1    |    1   |
|    3   |    3   |    0    |    1    |    3   |
------------------------------------------------
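The two static columns follow directly from the rule that schedule(static, chunk) deals chunks out round-robin by thread number, so they can be reproduced by arithmetic. The helper names `static_owner` and `static_column` below are made up for this sketch. The dynamic and guided columns depend on run-time timing and cannot be predicted this way.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Thread that runs iteration `i` under schedule(static, chunk) with
// `num_threads` threads: chunk k goes to thread k % num_threads.
int static_owner(int i, int chunk, int num_threads) {
    assert(i >= 0 && chunk > 0 && num_threads > 0);
    return (i / chunk) % num_threads;
}

// One column of the table: the owners of iterations 0..n-1.
std::vector<int> static_column(int n, int chunk, int num_threads) {
    assert(n >= 0);
    std::vector<int> owners;
    owners.reserve(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) {
        owners.push_back(static_owner(i, chunk, num_threads));
    }
    return owners;
}
```

`static_column(20, 1, 4)` gives the repeating 0, 1, 2, 3 pattern of the first column, and `static_column(20, 5, 4)` gives five 0s, five 1s, five 2s, and five 3s, as in the second column.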

private vs firstprivate vs lastprivate

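private variables are not initialized: they start with indeterminate values, like any other local automatic variable.

firstprivate initializes each private copy with the value the variable had before the region.

lastprivate copies a value back to the original variable after the region. "Last" here does not mean the thread that actually finishes last, but the sequentially last loop iteration (or the lexically last section), whichever thread runs it. Seen another way, a value saved from an arbitrary thread would be meaningless.
firstprivate and lastprivate are just special cases of private.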

#pragma omp parallel
{
   #pragma omp for lastprivate(i)
      for (i=0; i<n-1; i++)
         a[i] = b[i] + b[i+1];
}
a[i]=b[i];
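A compact sketch of the two clauses, with invented function names `add_offset` and `last_square`. Both functions give the same answer with or without OpenMP enabled, which is the point of lastprivate: the value that survives is the one from the sequentially last iteration.

```cpp
#include <cassert>
#include <vector>

// firstprivate: every thread's private `offset` starts with the outer value.
std::vector<int> add_offset(const std::vector<int>& v, int offset) {
    std::vector<int> out(v.size());
    const long n = static_cast<long>(v.size());
#pragma omp parallel for firstprivate(offset)
    for (long i = 0; i < n; ++i) {
        out[i] = v[i] + offset;
    }
    return out;
}

// lastprivate: after the loop, `last` holds the value written by the
// sequentially last iteration (i == n - 1), whichever thread ran it.
long last_square(long n) {
    assert(n > 0);  // with zero iterations there is no "last" value to copy out
    long last = -1;
#pragma omp parallel for lastprivate(last)
    for (long i = 0; i < n; ++i) {
        last = i * i;
    }
    return last;
}
```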

private vs threadprivate

A private variable is local to a region and will most of the time be placed on the stack. The variable stays private for the duration of the construct that carries the data-scoping clause. Every thread (including the master thread) makes a private copy of the original variable (the new variable is no longer storage-associated with the original variable).

A threadprivate variable, on the other hand, will most likely be placed on the heap or in thread-local storage (which can be seen as global memory local to a thread). A threadprivate variable persists across regions (subject to some restrictions). The master thread uses the original variable, and all other threads make a private copy of it (the master's variable is still storage-associated with the original variable).

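task directive

Can a given task be made to run on a specific thread?

Difference between the sections and for directives

Put simply, sections is the unrolled form of for. It suits a small number of "tasks" that have no iteration relationship between them. Each section is executed by one thread.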
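A minimal sketch of that use case: three unrelated computations over the same data, each placed in its own section. The struct `Stats` and the function `summarize` are hypothetical names for this example. Each section writes a different field, so the sections do not race with each other.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

struct Stats {
    int min;
    int max;
    long sum;
};

// Three independent "tasks", one per section; with OpenMP enabled each
// section may run on a different thread.
Stats summarize(const std::vector<int>& v) {
    assert(!v.empty());
    Stats s{v[0], v[0], 0};
#pragma omp parallel sections
    {
#pragma omp section
        { s.min = *std::min_element(v.begin(), v.end()); }
#pragma omp section
        { s.max = *std::max_element(v.begin(), v.end()); }
#pragma omp section
        { s.sum = std::accumulate(v.begin(), v.end(), 0L); }
    }
    return s;
}
```

A for directive would fit better if the work were many iterations of the same computation; sections fits here because the three pieces of work differ.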

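Commonly used functions

omp_get_thread_num() // Returns the number (ID) of the calling thread. Outside a parallel region it returns the master thread's ID, which is 0.
omp_get_num_threads/omp_set_num_threads()  // Get/set the number of threads. omp_set_num_threads overrides the OMP_NUM_THREADS environment variable and takes effect only when called from a serial region. omp_get_num_threads returns the number of threads in the current team; it is normally called inside a parallel region and returns 1 in a serial region.
omp_get_max_threads() // Returns an upper bound on the number of threads that a parallel region without a num_threads clause would use.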

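Environment variables

OMP_SCHEDULE: applies only to for and parallel for loops that use schedule(runtime). Its value sets the schedule kind and, optionally, the chunk size (for example "dynamic,4").
OMP_NUM_THREADS: sets the maximum number of threads used during execution.
OMP_DYNAMIC: set to TRUE or FALSE to decide whether the runtime may adjust the number of threads of a parallel region dynamically.
OMP_NESTED: decides whether nested parallelism is enabled.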

Example


#include <omp.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
	// Serial region: the thread ID is 0 and the team size is 1.
	printf("ID: %d, Max threads: %d, Num threads: %d \n",omp_get_thread_num(), omp_get_max_threads(), omp_get_num_threads());
	omp_set_num_threads(5);
	printf("ID: %d, Max threads: %d, Num threads: %d \n",omp_get_thread_num(), omp_get_max_threads(), omp_get_num_threads());

#pragma omp parallel num_threads(5)
	{
		// omp_set_num_threads(6);	// Do not call it in parallel region
		printf("ID: %d, Max threads: %d, Num threads: %d \n",omp_get_thread_num(), omp_get_max_threads(), omp_get_num_threads());
	}

	printf("ID: %d, Max threads: %d, Num threads: %d \n",omp_get_thread_num(), omp_get_max_threads(), omp_get_num_threads());

	omp_set_num_threads(6);
	printf("ID: %d, Max threads: %d, Num threads: %d \n",omp_get_thread_num(), omp_get_max_threads(), omp_get_num_threads());

	return 0;
}

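OpenMP and pthreads are common models

♦ OpenMP provides convenient features for loop-level parallelism. Threads are created and managed by the compiler according to the user's directives.

♦ pthreads offers a more complex and more dynamic approach. Threads are created and managed explicitly by the user.

Further study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, rants

The relationship between clauses and directives is still unclear to me.

References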

https://blog.csdn.net/gengshenghong/article/details/7004594

https://docs.microsoft.com/en-us/cpp/parallel/openmp/reference/openmp-clauses?view=msvc-160

Posted 2021-08-05 | Updated 2025-01-30 | Tutorials | 3 minutes read (About 383 words)

OpenMP Reductions

What to do about racing writes

critical section

The simplest solution is to eliminate the race by declaring a critical section.

double result = 0;
#pragma omp parallel num_threads(ndata)
{
  double local_result = 0;  // stays 0 for threads other than 0, 1, 2
  int num = omp_get_thread_num();
  if (num==0)      local_result = f(x);
  else if (num==1) local_result = g(x);
  else if (num==2) local_result = h(x);
#pragma omp critical
  result += local_result;
}

double result = 0;
#pragma omp parallel
{
  double local_result;
#pragma omp for
  for (int i=0; i<N; i++) {
    local_result = f(x,i);
#pragma omp critical
    result += local_result;
  } // end of for loop
}

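Atomic operations / locks

Performance is poor: the updates are serialized.

#pragma omp atomic
pi += sum;

static omp_lock_t lock;
void omp_init_lock(omp_lock_t*): initializes the lock
void omp_destroy_lock(omp_lock_t*): destroys the lock
void omp_set_lock(omp_lock_t*): acquires the lock, blocking until it is available
void omp_unset_lock(omp_lock_t*): releases the lock
int omp_test_lock(omp_lock_t*): tries to acquire the lock without blocking; returns nonzero on success and zero otherwise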

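reduction clause

Adding it to an omp parallel region has the following effect.

OpenMP makes a copy of the reduction variable for each thread, initialized to the identity of the reduction operator, for example $1$ for multiplication.
Each thread then reduces into its own local copy.
At the end of the parallel region, the local results are combined, again with the reduction operator, into the global variable.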
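The three steps can be written out by hand and compared with the clause. This is only a sketch of the idea, not how any particular runtime implements reductions; the function names `product_reduction` and `product_manual` are invented for the example.

```cpp
#include <cassert>
#include <vector>

// Product of all elements using the reduction clause.
double product_reduction(const std::vector<double>& v) {
    double prod = 1.0;
    const long n = static_cast<long>(v.size());
#pragma omp parallel for reduction(* : prod)
    for (long i = 0; i < n; ++i) {
        prod *= v[i];
    }
    return prod;
}

// The same computation with the three steps spelled out.
double product_manual(const std::vector<double>& v) {
    double prod = 1.0;
    const long n = static_cast<long>(v.size());
#pragma omp parallel
    {
        double local = 1.0;  // step 1: private copy, identity of '*'
#pragma omp for
        for (long i = 0; i < n; ++i) {
            local *= v[i];   // step 2: reduce into the local copy
        }
#pragma omp critical
        prod *= local;       // step 3: combine into the global variable
    }
    return prod;
}
```

The critical section in the manual version is entered once per thread, not once per element, which is why this pattern is far cheaper than protecting every update.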

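Multiple variables

reduction(+:x,y,z)
reduction(+:array[:])

Complex structures

If the code is too complex, it is better to implement the reduction by hand: give each thread its own copy of the global variable and merge the copies at the end.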

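// Inefficient example (false sharing)
double result,local_results[3];
#pragma omp parallel
{
  int num = omp_get_thread_num();
  if (num==0)      local_results[num] = f(x);
  else if (num==1) local_results[num] = g(x);
  else if (num==2) local_results[num] = h(x);
}
result = local_results[0]+local_results[1]+local_results[2];

Although the code above is correct, it may be inefficient because of a phenomenon called false sharing. Even though the threads write to different variables, those variables may lie on the same cache line. The cores then waste a lot of time and bandwidth updating each other's copy of that cache line.

False sharing can be prevented by giving each thread its own cache line.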

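// Not the best: each row is padded to 8 doubles (64 bytes)
double result,local_results[3][8];
#pragma omp parallel
{
  int num = omp_get_thread_num();
  if (num==0)      local_results[num][1] = f(x);
// et cetera
}

The best approach is to give each thread a truly local variable and sum these variables at the end in a critical section.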

double result = 0;
#pragma omp parallel
{
  double local_result;
  local_result = .....
#pragma omp critical
  result += local_result;
}

Built-in reduction operators

Arithmetic reductions: $+,*,-,\max,\min$

Logical operator reductions in C: & && | || ^

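Initial values of reduction variables

The initial values are mostly self-evident, such as 0 for addition and 1 for multiplication. For min and max they are, respectively, the largest and the smallest representable value of the type.

Declaring and using user-defined reductions

The syntax is as follows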

#pragma omp declare reduction
    ( identifier : typelist : combiner )
    [initializer(initializer-expression)]

Example 1: maximum of int values

int mymax(int r,int n) {
// r is the already reduced value
// n is the new value
  int m;
  if (n>r) {
    m = n;
  } else {
    m = r;
  }
  return m;
}
#pragma omp declare reduction \
  (rwz:int:omp_out=mymax(omp_out,omp_in)) \
  initializer(omp_priv=INT_MIN)
  m = INT_MIN;
#pragma omp parallel for reduction(rwz:m)
  for (int idata=0; idata<ndata; idata++)
    m = mymax(m,data[idata]);

Floating-point subtraction reductions in OpenMP lose precision

How to reduce over a vector

Summation

#include <algorithm>
#include <vector>

#pragma omp declare reduction(vec_float_plus : std::vector<float> : \
                              std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), omp_out.begin(), std::plus<float>())) \
                    initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))

std::vector<float> res(n,0);
#pragma omp parallel for reduction(vec_float_plus : res)
for(size_t i=0; i<m; i++){
    res[...] += ...;
}

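Edit: the original initializer was simple: initializer(omp_priv = omp_orig). However, if the original copy is not all zeros, the result will be wrong. I therefore suggest the more complex initializer, which always creates a vector of zero elements.

Maximum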

#pragma omp declare reduction(vec_double_max : std::vector<double> : \
                          std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), omp_out.begin(), [](double a, double b) {return std::max(a,b);}))     \
                    initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))

#pragma omp parallel for reduction(vec_double_max:maxlab)
for( int i = 0; i < sz; i++ )
{
   maxlab[klabels[i]] = max(maxlab[klabels[i]],distlab[i]);
}

std::transform

Applies the given operation over the specified range and stores the result in another specified range.
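The combiner used in the vector reductions above is the two-input form of std::transform, with the output range equal to the first input range. A small sketch, using the made-up helper name `add_into`:

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Element-wise in-place sum: out[i] = out[i] + in[i] for every i.
void add_into(std::vector<float>& out, const std::vector<float>& in) {
    assert(out.size() == in.size());
    std::transform(out.begin(), out.end(),  // first input range
                   in.begin(),              // start of second input range
                   out.begin(),             // destination (overwrites `out`)
                   std::plus<float>());     // operation applied to each pair
}
```

In the reduction declarations, omp_out plays the role of `out` and omp_in the role of `in`.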

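Further study needed

Reduction over a vector
泥菩萨:
Change it this way, compile with -g, and look at the assembly in VTune.

泥菩萨:
Check whether there are vmm instructions.

Problems encountered

None yet

Motivation, summary, reflections, rants

While working on IPCC I found that OpenMP is not as simple as I had imagined.

References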

https://stackoverflow.com/questions/43168661/openmp-and-reduction-on-stdvector

https://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-reduction.html

http://www.cplusplus.com/forum/general/201500/
