CUDA C Programming notes

TODO

Using cudaStream to overlap computation with memory copies

In brief: it is not the case that each stream maps to its own queue and the two queues then look for overlap opportunities.
Always keep the actual hardware model in mind: there is one copy engine (newer devices may have two, one for host-to-device and one for device-to-host) and one kernel engine. Each engine schedules operations in the order they were added to the streams. The strategy can therefore be summarized as issuing operations across the different streams in breadth-first order.
CUDA by Example, Chapter 10. Jason Sanders, Edward Kandrot.
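A minimal sketch of the breadth-first issue order described above, using two streams (the `scale` kernel, buffer names, and sizes are illustrative, not from the book):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *d, int n) {              // illustrative kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);
    float *h[2], *d[2];
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) {
        cudaMallocHost(&h[i], bytes);  // pinned host memory, required for truly async copies
        cudaMalloc(&d[i], bytes);
        cudaStreamCreate(&s[i]);
    }
    // Breadth-first issue order: enqueue each *type* of operation across all
    // streams before moving on. The copy engine then sees H2D[0], H2D[1], ...
    // in order, and can overlap H2D[1] with the kernel engine running kernel[0].
    for (int i = 0; i < 2; ++i)
        cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]);
    for (int i = 0; i < 2; ++i)
        scale<<<(N + 255) / 256, 256, 0, s[i]>>>(d[i], N);
    for (int i = 0; i < 2; ++i)
        cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, s[i]);
    cudaDeviceSynchronize();
    for (int i = 0; i < 2; ++i) {
        cudaStreamDestroy(s[i]);
        cudaFree(d[i]);
        cudaFreeHost(h[i]);
    }
    return 0;
}
```

A depth-first order (copy, kernel, copy for stream 0, then the same for stream 1) would serialize on the single copy engine; interleaving by operation type is what exposes the overlap.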

MPI CUDA Mixed Programming


CONCURRENT KERNELS
CONCURRENT KERNELS II: BATCHED LIBRARY CALLS
Parallel CUDA
nice notes

Register Level

Register Cache: Caching for Warp-Centric CUDA Programs

Warp Level

Using CUDA Warp-Level Primitives
CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics

Warp-synchronous programming with Cooperative Groups

Cooperative Groups
GTC17-Cooperative Groups
Cooperative Groups: Flexible CUDA Thread Programming
Collective computations such as reduce and scan require threads to cooperate in sharing data, and those threads must also synchronize. Knowing only __syncthreads() is not enough, because it synchronizes every thread in the thread block. What if we want to synchronize within a single warp (typically 32 threads), or at an even finer granularity?

The Cooperative Groups programming model makes synchronization possible among threads at scopes ranging from intra-block down to tiles, and up to inter-block.

thread_block block = this_thread_block();
thread_group tile32 = tiled_partition(block, 32);   // partition the block into warps (assumed setup; the pasted snippet started at tile4)
thread_group tile4  = tiled_partition(tile32, 4);   // partition into groups of four threads
if (tile4.thread_rank() == 0)
    printf("Hello from tile4 rank 0: %d\n",
           this_thread_block().thread_rank());
/**
 * thread_rank() within each group runs from 0 to size()-1, here 0-3.
 * The rank within the thread block is a multiple of 4, so the output is:
 * Hello from tile4 rank 0: 0
 * Hello from tile4 rank 0: 4
 * Hello from tile4 rank 0: 8
 * ...
 */
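Tiles are also where the finer-grained synchronization pays off. As a sketch (the kernel and helper names are illustrative), a warp-level sum reduction using the templated tiled_partition<32>, whose shuffle and sync operations are collective over just the 32-thread tile rather than the whole block:

```cuda
#include <cstdio>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Reduce a value across a statically sized tile. shfl_down() on a tile is
// collective only over that tile's threads, so each warp can reduce
// independently with no block-wide __syncthreads().
template <int TILE_SIZE>
__device__ int tile_sum(cg::thread_block_tile<TILE_SIZE> g, int val) {
    for (int offset = g.size() / 2; offset > 0; offset /= 2)
        val += g.shfl_down(val, offset);
    return val;  // only thread_rank() == 0 holds the full tile sum
}

__global__ void block_sum(int *out) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
    int partial = tile_sum<32>(warp, (int)block.thread_rank());
    if (warp.thread_rank() == 0)
        atomicAdd(out, partial);   // one atomic per warp, not per thread
}
```

The static tile size lets the compiler lower the shuffle loop to warp intrinsics; the dynamic tiled_partition(block, 32) form used above supports sync() but not the shuffle collectives.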