class: center, middle
name:opening

## Lecture 20.1: CUDA Execution Model
## Threads, Grids, and Data
.center[
Randal Burns

[Parallel Programming EN.601.\[3|4\]20](http://parallel.cs.jhu.edu)
[Department of Computer Science, Johns Hopkins University](http://www.cs.jhu.edu/)
24 April 2017
these slides:
]

---
### Originally Prepared by:
#### Matthew Bolitho
_Then_: PhD student, Computer Graphics Lab, Johns Hopkins University
_Later_: Director of Architecture, NVIDIA
_Now_: ???
The presentation has been heavily modified as technology has evolved.

---

### CUDA Thread
A CUDA device is a highly parallel processor that can execute hundreds or thousands of _threads_ in parallel.

* Create >1 thread per stream processor

A CUDA thread is a different abstraction than an OS thread:

* Startup and switching costs are __very__ low. Assume 0.
* Good performance comes from creating many threads, typically one for every element of data in a matrix/grid/array.
* Change your mindset from OS threads.

---

### CUDA Thread (Data) Decomposition
* Programs run on 1D, 2D, or 3D __grids__
* A grid is decomposed into __thread blocks__
* A thread block runs on a CUDA SM
* Each thread block contains __threads__
* Each CUDA core runs a thread per cycle
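A sketch of how such a decomposition is configured on the host (the 1024x1024 size, the 16x16 block, and the kernel name are illustrative assumptions, not from the original slides; the launch syntax is covered in Lecture 20.3):

```CUDA
// One thread per element of a hypothetical 1024x1024 matrix
dim3 threadsPerBlock(16, 16);                        // 256 threads per thread block
dim3 blocksPerGrid(1024 / 16, 1024 / 16);            // 64 x 64 thread blocks cover the matrix
MyKernel<<<blocksPerGrid, threadsPerBlock>>>(pData); // hypothetical kernel and data pointer
```

---

### CUDA Thread (Data) Decomposition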
.center[
]

---

### CUDA Thread (Data) Decomposition
Every CUDA program consists of a grid of __threads__

* It is typical to map threads to elements of an array to derive a data decomposition
* This is good practice because it produces aligned, sequential memory accesses
* It is a practice, not mandatory. The program is defined by the threads.

.center[
]

---

### CUDA Kernel
A __kernel__ is the program that runs on a single thread

* In standard decompositions, a thread runs a kernel on a data element

---

### CUDA Thread Blocks
A thread block may have up to 1024 threads.

* All threads in a thread block run on the same SM.
* They can communicate via shared memory and synchronize.

This implies:

* threads in different blocks cannot synchronize
* threads in different blocks cannot communicate

Threads in different blocks run at different times in different places.

---

### CUDA Thread Block Scheduling
.center[
]

---

### Warps
Warps are the fundamental scheduling unit of the SM

* Warps are groups of 32 threads
* Two warp schedulers launch 64 threads per SM
* Each 32-thread warp forms a SIMD group

Thread blocks are not SIMD; warps are!

* Warps don't need to synchronize. They run in SIMD lockstep.
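Inside a kernel, a thread's warp and SIMD lane follow directly from the built-in variables (a sketch for a 1D thread block; warpSize is a built-in variable equal to 32 on current hardware):

```CUDA
unsigned int tid  = threadIdx.x;
unsigned int warp = tid / warpSize;   // which warp within the thread block
unsigned int lane = tid % warpSize;   // which SIMD lane within the warp;
                                      // threads in the same warp run in lockstep
```

---

### Warps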
.center[
]

---

### Execution Hierarchy
Decomposition corresponds to levels of abstraction:

* A grid is a program; it maps to the whole GPU
* A thread block is a portion of the program; it runs on an SM
* A warp is a SIMD group inside a thread block
* A thread runs a kernel on a CUDA core

This hierarchy dictates concurrency/synchronization and data sharing

* There is (no surprise) a corresponding memory hierarchy

---

class: center, middle
name:opening

## Lecture 20.2: CUDA Memory
.center[
Randal Burns

[Parallel Programming EN.601.\[3|4\]20](http://parallel.cs.jhu.edu)
[Department of Computer Science, Johns Hopkins University](http://www.cs.jhu.edu/)
24 April 2017
these slides:
]

---

### CUDA Memory Hierarchy
.center[
]

---

class: split-40

### CUDA Memory Hierarchy
.column[
]

.column[
Memory type and sharing:

* register: thread
* shared memory: thread block
* device memory: CPU (copy) and GPU
]

---

### CUDA registers
Large register file

* 256 KB of registers per SM (64 cores)
* 128 KB per warp scheduler (32,768 * 32 bits)
* 1024 registers per core

Registers are used by the compiled program as the target of load/store data accesses

* zero-latency access

---

### CUDA Shared Memory
A modest-sized region (48-64 KB) of low-latency memory addressable by all CUDA cores in an SM

Used by kernels:

* threads load data in parallel from device memory to shared memory
* threads share and reuse the data from shared memory during computation
* turns 100+ cycle memory accesses into 0-latency accesses
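A minimal sketch of this pattern (hypothetical kernel, not from the original slides): each block stages a tile of the input in shared memory, synchronizes, then every thread reads its neighbors from the tile instead of from device memory.

```CUDA
#define BLOCKSIZE 256   // launch with blockDim.x == BLOCKSIZE

__global__ void AdjacentAverageKernel(const float* pIn, float* pOut)
{
    __shared__ float tile[BLOCKSIZE];               // per-block shared memory

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = pIn[i];                     // one coalesced device-memory load
    __syncthreads();                                // wait until the whole tile is staged

    // Reuse neighbors from shared memory rather than re-reading device
    // memory (block-edge elements omitted for brevity)
    if (threadIdx.x > 0 && threadIdx.x < BLOCKSIZE - 1)
        pOut[i] = (tile[threadIdx.x - 1] + tile[threadIdx.x] + tile[threadIdx.x + 1]) / 3.0f;
}
```

---

### CUDA memory principles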
Compared to a CPU, CUDA has a large amount of programmable memory

* big register file
* per-SM shared memory

The CUDA compiler (registers) and kernel (shared memory) choose what data goes into these regions.

And it supports only small amounts of managed memory

* 16 KB L1 (per SM)
* 4 MB of L2 (for the whole GPU)

This simplifies the hardware in order to dedicate more transistors to computation.

---

### How ineffective is the managed cache?

How long does it take to fill the:

* 64 KB L1 cache on Pascal
* 4 MB L2 cache on Pascal

Approach 1: divide capacity by memory bandwidth (732 GB/s)

* L1: 87 nsec
* L2: 9 usec

Approach 2: cycle counting

* L1: 64 cores * 4 bytes/core -> 250 cycles
* L2: 3584 cores * 4 bytes/core -> 276 cycles

---

class: center, middle
name:opening

## Lecture 20.3: CUDA Language
## A First Program
.center[
Randal Burns

[Parallel Programming EN.601.\[3|4\]20](http://parallel.cs.jhu.edu)
[Department of Computer Science, Johns Hopkins University](http://www.cs.jhu.edu/)
24 April 2017
these slides:
]

---

### CUDA Language
CUDA defines a language that is similar to C/C++

* Mix of device and host code

Important differences for device code:

* No C/C++ runtime libraries (printf, malloc, fread)
* most math functions have device equivalents
* No stack: all functions are inlined, no recursion, no function pointers
* No classes; structs and unions as per C; some templates

---

### CUDA Language Extensions
CUDA-specific functions

* \_\_syncthreads()
* atomicAdd(), atomicCAS(), atomicMin()
* execution configuration

Syntactic extensions

* Declaration specifiers
* Built-in variables
* Built-in types
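A small sketch (hypothetical kernel, not from the original slides) showing several of these extensions together:

```CUDA
// Declaration specifier: a device-only helper function
__device__ float Scale(float x) { return 2.0f * x; }

__global__ void CountLargeKernel(const float* pIn, int* pCount)
{
    // Declaration specifier: per-block shared memory (assumes a block size of 256)
    __shared__ float buffer[256];

    // Built-in variables, accessed through built-in vector types (.x, .y, .z)
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    buffer[threadIdx.x] = Scale(pIn[i]);
    __syncthreads();                  // CUDA-specific synchronization function

    if (buffer[threadIdx.x] > 1.0f)
        atomicAdd(pCount, 1);         // atomic function
}
```

---

### First CUDA program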
We will create a simple CUDA program to add two vectors

U = {u0, u1, … un}

V = {v0, v1, … vn}

W = U + V = {u0 + v0, u1 + v1, … un + vn}
.center[
]

---

### Threads
Each kernel (thread) sums two elements to produce an output element
.center[
]

---

### Blocks and Grid
The threads define the blocks and grid, but it is typical to visualize the blocks and grid as the implied data decomposition.
.center[
]
.center[
]

---

### Kernel Function (Device Code)
```CUDA
__global__ void VectorAdditionKernel(
    const float* pVectorA,
    const float* pVectorB,
    float* pVectorC)
{
    // Global index of this thread: one thread per vector element
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    pVectorC[i] = pVectorA[i] + pVectorB[i];
}
```
The \_\_global\_\_ declaration specifier indicates that the function runs on the device and is called from the host.

blockIdx, blockDim and threadIdx are built-in variables that identify the thread running each kernel instance. For example, with blockDim.x = 512, thread 3 of block 2 computes i = 2 * 512 + 3 = 1027.

* known as vector types
* .x, .y, .z access the dimensions

---

### Recall the GPU Compute Workflow
.center[
]

---

### Host Code (configure grid)
```
bool VectorAddition(
    unsigned N,
    const float* pHostVectorA,
    const float* pHostVectorB,
    float* pHostVectorC)
{
    const unsigned BLOCKSIZE = 512;

    unsigned ThreadCount = N;
    unsigned BlockCount  = N / BLOCKSIZE;   // assumes N is a multiple of BLOCKSIZE
    unsigned VectorSize  = ThreadCount * sizeof(float);
    ...
```
Entry point on the host.
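Note that N / BLOCKSIZE truncates, so the code above assumes N is a multiple of BLOCKSIZE. For general N, a common pattern (a sketch, not part of the original slides) rounds the block count up and guards the kernel:

```CUDA
// Host: round up so the grid covers all N elements
unsigned BlockCount = (N + BLOCKSIZE - 1) / BLOCKSIZE;

// Kernel: guard against the extra threads in the last block
// (this requires passing N to the kernel)
if (i < N)
    pVectorC[i] = pVectorA[i] + pVectorB[i];
```

---

### Host Code (setup data)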
```
...
float* pDeviceVectorA = 0;
float* pDeviceVectorB = 0;
float* pDeviceVectorC = 0;

cudaMalloc((void**)&pDeviceVectorA, VectorSize);
cudaMalloc((void**)&pDeviceVectorB, VectorSize);
cudaMalloc((void**)&pDeviceVectorC, VectorSize);

cudaMemcpy(pDeviceVectorA, pHostVectorA, VectorSize, cudaMemcpyHostToDevice);
cudaMemcpy(pDeviceVectorB, pHostVectorB, VectorSize, cudaMemcpyHostToDevice);
...
```
Set up the contents of memory on the device:

* allocate arrays on the device
* move input data to device memory

---

### Host Code (launch kernel)
```
...
VectorAdditionKernel<<<BlockCount, BLOCKSIZE>>>(
    pDeviceVectorA,
    pDeviceVectorB,
    pDeviceVectorC);
...
```
CUDA provides the <<<...>>> syntax to launch a kernel:

* specify the grid (block count and block size)
* define shared memory

---

### Host Code (return results to host)
```
...
cudaMemcpy(pHostVectorC, pDeviceVectorC, VectorSize, cudaMemcpyDeviceToHost);
...
```
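A complete version would also release the device allocations and check for errors (a sketch; not shown on the original slide):

```CUDA
// Release the device allocations once the result is back on the host
cudaFree(pDeviceVectorA);
cudaFree(pDeviceVectorB);
cudaFree(pDeviceVectorC);

// Every CUDA call returns a cudaError_t worth checking
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    return false;   // cudaGetErrorString(err) gives a readable message
```

---

### A First Code: So What?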
The GPU is faster than the CPU:

* but you have to move data to the GPU
* must amortize the cost of transfer with enough compute

No benefit for vector addition

* computation is not dense enough: each thread does one add per three memory accesses
.center[
]