class: center, middle
name:opening

## Lecture 20.1: CUDA Execution Model
## Threads, Grids, and Data
.center[
Randal Burns

[Parallel Programming EN.601.\[3|4\]20](http://parallel.cs.jhu.edu)
[Department of Computer Science, Johns Hopkins University](http://www.cs.jhu.edu/)
24 April 2017
these slides:
]

---
### Originally Prepared by:
#### Matthew Bolitho
_Then_: PhD student, Computer Graphics Lab, Johns Hopkins University
_Later_: Director of Architecture, NVIDIA
_Now_: ???
The presentation has been heavily modified as technology has evolved.

---

### CUDA Thread
A CUDA device is a highly parallel processor that can execute hundreds or thousands of _threads_ in parallel.

* Create >1 thread per stream processor

A CUDA thread is a different abstraction than an OS thread:

* Startup and switching costs are __very__ low. Assume 0.
* Good performance comes from creating many threads, typically one for every element of data in a matrix/grid/array.
* Change your mindset from OS threads.

---

### CUDA Thread (Data) Decomposition
* Programs run on 1D, 2D, or 3D __grids__
* A grid is decomposed into __thread blocks__
* A thread block runs on a CUDA SM
* Each thread block contains __threads__
* Each CUDA core runs a thread per cycle
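A sketch of how such a decomposition is configured on the host (the 1024x1024 size, the 16x16 block, and the kernel name are illustrative assumptions, not from the original slides; the launch syntax is covered in Lecture 20.3):

```CUDA
// One thread per element of a hypothetical 1024x1024 matrix
dim3 threadsPerBlock(16, 16);                        // 256 threads per thread block
dim3 blocksPerGrid(1024 / 16, 1024 / 16);            // 64 x 64 thread blocks cover the matrix
MyKernel<<<blocksPerGrid, threadsPerBlock>>>(pData); // hypothetical kernel and data pointer
```

---

### CUDA Thread (Data) Decomposition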
.center[
]

---

### CUDA Thread (Data) Decomposition
Every CUDA program consists of a grid of __threads__

* It is typical to map threads to elements of an array to derive a data decomposition
* This is good practice because it produces aligned, sequential memory accesses
* It is a practice, not mandatory. The program is defined by the threads.

.center[
]

---

### CUDA Kernel
A __kernel__ is the program that runs on a single thread

* In standard decompositions, a thread runs a kernel on a data element

---

### CUDA Thread Blocks
A thread block may have up to 1024 threads.

* All threads in a thread block run on the same SM.
* They can communicate via shared memory and synchronize.

This implies:

* threads in different blocks cannot synchronize
* threads in different blocks cannot communicate

Threads in different blocks run at different times in different places.

---

### CUDA Thread Block Scheduling
.center[
]

---

### Warps
Warps are the fundamental scheduling unit of the SM

* Warps are groups of 32 threads
* Two warp schedulers launch 64 threads per SM
* Each 32-thread warp forms a SIMD group

Thread blocks are not SIMD; warps are!

* Warps don't need to synchronize. They run in SIMD lockstep.
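Inside a kernel, a thread's warp and SIMD lane follow directly from the built-in variables (a sketch for a 1D thread block; warpSize is a built-in variable equal to 32 on current hardware):

```CUDA
unsigned int tid  = threadIdx.x;
unsigned int warp = tid / warpSize;   // which warp within the thread block
unsigned int lane = tid % warpSize;   // which SIMD lane within the warp;
                                      // threads in the same warp run in lockstep
```

---

### Warps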
.center[
]

---

### Execution Hierarchy
Decomposition corresponds to levels of abstraction:

* A grid is a program; it maps to the whole GPU
* A thread block is a portion of the program; it runs on an SM
* A warp is a SIMD group inside a thread block
* A thread runs a kernel on a CUDA core

This hierarchy dictates concurrency/synchronization and data sharing

* There is (no surprise) a corresponding memory hierarchy

---

class: center, middle
name:opening

## Lecture 20.2: CUDA Memory
.center[
Randal Burns

[Parallel Programming EN.601.\[3|4\]20](http://parallel.cs.jhu.edu)
[Department of Computer Science, Johns Hopkins University](http://www.cs.jhu.edu/)
24 April 2017
these slides:
]

---

### CUDA Memory Hierarchy
.center[
]

---

class: split-40

### CUDA Memory Hierarchy
.column[
]

.column[
Memory type and sharing:

* register: thread
* shared memory: thread block
* device memory: CPU (copy) and GPU
]

---

### CUDA registers
Large register file

* 256 KB of registers per SM (64 cores)
* 128 KB per warp scheduler (32,768 * 32 bits)
* 1024 registers per core

Registers are used by the compiled program as the target of load/store data accesses

* zero-latency access

---

### CUDA Shared Memory
A modest-sized region (48-64 KB) of low-latency memory addressable by all CUDA cores in an SM

Used by kernels:

* threads load data in parallel from device memory to shared memory
* threads share and reuse the data from shared memory during computation
* turns 100+ cycle memory accesses into 0-latency accesses
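A minimal sketch of this pattern (hypothetical kernel, not from the original slides): each block stages a tile of the input in shared memory, synchronizes, then every thread reads its neighbors from the tile instead of from device memory.

```CUDA
#define BLOCKSIZE 256   // launch with blockDim.x == BLOCKSIZE

__global__ void AdjacentAverageKernel(const float* pIn, float* pOut)
{
    __shared__ float tile[BLOCKSIZE];               // per-block shared memory

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = pIn[i];                     // one coalesced device-memory load
    __syncthreads();                                // wait until the whole tile is staged

    // Reuse neighbors from shared memory rather than re-reading device
    // memory (block-edge elements omitted for brevity)
    if (threadIdx.x > 0 && threadIdx.x < BLOCKSIZE - 1)
        pOut[i] = (tile[threadIdx.x - 1] + tile[threadIdx.x] + tile[threadIdx.x + 1]) / 3.0f;
}
```

---

### CUDA memory principles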
Compared to a CPU, CUDA has a large amount of programmable memory

* big register file
* per-SM shared memory

The CUDA compiler (registers) and kernel (shared memory) choose what data goes into these regions.

And it supports only small amounts of managed memory

* 16 KB L1 (per SM)
* 4 MB of L2 (for the whole GPU)

This simplifies the hardware in order to dedicate more transistors to computation.

---

### How ineffective is the managed cache?

How long does it take to fill the:

* 64 KB L1 cache on Pascal
* 4 MB L2 cache on Pascal

Approach 1: divide capacity by memory bandwidth (732 GB/s)

* L1: 87 nsec
* L2: 9 usec

Approach 2: cycle counting

* L1: 64 cores * 4 bytes/core -> 250 cycles
* L2: 3584 cores * 4 bytes/core -> 276 cycles

---

class: center, middle
name:opening

## Lecture 20.3: CUDA Language
## A First Program
.center[
Randal Burns

[Parallel Programming EN.601.\[3|4\]20](http://parallel.cs.jhu.edu)
[Department of Computer Science, Johns Hopkins University](http://www.cs.jhu.edu/)
24 April 2017
these slides:
]

---

### CUDA Language
CUDA defines a language that is similar to C/C++

* Mix of device and host code

Important differences for device code:

* No C/C++ runtime libraries (printf, malloc, fread)
* most math functions have device equivalents
* No stack: all functions are inlined, no recursion, no function pointers
* No classes; structs and unions as per C; some templates

---

### CUDA Language Extensions
CUDA-specific functions

* \_\_syncthreads()
* atomicAdd(), atomicCAS(), atomicMin()
* execution configuration

Syntactic extensions

* Declaration specifiers
* Built-in variables
* Built-in types
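A small sketch (hypothetical kernel, not from the original slides) showing several of these extensions together:

```CUDA
// Declaration specifier: a device-only helper function
__device__ float Scale(float x) { return 2.0f * x; }

__global__ void CountLargeKernel(const float* pIn, int* pCount)
{
    // Declaration specifier: per-block shared memory (assumes a block size of 256)
    __shared__ float buffer[256];

    // Built-in variables, accessed through built-in vector types (.x, .y, .z)
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    buffer[threadIdx.x] = Scale(pIn[i]);
    __syncthreads();                  // CUDA-specific synchronization function

    if (buffer[threadIdx.x] > 1.0f)
        atomicAdd(pCount, 1);         // atomic function
}
```

---

### First CUDA program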
We will create a simple CUDA program to add two vectors

U = {u0, u1, … un}

V = {v0, v1, … vn}

W = U + V = {u0 + v0, u1 + v1, … un + vn}
.center[
]

---

### Threads
Each kernel (thread) sums two elements to produce an output element
.center[
]

---

### Blocks and Grid
The threads define the blocks and grid, but it is typical to visualize the blocks and grid as the implied data decomposition.
.center[
]
.center[
]

---

### Kernel Function (Device Code)
```CUDA
__global__ void VectorAdditionKernel(
    const float* pVectorA,
    const float* pVectorB,
    float* pVectorC)
{
    // Global index of this thread: one thread per vector element
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    pVectorC[i] = pVectorA[i] + pVectorB[i];
}
```
The \_\_global\_\_ declaration specifier indicates that the function runs on the device and is called from the host.

blockIdx, blockDim and threadIdx are built-in variables that identify the thread running each kernel instance. For example, with blockDim.x = 512, thread 3 of block 2 computes i = 2 * 512 + 3 = 1027.

* known as vector types
* .x, .y, .z access the dimensions

---

### Recall the GPU Compute Workflow
.center[
]

---

### Host Code (configure grid)
```
bool VectorAddition(
    unsigned N,
    const float* pHostVectorA,
    const float* pHostVectorB,
    float* pHostVectorC)
{
    const unsigned BLOCKSIZE = 512;

    unsigned ThreadCount = N;
    unsigned BlockCount  = N / BLOCKSIZE;   // assumes N is a multiple of BLOCKSIZE
    unsigned VectorSize  = ThreadCount * sizeof(float);
    ...
```
Entry point on the host.
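Note that N / BLOCKSIZE truncates, so the code above assumes N is a multiple of BLOCKSIZE. For general N, a common pattern (a sketch, not part of the original slides) rounds the block count up and guards the kernel:

```CUDA
// Host: round up so the grid covers all N elements
unsigned BlockCount = (N + BLOCKSIZE - 1) / BLOCKSIZE;

// Kernel: guard against the extra threads in the last block
// (this requires passing N to the kernel)
if (i < N)
    pVectorC[i] = pVectorA[i] + pVectorB[i];
```

---

### Host Code (setup data)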
```
...
float* pDeviceVectorA = 0;
float* pDeviceVectorB = 0;
float* pDeviceVectorC = 0;

cudaMalloc((void**)&pDeviceVectorA, VectorSize);
cudaMalloc((void**)&pDeviceVectorB, VectorSize);
cudaMalloc((void**)&pDeviceVectorC, VectorSize);

cudaMemcpy(pDeviceVectorA, pHostVectorA, VectorSize, cudaMemcpyHostToDevice);
cudaMemcpy(pDeviceVectorB, pHostVectorB, VectorSize, cudaMemcpyHostToDevice);
...
```
Set up the contents of memory on the device:

* allocate arrays on the device
* move input data to device memory

---

### Host Code (launch kernel)
```
...
VectorAdditionKernel<<<BlockCount, BLOCKSIZE>>>(
    pDeviceVectorA,
    pDeviceVectorB,
    pDeviceVectorC);
...
```
CUDA provides the <<<...>>> syntax to launch a kernel:

* specify the grid (block count and block size)
* define shared memory

---

### Host Code (return results to host)
```
...
cudaMemcpy(pHostVectorC, pDeviceVectorC, VectorSize, cudaMemcpyDeviceToHost);
...
```
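A complete version would also release the device allocations and check for errors (a sketch; not shown on the original slide):

```CUDA
// Release the device allocations once the result is back on the host
cudaFree(pDeviceVectorA);
cudaFree(pDeviceVectorB);
cudaFree(pDeviceVectorC);

// Every CUDA call returns a cudaError_t worth checking
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    return false;   // cudaGetErrorString(err) gives a readable message
```

---

### A First Code: So What?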
The GPU is faster than the CPU:

* but you have to move data to the GPU
* must amortize the cost of transfer with enough compute

No benefit for vector addition

* computation is not dense enough: each thread does one add per three memory accesses
.center[
]