class: center, middle name:opening ## Lecture 19.1: General Purpose Computing on the GPU
.center[ Randal Burns
[Parallel Programming EN.601.\[3|4\]20](http://parallel.cs.jhu.edu) [Department of Computer Science, Johns Hopkins University](http://www.cs.jhu.edu/) 19 April 2017 these slides:
] ---
### Originally Prepared by:
#### Matthew Bolitho
_Then_: PhD student, Computer Graphics Lab, Johns Hopkins University _Later_: Director of Architecture, NVIDIA _Now_: ???
The presentation has been heavily modified as the technology has evolved --- ### Graphics Processors
What is a GPU? - Specialized hardware used for rendering 3D graphics
.center[
] --- ### Graphics Processors
- GPUs are highly parallel processors - During rendering, vertices and pixels can be processed in parallel - GPUs are programmable - Traditionally through graphics pipeline (e.g., OpenGL) --- ### Why use GPUs for compute?
Processing trends .center[
] --- ### Why use GPUs for compute?
.center[
] --- ### Why use GPUs for compute?
Memory bandwidth .center[
] --- ### Why use GPUs for compute?
Speedup
.center[
] --- ### GPU Computing 101
.center[
] --- ### Why so fast? - Designed for math-intensive, highly parallel computation - More transistors dedicated to ALUs/FPUs than to flow control and data caches .center[
] --- ### At what cost?
Programs must be more predictable: - Data access coherency - Non-sequential, non-aligned memory accesses are slow; no hierarchical caching support - Program flow - Branching code (loops and conditionals) is costly - Short pipelines and limited branch prediction
--- ## Lecture 19.2: GPU Architecture
.center[ Randal Burns
[Parallel Programming EN.601.\[3|4\]20](http://parallel.cs.jhu.edu) [Department of Computer Science, Johns Hopkins University](http://www.cs.jhu.edu/) 19 April 2017 these slides:
] --- ### NVIDIA GPU Architecture
- Fundamental unit is the CUDA core - Integer arithmetic logic unit (ALU) - Double-precision floating-point unit (FPU) - Fused multiply-add instruction, fully pipelined --- ### CUDA Core
.center[
] --- ### NVIDIA GPU Architecture
- Fundamental unit is the CUDA core - Integer arithmetic logic unit (ALU) - Double-precision floating-point unit (FPU) - Fused multiply-add instruction, fully pipelined - CUDA cores are grouped into streaming multiprocessors (SMs) - SMs run in SIMD lockstep --- ### Pascal SM (Streaming Multiprocessor)
.center[
] --- ### NVIDIA GPU Architecture
- Fundamental unit is the CUDA core - Integer arithmetic logic unit (ALU) - Double-precision floating-point unit (FPU) - Fused multiply-add instruction, fully pipelined - CUDA cores are grouped into streaming multiprocessors (SMs) - SMs run in SIMD lockstep - Multiple SMs are assembled into a GPC (graphics processing cluster) - think of SMs as the "cores" of the GPU processor - independent execution and scheduling - Multiple GPCs make up a GPU --- ### Pascal Architecture
.center[
] --- ### Intel Haswell for Comparison
Much of the die real estate is not dedicated to processing .center[
] --- ### GPU Stats
Total of 3840 CUDA cores - 64 CUDA cores per SM - 10 SMs per GPC - 6 GPCs Memory bandwidth - 732 GB/s Aggregate processing - 10608 GFlops, or 10+ TFlops --- ### What's new in Pascal?
Convergence of GPUs and CPUs continues - NVLink technology to coordinate multi-GPU programs
--- ### What's new in Pascal?
Extends the path from graphics to compute - Double-precision floating point - Concurrent program execution and lightweight context switching - ECC memory and some caching Why were these features not needed for graphics?