class: center, middle name:opening ## Lecture 19.1: General Purpose Computing on the GPU
.center[ Randal Burns
[Parallel Programming EN.601.\[3|4\]20](http://parallel.cs.jhu.edu) [Department of Computer Science, Johns Hopkins University](http://www.cs.jhu.edu/) 19 April 2017 these slides:
] ---
### Originally Prepared by:
#### Matthew Bolitho
_Then_: PhD student, Computer Graphics Lab, Johns Hopkins University _Later_: Director of Architecture, NVIDIA _Now_: ???
The presentation has been heavily modified as the technology has evolved --- ### Graphics Processors
What is a GPU? - Specialized hardware used for rendering 3D graphics
.center[
] --- ### Graphics Processors
- GPUs are highly parallel processors - During rendering, vertices and pixels can be processed in parallel - GPUs are programmable - Traditionally through graphics pipeline (e.g., OpenGL) --- ### Why use GPUs for compute?
Processing trends .center[
] --- ### Why use GPUs for compute?
.center[
] --- ### Why use GPUs for compute?
Memory bandwidth .center[
] --- ### Why use GPUs for compute?
Speedup
.center[
] --- ### GPU Computing 101
.center[
] --- ### Why so fast? - Designed for math-intensive, highly parallel computation - More transistors dedicated to ALUs/FPUs than to flow control and data caches .center[
] --- ### At what cost?
Programs must be more predictable: - Data access coherency - Non-sequential, non-aligned memory accesses are slow; no hierarchical caching support - Program flow - Branching code (loops and conditionals) is costly - Short pipelines and limited branch prediction
--- ## Lecture 19.2: GPU Architecture
.center[ Randal Burns
[Parallel Programming EN.601.\[3|4\]20](http://parallel.cs.jhu.edu) [Department of Computer Science, Johns Hopkins University](http://www.cs.jhu.edu/) 19 April 2017 these slides:
] --- ### NVIDIA GPU Architecture
- Fundamental unit is the CUDA core - Integer arithmetic logic unit (ALU) - Double-precision floating-point unit (FPU) - Fused multiply-add instruction, fully pipelined --- ### CUDA Core
.center[
] --- ### NVIDIA GPU Architecture
- Fundamental unit is the CUDA core - Integer arithmetic logic unit (ALU) - Double-precision floating-point unit (FPU) - Fused multiply-add instruction, fully pipelined - CUDA cores are grouped into streaming multiprocessors (SMs) - SMs run in SIMD lockstep --- ### Pascal SM (Streaming Multiprocessor)
.center[
] --- ### NVIDIA GPU Architecture
- Fundamental unit is the CUDA core - Integer arithmetic logic unit (ALU) - Double-precision floating-point unit (FPU) - Fused multiply-add instruction, fully pipelined - CUDA cores are grouped into streaming multiprocessors (SMs) - SMs run in SIMD lockstep - Multiple SMs are assembled into a GPC (graphics processing cluster) - think of SMs as the "cores" of the GPU processor - independent execution and scheduling - Multiple GPCs make up a GPU --- ### Pascal Architecture
.center[
] --- ### Intel Haswell for Comparison
Much of the die real estate is not dedicated to processing .center[
] --- ### GPU Stats
Total of 3840 CUDA cores - 64 CUDA cores per SM - 10 SMs per GPC - 6 GPCs Memory bandwidth - 732 GB/s Aggregate processing - 10608 GFlops, or 10+ TFlops --- ### What's new in Pascal?
Convergence of GPUs and CPUs continues - NVLink technology to coordinate multi-GPU programs
--- ### What's new in Pascal?
Extends the path from graphics to compute - Double-precision floating point - Concurrent program execution and lightweight context switching - ECC memory and some caching Why were these features not needed for graphics?