
Introduction to Heterogeneous Parallel Programming Using CUDA

Duration: 5 Days

Course Background

CUDA (Compute Unified Device Architecture) is a parallel computing platform combined with a programming model. It was developed by NVIDIA to provide programmers with direct access to the virtual instruction set and memory of the parallel computational elements in CUDA-capable GPUs. C/C++ programmers develop code in 'CUDA C/C++' and compile it with "nvcc", NVIDIA's LLVM-based C/C++ compiler. This intensive course is designed to get programmers up to speed with designing and writing C and C++ code that can take advantage of the parallel computation capabilities offered by CUDA. A final section of the course will also consider the potential benefits of parallel programming using clusters of servers equipped with powerful CUDA-capable GPUs.
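To give a flavour of the material, here is a minimal sketch (not taken from the actual course materials) of a complete CUDA C++ program of the kind developed during the course: a kernel that fills an array on the GPU, launched from host code, with the result copied back for inspection. The file name hello.cu and all identifiers are illustrative.

    #include <cstdio>

    // Trivial kernel: each thread writes its global index into the output array.
    __global__ void fillIndices(int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                  // guard against surplus threads
            out[i] = i;
    }

    int main()
    {
        const int n = 256;
        int *d_out = nullptr;
        cudaMalloc(&d_out, n * sizeof(int));           // allocate device memory

        fillIndices<<<(n + 63) / 64, 64>>>(d_out, n);  // 4 blocks of 64 threads

        int h_out[n];
        cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
        printf("h_out[200] = %d\n", h_out[200]);       // expect 200

        cudaFree(d_out);
        return 0;
    }

Such a program would be compiled and run with, for example, nvcc hello.cu -o hello && ./hello.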

Course Prerequisites and Target Audience

Attendees should be experienced C/C++ programmers with a sound knowledge of

  • Pointers and pointer operations
  • Computing using multidimensional arrays
  • File I/O and file and directory manipulation
  • Data structures and classes
  • Collection classes such as linked lists and vectors

Attendees are also expected to have some familiarity with
  • Multithreading
  • Memory allocation and memory management
  • Computer architectures and instruction sets
  • Basic techniques for code profiling and code optimisation
  • Code debugging

Course Outline

  • Overview of GPU computing
  • Data-parallel architectures and the GPU programming model
  • GPU memory model and thread cooperation
  • Introduction to CUDA
  • Different memory and variable types
  • Control flow and synchronisation
  • GPU memory management, simple CUDA kernels, shared memory and constant memory
  • Launching a kernel, copying data to/from the graphics card, error checking and printing from kernel code (illustrated in the first sketch after this outline)
  • Asynchronous operations, CUDA features, overview of CUFFT, CUBLAS, Thrust, and debugging
  • Warp shuffles and reduction / scan operations (illustrated in the second sketch after this outline)
  • Multiple GPUs
  • Introduction to Optimisation
    • Arithmetic optimisations, occupancy calculations and memory access patterns
    • CUDA profiling
    • Resource management, latency and occupancy
    • Memory performance optimisations
    • Asynchronous operations
    • Profiling applications
    • OpenACC
  • Case studies and labs
    • Monte Carlo simulation using NVIDIA's CURAND library
    • Constant memory, random number generation, kernel timing, minimising device memory bandwidth requirements
    • 3D Laplace and ADI finite difference solvers
    • Thread block size optimisation - dynamic shared memory, thread synchronisation and reduction
    • Working with the CUBLAS and CUFFT libraries
    • Solving tri-diagonal equations - libraries, templates and g++
    • Scan operations and recurrence relations
    • Pattern matching
    • Auto tuning
  • Deploying CUDA systems in the cloud
  • Coarse-grained vs. fine-grained parallelisation
  • CUDA and OpenFOAM - an overview
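
As an illustration of the kernel launch, host/device data transfer, error checking and in-kernel printing topics listed in the outline, the following is a hedged sketch; the CUDA_CHECK macro and all identifiers are illustrative, not taken from the course materials.

    #include <cstdio>
    #include <cstdlib>

    // Check the result of a CUDA runtime call and abort on failure.
    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess) {                                     \
                fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                        cudaGetErrorString(err), __FILE__, __LINE__);     \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)

    // Each thread doubles one element; thread 0 also prints from device code.
    __global__ void doubleElements(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i == 0)
            printf("kernel launched with %d threads per block\n", blockDim.x);
        if (i < n)
            x[i] *= 2.0f;
    }

    int main()
    {
        const int n = 1024;
        float h_x[n];
        for (int i = 0; i < n; ++i) h_x[i] = (float)i;

        float *d_x;
        CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));
        CUDA_CHECK(cudaMemcpy(d_x, h_x, n * sizeof(float),
                              cudaMemcpyHostToDevice));

        doubleElements<<<(n + 255) / 256, 256>>>(d_x, n);
        CUDA_CHECK(cudaGetLastError());       // detect launch errors
        CUDA_CHECK(cudaDeviceSynchronize());  // detect execution errors, flush printf

        CUDA_CHECK(cudaMemcpy(h_x, d_x, n * sizeof(float),
                              cudaMemcpyDeviceToHost));
        printf("h_x[3] = %.1f (expect 6.0)\n", h_x[3]);

        CUDA_CHECK(cudaFree(d_x));
        return 0;
    }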
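Similarly, here is a minimal sketch of the kind of warp-shuffle reduction covered under "Warp shuffles and reduction / scan operations"; it handles the single-warp case only, and all identifiers are illustrative.

    #include <cstdio>

    // Sum the values held by the 32 threads of a warp using __shfl_down_sync.
    // After the loop, lane 0 holds the total for the whole warp.
    __device__ float warpReduceSum(float val)
    {
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xffffffff, val, offset);
        return val;
    }

    __global__ void sumOneWarp(const float *in, float *out)
    {
        float v = warpReduceSum(in[threadIdx.x]);  // threadIdx.x in [0, 32)
        if (threadIdx.x == 0)
            *out = v;                              // lane 0 writes the result
    }

    int main()
    {
        float h_in[32], h_out, *d_in, *d_out;
        for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;   // expected sum: 32

        cudaMalloc(&d_in, 32 * sizeof(float));
        cudaMalloc(&d_out, sizeof(float));
        cudaMemcpy(d_in, h_in, 32 * sizeof(float), cudaMemcpyHostToDevice);

        sumOneWarp<<<1, 32>>>(d_in, d_out);            // one block, one warp

        cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
        printf("warp sum = %.1f\n", h_out);

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }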