CUDA programming model

API - Application Programming Interface

The API is an extension to the C programming language.

It consists of:

  • language extensions - to target portions of the code for execution on the device

  • a runtime library, split into:

    • a common component providing built-in vector types and a subset of the C runtime library in both host and device codes

    • a host component to control and access one or more devices from the host

    • a device component providing device-specific functions

The API is an extension to the ANSI C programming language - low learning curve

The hardware is designed to enable lightweight runtime and driver - high performance

Global memory: allocate, copy, free

cudaError_t cudaMalloc(void **devPtr, size_t size)

cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind)
//kind: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, ...

cudaError_t cudaMemset(void *devPtr, int value, size_t count)

cudaError_t cudaFree(void *devPtr)

Device memory allocation

cudaMalloc():

  • allocates object in the device global memory

  • requires 2 parameters:

    • address of a pointer to the allocated object

    • size of the allocated object

cudaFree()

  • frees object from device global memory

  • requires one parameter: pointer to the object to be freed

Code example

  • allocate a 64*64 single-precision float array

  • attach the allocated storage to Md

  • "d" is often used to indicate a device data structure
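The allocation described above can be sketched as follows (WIDTH is an assumed name for the 64-element dimension):

```cuda
#include <cuda_runtime.h>

#define WIDTH 64

int main(void) {
    float *Md;                                     // "d" marks a device pointer
    size_t size = WIDTH * WIDTH * sizeof(float);   // 64*64 single-precision floats

    cudaMalloc((void **)&Md, size);   // allocate in device global memory
    cudaMemset(Md, 0, size);          // optionally zero the allocation

    // ... use Md in kernels ...

    cudaFree(Md);                     // release device global memory
    return 0;
}
```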

Host-Device data transfer

cudaMemcpy()

  • memory data transfer

  • requires four parameters:

    • pointer to destination

    • pointer to source

    • number of bytes copied

    • type of transfer

      • host to host

      • host to device

      • device to host

      • device to device

  • cudaMemcpy() is synchronous: it returns only after the transfer completes (cudaMemcpyAsync() is the asynchronous variant)

Code example:

  • transfer a 64 * 64 single-precision float array

  • M is in host memory and Md is in device memory
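A minimal sketch of the transfer described above, reusing the M/Md names from the example:

```cuda
#include <cuda_runtime.h>

#define WIDTH 64

int main(void) {
    float M[WIDTH * WIDTH];                        // host array
    float *Md;                                     // device array
    size_t size = WIDTH * WIDTH * sizeof(float);

    cudaMalloc((void **)&Md, size);

    // host to device
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

    // ... launch kernels operating on Md ...

    // device to host
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);

    cudaFree(Md);
    return 0;
}
```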

CUDA function declarations

  Example                          Executed on the:   Only callable from the:
  __device__ float DeviceFunc()    device             device
  __global__ void KernelFunc()     device             host
  __host__ float HostFunc()        host               host

  • __host__ is the default qualifier and is overridden by the other qualifiers

  • __global__ defines a kernel function and MUST return void

  • __device__ and __host__ can be used together to make the same function callable from both host and device; the compiler generates two versions of the same function
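A short sketch of the qualifiers in use (the function names square and twice are illustrative, not from the original):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// callable only from device code
__device__ float square(float x) { return x * x; }

// compiled twice: one host version, one device version
__host__ __device__ float twice(float x) { return 2.0f * x; }

// kernel: executed on the device, launched from the host, returns void
__global__ void KernelFunc(float *out) {
    out[0] = square(twice(3.0f));
}

int main(void) {
    printf("%f\n", twice(3.0f));  // the __host__ version is called here
    return 0;
}
```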

Kernel execution

Parallel code (kernel) is launched and executed on a device by many threads.

Threads are grouped into thread blocks.

Parallel code is written for a thread

  • each thread is free to execute a unique code path

  • built-in thread and block ID variables

Threads launched for a parallel section are partitioned in thread blocks: grid = all blocks for a given kernel execution.

Thread block is a group of threads that can:

  • synchronize their execution

  • communicate via shared memory

Kernel launch is asynchronous: the host does not wait for the kernel to finish unless told to do so (e.g. with cudaDeviceSynchronize())
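The points above can be sketched in a minimal launch (the kernel name addOne and the sizes are assumptions for illustration):

```cuda
#include <cuda_runtime.h>

#define N 256

// parallel code is written for a single thread;
// built-in variables give each thread its ID
__global__ void addOne(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < N)
        data[i] += 1.0f;
}

int main(void) {
    float *data_d;
    cudaMalloc((void **)&data_d, N * sizeof(float));
    cudaMemset(data_d, 0, N * sizeof(float));

    // grid of 2 blocks * 128 threads; the launch returns immediately
    addOne<<<2, 128>>>(data_d);

    // the host blocks here until the kernel has finished
    cudaDeviceSynchronize();

    cudaFree(data_d);
    return 0;
}
```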

Exercises

  1. Array copy: arraySrc_h -[copy]-> arraySrc_d =[copy]=> arrayDest_d -[copy]-> arrayDest_h ==> compare arraySrc_h with arrayDest_h

  2. Array reversal: host -> device -[swap first with last, and so on]-> device -> host -> test (multiple blocks + error checking)
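A possible skeleton for exercise 1 (the array length N is arbitrary; the device-to-device step uses cudaMemcpyDeviceToDevice):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

#define N 1024

int main(void) {
    float arraySrc_h[N], arrayDest_h[N];
    float *arraySrc_d, *arrayDest_d;
    size_t size = N * sizeof(float);

    for (int i = 0; i < N; ++i) arraySrc_h[i] = (float)i;

    cudaMalloc((void **)&arraySrc_d, size);
    cudaMalloc((void **)&arrayDest_d, size);

    // host -> device -> device -> host
    cudaMemcpy(arraySrc_d, arraySrc_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(arrayDest_d, arraySrc_d, size, cudaMemcpyDeviceToDevice);
    cudaMemcpy(arrayDest_h, arrayDest_d, size, cudaMemcpyDeviceToHost);

    // compare source and destination on the host
    printf("%s\n", memcmp(arraySrc_h, arrayDest_h, size) == 0 ? "match" : "MISMATCH");

    cudaFree(arraySrc_d);
    cudaFree(arrayDest_d);
    return 0;
}
```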

“The best way of learning about anything is by doing.” — Richard Branson
