Tips & tricks
Where to declare variables?

Variable type restrictions (OLD)
Pointers can only point to memory allocated or declared in global memory, either:
allocated on the host and passed to the kernel:
__global__ void KernelFunction(float* ptr)
or obtained as the address of a global variable:
float* ptr = &GlobalVar;
Since CUDA 4.0, Unified Virtual Addressing (on 64-bit OSes with compute capability 2.x) removes this restriction: any pointer is allowed.
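A minimal sketch of the first case, a device pointer allocated on the host and passed to the kernel (the names d_data, h_data and the size 256 are illustrative, not from the original notes):

#include <cuda_runtime.h>

__global__ void KernelFunction(float* ptr)
{
    // ptr points to device global memory allocated by the host
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    ptr[i] *= 2.0f;
}

int main()
{
    const int n = 256;
    float h_data[256];
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));                               // allocate on the device
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
    KernelFunction<<<1, n>>>(d_data);                                     // pass the device pointer
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return 0;
}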
A common programming strategy
Global memory resides in device memory (DRAM) - much slower access than shared memory
A profitable way of performing computation on the device is to tile the data to take advantage of fast shared memory (see the sketch after this list):
partition data into subsets that fit into shared memory
handle each data subset with one thread block by:
loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism;
performing the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element;
copying results from shared memory to global memory;
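A minimal sketch of this strategy, using tiled multiplication of square N x N matrices as the example; TILE, the parameter names, and the assumption that N is a multiple of TILE are illustrative choices, not from the original notes:

#define TILE 16

__global__ void tiledMatMul(const float* A, const float* B, float* C, int N)
{
    __shared__ float As[TILE][TILE];   // per-block data subsets in fast shared memory
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // 1. load one tile of A and B cooperatively (memory-level parallelism)
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // 2. compute on the subset from shared memory; each element is reused TILE times
        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    // 3. copy the result back to global memory
    C[row * N + col] = sum;
}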
Constant memory also resides in device memory (DRAM) - much slower raw access than shared memory, but:
cached!
highly efficient access for read-only data
Carefully divide data according to access patterns:
R/Only -> constant memory (very fast if in cache)
R/W shared within block -> shared memory (very fast)
R/W within each thread -> register (very fast)
R/W inputs/outputs (results) -> global memory (very slow)
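A short sketch mapping each pattern to a declaration (all names here are hypothetical):

__constant__ float coeffs[16];          // read-only for the whole grid -> constant memory

__global__ void kernel(const float* in, float* out)   // inputs/outputs -> global memory
{
    __shared__ float tile[256];         // R/W shared within the block -> shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;                   // R/W within each thread -> register
    tile[threadIdx.x] = in[i];
    __syncthreads();
    acc = tile[threadIdx.x] * coeffs[0];
    out[i] = acc;
}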
GPU atomic operations
Atomic operations on integers in shared and global memory:
associative operations on signed/unsigned ints:
atomicAdd(), atomicSub(), atomicExch(), atomicMin(), atomicMax(), atomicInc(), atomicDec(), atomicCAS()
some operations on bits: atomicAnd(), atomicOr(), atomicXor()
Available starting with compute capability 1.1 for global memory and 1.2 for shared memory.
atomicAdd() is also available for float starting with compute capability 2.0.
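As an illustration, a per-block histogram sketch that uses atomicAdd() in both shared and global memory (hist, data, and BINS are hypothetical names; the block size is assumed to be at least BINS threads):

#define BINS 64

__global__ void histogram(const unsigned int* data, int n, unsigned int* hist)
{
    __shared__ unsigned int localHist[BINS];
    if (threadIdx.x < BINS)
        localHist[threadIdx.x] = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&localHist[data[i] % BINS], 1u);   // atomic add in shared memory (CC >= 1.2)
    __syncthreads();

    if (threadIdx.x < BINS)
        atomicAdd(&hist[threadIdx.x], localHist[threadIdx.x]);  // merge into global memory (CC >= 1.1)
}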
Typical structure of CUDA program
Global variables declaration
__host__, __device__, __global__, __constant__, texture
Function prototypes
__global__ void kernelOne(...)
float helperFunction(...)
main()
allocate memory space on the device - cudaMalloc(&d_GlobalVariablePtr, bytes)
transfer data from host to device - cudaMemcpy(d_GlobalVariablePtr, h_GlobalVariablePtr, ...)
execution configuration setup
kernel call - kernelOne<<<execution configuration>>>(args...)
transfer results from device to host - cudaMemcpy(h_GlobalVariablePtr, d_...)
optional (in test mode): compare against a golden (host-computed) solution
Kernel - __global__ void kernelOne(type args...)
variables declaration - __shared__
automatic variables are transparently assigned to registers or local memory
__syncthreads()...
Other functions
float helperFunction(int intVar...);
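Putting the pieces together, a minimal sketch of this structure (the sizes, names, and the doubling computation are illustrative, not from the original notes):

#include <cuda_runtime.h>
#include <stdio.h>

// global variables declaration
__constant__ float d_scale;

// function prototypes
__global__ void kernelOne(float* d_data, int n);
float helperFunction(float x);

int main()
{
    const int n = 1024;
    const int bytes = n * sizeof(float);
    float h_data[1024];
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    // allocate memory space on the device
    float* d_data;
    cudaMalloc(&d_data, bytes);

    // transfer data from host to device
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    float h_scale = 2.0f;
    cudaMemcpyToSymbol(d_scale, &h_scale, sizeof(float));

    // execution configuration setup and kernel call
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    kernelOne<<<grid, block>>>(d_data, n);

    // transfer results from device to host
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    // optional (test mode): compare against a golden (host-computed) solution
    for (int i = 0; i < n; ++i)
        if (h_data[i] != helperFunction((float)i)) { printf("mismatch at %d\n", i); break; }

    cudaFree(d_data);
    return 0;
}

__global__ void kernelOne(float* d_data, int n)
{
    __shared__ float tile[256];                       // shared variable declaration
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // automatic variable -> register
    if (i < n) tile[threadIdx.x] = d_data[i];
    __syncthreads();
    if (i < n) d_data[i] = tile[threadIdx.x] * d_scale;
}

float helperFunction(float x) { return 2.0f * x; }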
CUDA device memory space
Each thread can:
R/W per thread registers
R/W per-thread local memory (avoid)
R/W per-block shared memory
R/W per-grid global memory
Read-only per-grid constant memory
Read-only per-grid texture memory
The host can:
R/W global, constant, and texture memory
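For example, the host reads and writes constant memory through the runtime API; a minimal sketch, assuming a hypothetical coeffs symbol:

__constant__ float coeffs[16];

void setCoefficients(const float* h_coeffs)
{
    // host-side write to per-grid constant memory
    cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));
}

void getCoefficients(float* h_coeffs)
{
    // host-side read back from constant memory
    cudaMemcpyFromSymbol(h_coeffs, coeffs, 16 * sizeof(float));
}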
