Tips & tricks
Where to declare variables?

Variable type restrictions (OLD)
Pointers could only point to memory allocated or declared in global memory:
allocated on the host and passed to the kernel:
__global__ void KernelFunction(float* ptr)
obtained as the address of a global variable:
float* ptr = &GlobalVar;
Since CUDA 4.0, Unified Virtual Addressing (64-bit OS, CC 2.x and higher) overcomes this restriction - any pointer is allowed.
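The two legal pre-UVA cases above can be sketched as follows (kernel and variable names are illustrative, not part of any API):

```cuda
#include <cstdio>

__device__ float GlobalVar;          // declared in global (device) memory

// Case 1: pointer to device memory allocated on the host and passed in
__global__ void KernelFunction(float* ptr) {
    ptr[threadIdx.x] = 1.0f;         // valid: ptr refers to device memory
}

// Case 2: pointer obtained as the address of a global variable
__global__ void UseGlobalVar() {
    float* ptr = &GlobalVar;
    *ptr = 2.0f;
}

int main() {
    float* d_data;                              // allocated on the host...
    cudaMalloc(&d_data, 32 * sizeof(float));    // ...and passed to the kernel
    KernelFunction<<<1, 32>>>(d_data);
    UseGlobalVar<<<1, 1>>>();
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```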
A common programming strategy
Global memory resides in device memory (DRAM) - much slower access than shared memory
A profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
partition data into subsets that fit into shared memory
handle each data subset with one thread block by:
loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism;
performing the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element;
copying results from shared memory back to global memory.
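The tiling strategy above can be sketched with the classic tiled matrix multiply (names and the TILE size are illustrative; for brevity the sketch assumes n is a multiple of TILE). Each block stages one TILE x TILE subset of A and B in shared memory, computes on it, then writes results back to global memory:

```cuda
#define TILE 16

__global__ void MatMulTiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];   // per-block fast storage
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // 1) load one tile from global to shared memory, one element per thread
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();               // tile fully loaded before use

        // 2) compute on the subset from shared memory
        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // done with tile before overwriting it
    }
    // 3) copy the result back to global memory
    C[row * n + col] = sum;
}
```

Each element of A and B is read from global memory only once per tile but reused TILE times from shared memory, which is where the speedup comes from.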
Constant memory also resides in device memory (DRAM) - raw access would be as slow as global memory, but:
it is cached!
highly efficient access for read-only data
Carefully divide data according to access patterns:
R/Only -> constant memory (very fast if in cache)
R/W shared within block -> shared memory (very fast)
R/W within each thread -> register (very fast)
R/W inputs/outputs (results) -> global memory (very slow)
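The mapping above can be illustrated in a single kernel (all names are illustrative):

```cuda
__constant__ float coeffs[16];            // R/Only        -> constant memory

__global__ void Scale(const float* in,    // R/W inputs/outputs -> global memory
                      float* out, int n) {
    __shared__ float tile[256];           // R/W shared within block -> shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc;                            // R/W within each thread -> register
    if (i < n) {
        tile[threadIdx.x] = in[i];        // stage input in fast shared memory
        __syncthreads();
        acc = tile[threadIdx.x] * coeffs[0];
        out[i] = acc;                     // result back to (slow) global memory
    }
}
```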
GPU atomic operations
Atomic operations on integers in shared and global memory:
associative operations on signed/unsigned ints:
atomicAdd()
atomicSub()
atomicExch()
atomicMin()
atomicMax()
atomicInc()
atomicDec()
atomicCAS()
bitwise operations: atomicAnd(), atomicOr(), atomicXor()
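A typical use of these is a histogram, sketched below (names are illustrative): many threads may increment the same bin concurrently, and atomicAdd() makes each read-modify-write a single indivisible step.

```cuda
__global__ void Histogram(const unsigned char* data, int n, unsigned int* bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // safe concurrent increment of a shared bin
}
```

Without the atomic, two threads hitting the same bin could both read the old count and both write back old+1, losing an increment.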
Typical structure of CUDA program
Global variables declaration
__host__
__device__, __global__, __constant__, texture
Function prototypes
__global__ void kernelOne(...)
float helperFunction(...)
main()
allocate memory space on the device -
cudaMalloc(&d_GlobalVariablePtr, bytes)
transfer data from host to device -
cudaMemcpy(d_GlobalVariablePtr, h_GlobalVariablePointer, ...)
execution configuration setup
kernel call -
kernelOne<<<execution configuration>>>(args...)
transfer results from device to host -
cudaMemcpy(h_GlobalVariablePtr, d_...)
optional (in test mode): compare against golden (host computed) solution
Kernel - void kernelOne(type args...)
variables declaration -
__shared__
automatic variables transparently assigned to register or local memory
__syncthreads()...
Other functions
float helperFunction(int intVar...);
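The typical structure above can be sketched end-to-end in one small program (sizes and names are illustrative):

```cuda
#include <cstdio>

__global__ void kernelOne(float* data, int n);        // function prototype

int main() {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    float h_data[1024];
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float* d_data;
    cudaMalloc(&d_data, bytes);                       // allocate on the device
    cudaMemcpy(d_data, h_data, bytes,
               cudaMemcpyHostToDevice);               // host -> device

    dim3 block(256), grid((n + 255) / 256);           // execution configuration
    kernelOne<<<grid, block>>>(d_data, n);            // kernel call

    cudaMemcpy(h_data, d_data, bytes,
               cudaMemcpyDeviceToHost);               // device -> host
    // optional (in test mode): compare h_data against a host-computed
    // golden solution here
    cudaFree(d_data);
    return 0;
}

__global__ void kernelOne(float* data, int n) {
    __shared__ float tile[256];                       // shared-memory declaration
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // automatic var -> register
    if (i < n) {
        tile[threadIdx.x] = data[i];
        __syncthreads();
        data[i] = tile[threadIdx.x] * 2.0f;
    }
}
```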
CUDA device memory space
Each thread can:
R/W per thread registers
R/W per-thread local memory (avoid)
R/W per-block shared memory
R/W per-grid global memory
Read only per-grid constant memory
Read only per-grid texture memory
The host can:
R/W global, constant, and texture memory
