The way memory is accessed is critical for performance
Global memory - coalescing access
Global memory is accessed via 32-, 64-, or 128-byte transactions.
Accesses can be coalesced - threads in the same warp access contiguous data.
If coalesced - several threads' global memory accesses are served by a single transaction.
If not coalesced - may require several long-latency transactions => wasteful.
//kernel code:
double result = array[...];
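To make the contrast concrete, here is a minimal sketch (kernel names are illustrative, not from the slides): the first kernel's unit-stride indexing coalesces, the second's strided indexing does not.

// Consecutive threads touch consecutive addresses: coalesced.
__global__ void copy_coalesced(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];    // one transaction can serve a whole warp segment
}

// Consecutive threads are `stride` elements apart: uncoalesced for stride > 1.
__global__ void copy_strided(const double *in, double *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];    // the warp's loads scatter across many transactions
}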
Figures omitted (diagrams from the GPU reference guide):
Global memory - coalesced access
Global memory - uncoalesced access
Uncoalesced access - array of structures
Coalesced access - structure of arrays
Coalescing - aligned and sequential
Coalescing - aligned and non-sequential
Coalescing - misaligned and sequential
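The array-of-structures vs. structure-of-arrays contrast in the figures above can be summarised in code. A minimal sketch (names are illustrative, not from the slides):

struct Particle { float x, y, z; };    // 12-byte element

// AoS: thread i loads p[i].x, so consecutive threads are 12 bytes apart
// and a warp's loads span many memory segments (uncoalesced).
__global__ void read_aos(const Particle *p, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = p[i].x;
}

// SoA: consecutive threads load consecutive floats from x[],
// so each warp's access is contiguous and coalesces well.
__global__ void read_soa(const float *x, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i];
}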
Shared memory - banks & conflicts
Shared memory is organised into banks:
Successive 32-bit words assigned to successive banks
__shared__ float shared[64];
Memory access conflicts don't occur if (illustrated below):
all threads in the same (half-)warp read from different banks
all threads in the same (half-)warp access the same word in a single bank (broadcast)
Memory access conflicts do occur if multiple threads (but not all) in the same (half-)warp access the same bank => accesses are serialized
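A minimal sketch of the three cases (on CC 1.x there are 16 banks and conflicts are per half-warp; the code below assumes 32 banks of 32-bit words, but the stride rules are the same):

// Launch with one warp (32 threads) per block.
__global__ void bank_demo(float *out)
{
    __shared__ float shared[64];
    shared[threadIdx.x]      = (float)threadIdx.x;        // fill all 64 words
    shared[threadIdx.x + 32] = (float)(threadIdx.x + 32);
    __syncthreads();

    float a = shared[threadIdx.x];       // stride 1: every thread hits a different bank - no conflict
    float b = shared[0];                 // all threads read the same word - broadcast, no conflict
    float c = shared[threadIdx.x * 2];   // stride 2: two threads per bank - 2-way conflict, serialized
    out[threadIdx.x] = a + b + c;
}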
Figure omitted: Shared memory - bank conflicts (CC 1.x)
Memory alignment
From the programming guide:
Any access to a variable in global memory compiles to a single instruction if and only if:
The size of the data type is 1, 2, 4, 8, or 16 bytes
The data is naturally aligned (its address is a multiple of its size)
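For example (a sketch; the type name float3a is ours), __align__() can be used to make a struct meet both rules:

// 12 bytes of members: without the specifier, sizeof == 12, which is not
// one of {1, 2, 4, 8, 16}, so the fields are loaded with separate instructions.
struct __align__(16) float3a { float x, y, z; };
// Padded to 16 bytes and 16-byte aligned: a load of a whole float3a
// can now compile to a single 16-byte load instruction.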
Matrix transpose
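Transpose is the canonical example combining both topics: a naive transpose cannot have coalesced reads and coalesced writes at the same time, but staging a tile through shared memory fixes this. A sketch of the standard tiled approach (not from the slides; launch with dim3(TILE, TILE) thread blocks):

#define TILE 16
__global__ void transpose(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 padding avoids bank conflicts on column reads

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;     // swap block coordinates for the write
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}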
Short vector types
An array of multi-element data structures?
A sequential, one-element-at-a-time access pattern uses several times the necessary bandwidth.
Short vector types don't waste bandwidth and use one instruction to load multiple elements: int2, char4, etc. (see the sketch after this list).
It is possible to create your own short-vector types.
Alignment must be maintained when reading: non-naturally aligned 8-byte or 16-byte words produce incorrect results.
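A minimal sketch of a short-vector load (kernel name is ours; cudaMalloc returns base addresses that satisfy the 16-byte alignment requirement):

__global__ void scale4(float4 *data, float s, int n4)   // n4 = element count / 4
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = data[i];          // one 16-byte load instead of four 4-byte loads
        v.x *= s; v.y *= s; v.z *= s; v.w *= s;
        data[i] = v;                 // one 16-byte store
    }
}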
2D Array: BaseAddress + width * ty + tx
Width of the thread block & width of the array should be a multiple of the warp size (half-warp on 1.x compute capability).
cudaMallocPitch() adds the correct padding to improve access efficiency
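A sketch of its use (example dimensions are ours):

int rows = 1024, cols = 500;    // 500 floats per row is not a multiple of the transaction size
float *d_mat;
size_t pitch;                   // padded row stride in bytes, >= cols * sizeof(float)
cudaMallocPitch((void **)&d_mat, &pitch, cols * sizeof(float), rows);

// Inside a kernel, step through rows by the pitch, not the logical width:
// float *row = (float *)((char *)d_mat + ty * pitch);
// float v = row[tx];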
Dealing with memory in CUDA
Minimize memory transfers - even if it means doing inefficient calculations on the device
Coalesce all memory accesses
Favour shared memory access over global memory access
Avoid code execution branching within a single warp as this serializes the threads
Optimising memory throughput
Minimise data transfers between the host and the device (see the sketch below)
Minimise data transfers between global memory and the device by maximising use of on-chip memory (shared memory & caches)
Maximise use of optimal memory access patterns
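For the first point, one common tactic is a single bulk copy from pinned (page-locked) host memory rather than many small pageable copies. A sketch (buffer names and size are ours):

const int N = 1 << 20;
float *h_buf, *d_buf;
cudaMallocHost((void **)&h_buf, N * sizeof(float));   // pinned host allocation: higher copy bandwidth,
                                                      // and enables overlap via cudaMemcpyAsync
cudaMalloc((void **)&d_buf, N * sizeof(float));
// ... fill h_buf on the host, then copy once in bulk:
cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);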