Device memory access & conflicts
The way memory is accessed is critical for performance.
Global memory - coalescing access
Global memory is accessed via 32-, 64-, or 128-byte transactions.
Accesses are coalesced when threads in the same warp access contiguous data.
If coalesced - several global memory accesses can be combined into a single transaction.
If not coalesced - the hardware may have to perform several long-latency transactions => wasteful.
//kernel code:
double result = array[...];

Global memory - uncoalesced access
//kernel code:
double result = array[...];
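
The two patterns can be sketched as a pair of copy kernels (a minimal illustration; the kernel names, `stride` parameter, and index choices are not from the original):

```cuda
// Coalesced: consecutive threads touch consecutive addresses,
// so a warp's loads combine into few wide transactions.
__global__ void copyCoalesced(const double *in, double *out)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    out[idx] = in[idx];
}

// Uncoalesced: consecutive threads touch addresses stride * 8 bytes
// apart, so for stride > 1 a warp's loads scatter across many
// transactions and waste bandwidth.
__global__ void copyStrided(const double *in, double *out, int stride)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    out[idx] = in[idx * stride];
}
```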

Uncoalesced access - array of structures
//on host:
typedef struct {
    int index;
    double value;
} data;

data *myData;
cudaMalloc((void **) &myData, numVals * sizeof(data));
...
//in kernel:
int idx = blockDim.x * blockIdx.x + threadIdx.x;
int a = myData[idx].index;
double b = myData[idx].value;

Coalesced access - structure of arrays
//on host:
typedef struct {
    int *index;
    double *value;
} data;

data myData;
cudaMalloc((void **) &myData.index, numVals * sizeof(int));
cudaMalloc((void **) &myData.value, numVals * sizeof(double));
...
//in kernel:
int idx = blockDim.x * blockIdx.x + threadIdx.x;
int a = myData.index[idx];
double b = myData.value[idx];

Coalescing - aligned and sequential

Compute capability     1.0/1.1            1.2/1.3            2.0
Memory transactions    uncached           uncached           cached
                       1 * 64B at 128     1 * 64B at 128     1 * 128B at 128
                       1 * 64B at 192     1 * 64B at 192

GPU reference guide
TODO: Update with new capabilities
Coalescing - aligned and non-sequential

Compute capability     1.0/1.1            1.2/1.3            2.0
Memory transactions    uncached           uncached           cached
                       8 * 32B at 128     1 * 64B at 128     1 * 128B at 128
                       8 * 32B at 160     1 * 64B at 192
                       8 * 32B at 192
                       8 * 32B at 224

GPU reference guide
TODO: Update with new capabilities
Coalescing - misaligned and sequential

Compute capability     1.0/1.1            1.2/1.3            2.0
Memory transactions    uncached           uncached           cached
                       7 * 32B at 128     1 * 128B at 128    1 * 128B at 128
                       8 * 32B at 160     1 * 64B at 192     1 * 128B at 256
                       8 * 32B at 192     1 * 32B at 256
                       8 * 32B at 224
                       1 * 32B at 256

GPU reference guide
TODO: Update with new capabilities
Shared memory - banks & conflicts
Shared memory is organised into banks:
16 banks for CC 1.x
32 banks for CC 2.x
Successive 32-bit words assigned to successive banks
__shared__ float shared[64];

Memory access conflicts don't occur if:
all threads in the same (half-) warp read different banks
all threads in the same (half-) warp access the same data in a single bank (broadcast)
Memory access conflicts do occur if multiple threads (but not all) in the same (half-) warp access the same bank => serialized access
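
The three cases can be sketched in one kernel (a minimal illustration assuming a 32-thread block on CC 2.x, i.e. 32 banks; the kernel name is illustrative):

```cuda
__global__ void bankDemo(float *out)
{
    __shared__ float shared[64];
    int tid = threadIdx.x;          // assumes blockDim.x == 32

    shared[tid] = (float)tid;
    shared[tid + 32] = (float)tid;
    __syncthreads();

    // Stride 1: each thread in the warp hits a different bank - no conflict.
    float a = shared[tid];

    // Stride 2: threads tid and tid + 16 map to the same bank
    // (word index mod 32) - a 2-way conflict, accesses are serialized.
    float b = shared[2 * tid];

    // All threads read the same word in one bank - broadcast, no conflict.
    float c = shared[0];

    out[tid] = a + b + c;
}
```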

Memory alignment
From programmers guide:
Any access to a variable in global memory compiles to a single instruction if and only if:
The size of the data type is 1, 2, 4, 8, or 16 bytes
The data is naturally aligned (its address is a multiple of its size)
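
For user-defined types the compiler can be asked to enforce this alignment with the `__align__` qualifier; a sketch (the struct name is illustrative):

```cuda
// Without the qualifier this struct is 12 bytes and 4-byte aligned,
// so loading one element takes several instructions.
// __align__(16) pads it to 16 bytes, so a single 16-byte
// load/store instruction can be used.
struct __align__(16) Vec3 {
    float x, y, z;
};
```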

?? Matrix transpose
Short vector types
When loading an array of multi-element data structures:
a sequential per-element access pattern can use multiple times the necessary bandwidth
short vector types don't waste bandwidth, and use one instruction to load multiple elements: int2, char4, etc
it is possible to create your own short-vector types
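
A sketch of both points (kernel and struct names are illustrative):

```cuda
// Built-in short vector type: one 16-byte load fetches both doubles.
__global__ void loadPairs(const double2 *in, double *out)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    double2 p = in[idx];     // single wide load instead of two narrow ones
    out[idx] = p.x + p.y;
}

// A user-defined short-vector type: __align__(8) lets the compiler
// fetch both members with a single 8-byte load instead of two
// 4-byte loads.
struct __align__(8) IntFloat {
    int   i;
    float f;
};
```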

Memory dangers
Need to maintain alignment when reading non-naturally aligned 8-byte or 16-byte words
2D array: BaseAddress + width * ty + tx
Width of thread block & width of array should be a multiple of the warp size (half-warp on 1.x compute)
cudaMallocPitch() adds the correct padding to improve access efficiency
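
A sketch of pitched allocation (the variable names and sizes are illustrative):

```cuda
// cudaMallocPitch pads each row to 'pitch' bytes so that every row
// starts on an alignment boundary suitable for coalescing.
size_t pitch;                       // row stride in bytes, chosen by the runtime
float *devArray;
int width = 100, height = 64;       // logical 2D dimensions

cudaMallocPitch((void **)&devArray, &pitch, width * sizeof(float), height);

// In a kernel, index row ty, column tx via the pitch, not the logical width:
//   float *row = (float *)((char *)devArray + ty * pitch);
//   float element = row[tx];
```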
Dealing with memory in CUDA
Minimize memory transfers - even if it means doing inefficient calculations on the device
Coalesce all memory accesses
Favour shared memory access over global memory access
Avoid code execution branching within a single warp as this serializes the threads
Optimising memory throughput
Minimise data transfers between the host and the device
Minimise data transfers between global memory and the device by maximising use of on-chip memory (shared memory & caches)
Maximise optimal memory access patterns
