Device occupancy

The number of blocks assigned to each SM depends on the kernel's resource usage

| Description (per SM) | Limit (compute 1.3) | Limit (compute 2.0) |
| --- | --- | --- |
| Max threads | 1024 | 1536 |
| Max thread blocks | 8 | 8 |
| Available shared memory (bytes) | 16384 | 49152 |
| Available 32-bit registers | 16384 | 32768 |

If a kernel's blocks would exceed any of these limits, the number of blocks assigned to the SM is reduced until they fit
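The rule above can be sketched as "take the tightest of the per-SM limits". The snippet below is a minimal host-side illustration (plain C++11, which nvcc also accepts), not how the hardware scheduler is actually implemented. The names `SMLimits`, `kFermi`, `fit` and `blocksPerSM` are made up for this example, the numbers are the compute capability 2.0 limits (including the assumed limit of 8 resident blocks per SM), and allocation granularity for registers and shared memory is ignored, so real devices may admit fewer blocks.

```cpp
// Per-SM resource limits (hypothetical helper type for this sketch).
struct SMLimits {
    int maxThreads;      // resident threads per SM
    int maxBlocks;       // resident thread blocks per SM
    int sharedMemBytes;  // shared memory per SM, in bytes
    int registers;       // 32-bit registers per SM
};

// Compute capability 2.0 (Fermi) values; the block limit of 8 is an assumption
// taken from NVIDIA's published limits.
constexpr SMLimits kFermi = {1536, 8, 49152, 32768};

constexpr int minInt(int a, int b) { return a < b ? a : b; }

// Blocks allowed by one resource. A kernel that uses none of the resource
// is not constrained by it, so we fall back to the block-slot limit.
constexpr int fit(int capacity, int perBlock, int fallback) {
    return perBlock > 0 ? capacity / perBlock : fallback;
}

// Blocks that can be resident on one SM: the minimum over all four limits.
// Assumes threadsPerBlock >= 1.
constexpr int blocksPerSM(SMLimits sm, int threadsPerBlock,
                          int regsPerThread, int smemPerBlock) {
    return minInt(
        minInt(sm.maxBlocks, sm.maxThreads / threadsPerBlock),
        minInt(fit(sm.registers, threadsPerBlock * regsPerThread, sm.maxBlocks),
               fit(sm.sharedMemBytes, smemPerBlock, sm.maxBlocks)));
}

// Compile-time checks for a few hypothetical kernels.
static_assert(blocksPerSM(kFermi, 64, 8, 0) == 8, "limited by block slots");
static_assert(blocksPerSM(kFermi, 256, 20, 4096) == 6, "limited by threads and registers");
static_assert(blocksPerSM(kFermi, 256, 32, 0) == 4, "limited by registers");
static_assert(blocksPerSM(kFermi, 256, 20, 16384) == 3, "limited by shared memory");
```

Under these assumptions, raising any one of the per-block costs (threads, registers, shared memory) can only keep the block count the same or lower it.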

Greater occupancy is desirable because it helps to hide latency

There are 32768 registers in each SM in Fermi

this is an implementation decision, not part of CUDA
registers are dynamically partitioned across all blocks assigned to the SM
once assigned to a block, a register is NOT accessible by threads in other blocks
each thread in a block can access only the registers assigned to itself

If each block has 512 threads and each thread uses 16 registers, how many threads can run on each SM?

What if each thread uses one more register?
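One way to work through the two questions above is to do the arithmetic explicitly. The sketch below is host-side C++11 with hypothetical helper names (`blocksByRegisters`, `residentThreads`). It assumes registers are handed out per whole block, that the Fermi limit of 1536 resident threads per SM also applies, and it ignores register allocation granularity.

```cpp
constexpr int kRegistersPerSM  = 32768;  // Fermi register file, as stated above
constexpr int kMaxThreadsPerSM = 1536;   // compute capability 2.0 thread limit (assumed)

constexpr int minInt(int a, int b) { return a < b ? a : b; }

// Whole blocks that fit when only the register file is considered.
constexpr int blocksByRegisters(int threadsPerBlock, int regsPerThread) {
    return kRegistersPerSM / (threadsPerBlock * regsPerThread);
}

// Resident threads once the per-SM thread limit is applied as well.
constexpr int residentThreads(int threadsPerBlock, int regsPerThread) {
    return threadsPerBlock *
           minInt(blocksByRegisters(threadsPerBlock, regsPerThread),
                  kMaxThreadsPerSM / threadsPerBlock);
}

// 16 registers: 512 * 16 = 8192 registers per block, so 4 blocks fit.
static_assert(blocksByRegisters(512, 16) == 4, "4 blocks by registers");
// 17 registers: 512 * 17 = 8704 registers per block, so only 3 blocks fit.
static_assert(blocksByRegisters(512, 17) == 3, "3 blocks by registers");

// With the 1536-thread limit, both cases end up at 3 blocks.
static_assert(residentThreads(512, 16) == 1536, "thread limit binds");
static_assert(residentThreads(512, 17) == 1536, "registers and threads both allow 3 blocks");
```

Looking at registers alone, one extra register per thread costs a whole block (2048 threads' worth of registers drops to 1536), because blocks are assigned as a unit. On a device with the 1536-thread limit the thread limit already caps the SM at 3 blocks, so the resident thread count is 1536 in both cases under these assumptions.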

Dynamic partitioning gives more flexibility to compilers/programmers:

one can run a small number of threads that each need many registers, or a large number of threads that each need few registers; this allows finer-grained threading than traditional CPU threading models
the compiler can trade off instruction-level parallelism against thread-level parallelism; using more registers per thread might improve kernel performance enough to make up for having fewer threads to schedule
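The trade-off described above can be made concrete with a rough upper bound on resident threads as a function of registers per thread. This is a simplified host-side C++11 sketch: `maxResidentThreads` is a hypothetical helper, the 1536-thread cap and the 63-register example are assumed Fermi (compute capability 2.0) values, and block and warp granularity are ignored.

```cpp
constexpr int kRegistersPerSM  = 32768;  // Fermi register file
constexpr int kMaxThreadsPerSM = 1536;   // compute capability 2.0 thread limit (assumed)

// Upper bound on resident threads for a given register use per thread.
// Assumes regsPerThread >= 1.
constexpr int maxResidentThreads(int regsPerThread) {
    return kRegistersPerSM / regsPerThread < kMaxThreadsPerSM
               ? kRegistersPerSM / regsPerThread
               : kMaxThreadsPerSM;
}

// Few registers per thread: the thread limit, not the register file, binds.
static_assert(maxResidentThreads(16) == 1536, "many light threads");
// More registers per thread: fewer threads can be resident.
static_assert(maxResidentThreads(32) == 1024, "fewer, heavier threads");
static_assert(maxResidentThreads(63) == 520, "few very heavy threads");
```

Whether the heavier or the lighter configuration runs faster depends on the kernel, which is why this is a trade-off rather than a rule.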

--ptxas-options=-v

This gives information about used registers, shared memory per block (user and system), and constant memory.

$ nvcc --ptxas-options=-v tensorGPU.cu -o tensorGPU

ptxas info    : Used 3 registers, 8+16 bytes smem, 20 bytes cmem[0]

Details:

Can help find the sweet spot for the block size

Can highlight the limiting factors for occupancy

May have to experiment with block size to see what works best

Occupancy isn't the most important thing: a kernel that uses more registers per thread can still run faster at lower occupancy
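To illustrate how block size interacts with the limiting factors listed above, the following host-side C++11 sketch estimates occupancy as resident threads divided by the per-SM thread limit and reports which limit binds first. Everything here is an assumption for illustration: the names (`Limiter`, `blocks`, `occupancyPercent`, `limiter`), the compute capability 2.0 limits, block sizes that are multiples of the warp size, and the hypothetical kernels used in the checks. Allocation granularity is ignored.

```cpp
// Compute capability 2.0 per-SM limits (assumed for this sketch).
constexpr int kMaxBlocks  = 8;
constexpr int kMaxThreads = 1536;
constexpr int kRegisters  = 32768;
constexpr int kSmemBytes  = 49152;

enum Limiter { kBlockSlots, kThreads, kRegisterFile, kSharedMemory };

constexpr int minInt(int a, int b) { return a < b ? a : b; }

// Blocks allowed by each resource (t = threads per block, r = registers per
// thread with r >= 1, s = shared memory bytes per block).
constexpr int byThreads(int t) { return kMaxThreads / t; }
constexpr int byRegs(int t, int r) { return kRegisters / (t * r); }
constexpr int bySmem(int s) { return s > 0 ? kSmemBytes / s : kMaxBlocks; }

constexpr int blocks(int t, int r, int s) {
    return minInt(minInt(kMaxBlocks, byThreads(t)),
                  minInt(byRegs(t, r), bySmem(s)));
}

// Occupancy as a percentage of the thread limit, rounded down.
constexpr int occupancyPercent(int t, int r, int s) {
    return 100 * blocks(t, r, s) * t / kMaxThreads;
}

// First limit (in the order slots, threads, registers, shared memory)
// that equals the resulting block count.
constexpr Limiter limiter(int t, int r, int s) {
    return blocks(t, r, s) == kMaxBlocks      ? kBlockSlots
         : blocks(t, r, s) == byThreads(t)    ? kThreads
         : blocks(t, r, s) == byRegs(t, r)    ? kRegisterFile
                                              : kSharedMemory;
}

// Hypothetical kernel: 20 registers per thread, no shared memory.
static_assert(occupancyPercent(64, 20, 0) == 33, "small blocks run out of block slots");
static_assert(limiter(64, 20, 0) == kBlockSlots, "");
static_assert(occupancyPercent(192, 20, 0) == 100, "a sweet spot for this kernel");
static_assert(occupancyPercent(1024, 20, 0) == 66, "only one large block fits");
static_assert(limiter(1024, 20, 0) == kThreads, "");

// Hypothetical kernel: 32 registers per thread, no shared memory.
static_assert(occupancyPercent(256, 32, 0) == 66, "register file binds");
static_assert(limiter(256, 32, 0) == kRegisterFile, "");
```

For the first hypothetical kernel, both very small and very large blocks lose occupancy for different reasons, which is why some experimentation with block size is usually needed. Measured run time, not the occupancy estimate, should decide the final choice.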

