Thread scheduling
"The essence of strategy is choosing what not to do." - Michael Porter
Chapter 6: CUDA Execution and Scheduling
You've learned how to define a grid of threads and how to manage memory, but what actually happens when you launch a kernel? How does the GPU take your abstract grid and execute it on the physical hardware? This process is managed by the CUDA scheduler, a sophisticated piece of hardware designed to keep the GPU's thousands of cores as busy as possible.
Understanding this scheduling process is key to understanding CUDA performance. The scheduler operates at two levels: the device level (scheduling blocks onto SMs) and the SM level (scheduling warps within an SM).

Device-Level Scheduling: Assigning Blocks to SMs
When you launch a kernel, you are submitting a grid of thread blocks to the GPU. The GPU's global scheduler sees this grid and begins assigning blocks to available Streaming Multiprocessors (SMs).
Think of it like a dispatcher at a large factory. The dispatcher (the scheduler) has a list of jobs to be done (the grid of blocks) and a set of available factory floors (the SMs). The dispatcher assigns jobs to floors as they become available.
An SM can handle multiple thread blocks concurrently. The exact number depends on the GPU's architecture and the resource requirements (registers, shared memory) of your kernel.
Once a block is assigned to an SM, it stays there for its entire lifetime. It will not be moved to another SM.
If you launch more blocks than can fit on the GPU at once, the remaining blocks are queued up and will be assigned to SMs as the currently running blocks complete.

This process is entirely managed by the hardware, giving the GPU the flexibility to scale your workload across devices with different numbers of SMs.
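The relevant hardware limits can be inspected at runtime. A minimal host-side sketch querying the CUDA runtime's `cudaDeviceProp` struct for the per-SM resources that bound how many blocks the scheduler can place concurrently:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    // The device-level scheduler distributes blocks across this many SMs.
    printf("SM count:             %d\n", prop.multiProcessorCount);
    // Per-SM resource limits that cap how many blocks fit at once.
    printf("Registers per SM:     %d\n", prop.regsPerMultiprocessor);
    printf("Shared memory per SM: %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Max threads per SM:   %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

Dividing your kernel's per-block register and shared-memory usage into these limits gives a rough estimate of how many blocks one SM can host.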

SM-Level Scheduling: The Warp Scheduler and Latency Hiding
Once a block is assigned to an SM, the SM's internal scheduler takes over. This scheduler works not with individual threads, but with warps. As a reminder from Chapter 2, a warp is a group of 32 threads that execute instructions in lockstep.
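The warp a thread belongs to falls directly out of its index. A small sketch, assuming a one-dimensional block (`warpSize` is a built-in CUDA variable, 32 on all current hardware):

```cuda
#include <cstdio>

__global__ void show_warp_layout(void) {
    // Lane: position within the warp; warp ID: which warp within the block.
    int lane   = threadIdx.x % warpSize;   // 0..31
    int warpId = threadIdx.x / warpSize;   // 0, 1, 2, ... within this block
    if (lane == 0) {
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warpId, threadIdx.x);
    }
}
```

Launching this with a block size of 128, for example, would print four lines per block, one per warp.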
The SM scheduler's primary job is latency hiding. Accessing data from the GPU's main memory (DRAM) is slow and can take hundreds of clock cycles. If the GPU had to wait for every memory read to complete before doing more work, it would spend most of its time idle.
The warp scheduler solves this problem through oversubscription. Each SM holds many resident warps, but each of its warp schedulers can issue an instruction from only one warp per cycle. The scheduler works like this:
1. It picks a "ready" warp (one that is not waiting on anything) and issues its next instruction.
2. If that instruction is a slow memory read, the scheduler doesn't wait. It immediately marks that warp as "stalled" (waiting for data).
3. It then picks another "ready" warp from its pool and issues its next instruction.
By rapidly switching between all the available warps, the scheduler can find useful arithmetic work to do while other warps are waiting for memory operations to complete. This ability to hide memory latency is the single most important reason for a GPU's massive throughput. To achieve peak performance, you need to give the GPU enough active warps to hide this latency effectively.
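One way to check whether a kernel gives the scheduler enough warps is the occupancy API. A hedged sketch using the runtime's `cudaOccupancyMaxActiveBlocksPerMultiprocessor` (the kernel `my_kernel`, the block size of 256, and the zero bytes of dynamic shared memory are placeholder choices):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *data) { /* placeholder kernel */ }

int main(void) {
    int blockSize = 256;                 // 8 warps per block
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM, my_kernel, blockSize, /*dynamicSMem=*/0);

    // Active warps per SM = resident blocks x warps per block.
    int activeWarps = maxBlocksPerSM * blockSize / 32;
    printf("Resident blocks per SM: %d (%d active warps)\n",
           maxBlocksPerSM, activeWarps);
    return 0;
}
```

If the reported number of active warps is low, reducing register or shared-memory pressure usually lets more warps become resident, giving the scheduler more candidates to hide latency with.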

The Peril of Thread Divergence
The warp-based execution model has one crucial performance implication: thread divergence.
Because all 32 threads in a warp execute instructions in lockstep, they must all follow the same execution path. What happens if your code contains a data-dependent if-else statement?
// All threads in the warp reach this point together
if (threadIdx.x < 16) {
    do_something();        // Path A
} else {
    do_something_else();   // Path B
}
// All threads reconverge here
The warp cannot split. Instead, the hardware handles the divergence by serializing the execution paths:
1. Threads 0-15 (the first half of the warp) execute Path A while threads 16-31 are temporarily disabled, or "masked off."
2. Once Path A is complete, the mask flips: threads 0-15 are disabled, and threads 16-31 (the second half) execute Path B.
3. After both paths are complete, all 32 threads in the warp reconverge and continue executing in lockstep.
In this case, the total time taken is the sum of the time for Path A and Path B. The more the execution paths diverge within a warp, the more of the warp's potential performance is wasted. Minimizing thread divergence within a warp is a critical optimization strategy.
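When a branch is unavoidable, it only costs you if it splits a warp. One common tactic is to align the branch condition to warp boundaries so that every warp takes exactly one path. A sketch of the idea (`do_something` and `do_something_else` are stand-ins for real work, as in the snippet above):

```cuda
__device__ void do_something(float *d)      { /* Path A work */ }
__device__ void do_something_else(float *d) { /* Path B work */ }

__global__ void warp_aligned_branch(float *data) {
    // Divergent: `threadIdx.x % 32 < 16` splits every warp in half,
    // so both paths are serialized in every warp.

    // Warp-aligned: all 32 lanes of a warp compute the same warpId,
    // so each warp takes a single path and nothing is serialized.
    int warpId = threadIdx.x / warpSize;
    if (warpId % 2 == 0) {
        do_something(data);
    } else {
        do_something_else(data);
    }
}
```

The total amount of work is unchanged; what changes is that the condition is now uniform within each warp, so no warp pays for both paths.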
A warp scheduler is like a master plate-spinner at a circus: the moment one plate starts to wobble (stalls), it's already off spinning another one. The goal is to have so many plates spinning that you never notice the wobbles.