
"The processor is a cat that is always trying to get into a box, and the box is your program." - John D. Cook

Chapter 2: Under the Hood: How GPUs Work

To write effective CUDA code, you don't need to be a hardware engineer, but you do need a solid mental model of how a GPU is built. A GPU is not just a more powerful CPU; it's a fundamentally different kind of processor, designed for a different kind of work. The key to understanding this difference is a single concept: latency vs. throughput.

The Boutique Workshop vs. The Assembly Line

Let's refine our analogy.

  • A CPU is a boutique workshop run by a handful of master artisans (the cores). Each artisan is a generalist genius who can switch between complex tasks—from delicate woodworking to heavy metal forging—at a moment's notice. Their goal is to complete each individual custom project as fast as possible. This is low-latency design. They have a large, private workbench (L1/L2 cache) and a vast array of complex tools (a rich instruction set) to minimize the time for any one job.

  • A GPU is a massive factory assembly line. The factory is filled with thousands of apprentice workers (the CUDA cores). Each worker has only one simple tool and is trained to do one specific, repetitive task, like "tighten this bolt." The goal isn't the speed of a single product, but the sheer number of products that roll off the line every hour. This is high-throughput design. The entire factory's efficiency depends on keeping the main conveyor belt (the memory bus) constantly loaded with materials to feed the army of workers.

A diagram comparing a CPU with a few large cores to a GPU with many small cores.

Architectural Deep Dive: The Streaming Multiprocessor

So, how is this "factory" organized? A modern NVIDIA GPU is built around a hierarchy of processing units. The most important of these is the Streaming Multiprocessor (SM).

Think of the SM as a complete, self-contained factory floor within the larger GPU complex. Each GPU has many SMs, and each SM is an independent processor packed with its own resources:

  1. CUDA Cores: These are the workers. Each SM contains dozens to over a hundred of these simple arithmetic logic units (ALUs) that perform the actual calculations.

  2. Warp Scheduler: This is the floor manager. It doesn't give instructions to individual workers. Instead, it groups 32 workers (threads) into a team called a warp. The scheduler then issues a single instruction, and all 32 threads in the warp execute it in lockstep on their own private data. This "single instruction, multiple thread" (SIMT) execution is NVIDIA's take on the SIMD model we discussed in Chapter 1.

  3. Shared Memory / L1 Cache: This is the communal workbench on the factory floor. It's a small but extremely fast pool of memory that all cores on that SM can share. This allows threads within the same team (thread block) to collaborate, share results, and avoid slow trips to the main warehouse (global memory); the sketch after this list shows what that collaboration looks like in code.

  4. Registers: This is each worker's personal toolbelt. It's the absolute fastest memory available, but it's private to a single thread.
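
To make the "communal workbench" idea concrete, here is a minimal sketch of a kernel in which every thread in a block stages one value in shared memory, the whole team synchronizes, and the block then cooperates on a tree-style sum without another trip to global memory. The kernel name (blockSum) and the 256-thread block size are illustrative assumptions, not code from a specific library.

    // Each block of 256 threads sums 256 input elements. Launch with
    // blockDim.x == 256 so the shared-memory tile matches the block size.
    __global__ void blockSum(const float* in, float* partialSums, int n)
    {
        __shared__ float tile[256];              // the block's communal workbench
        int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global index
        int lid = threadIdx.x;                            // index within the block

        tile[lid] = (gid < n) ? in[gid] : 0.0f;  // each worker stages one value
        __syncthreads();                         // wait until the tile is full

        // Tree reduction entirely in fast shared memory: no global traffic.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                tile[lid] += tile[lid + stride];
            __syncthreads();
        }

        if (lid == 0)
            partialSums[blockIdx.x] = tile[0];   // one partial sum per block
    }

Notice where everything lives: gid, lid, and the loop counter sit in each thread's registers (the private toolbelt), the tile sits in shared memory (the workbench), and __syncthreads() is the moment the whole team waits at that workbench before continuing.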

When you launch a CUDA kernel, the programming model maps directly onto this hardware (a short sketch after the diagram below makes the mapping concrete):

  • The Grid is the entire workload, distributed across the whole GPU.

  • A Block is a chunk of work assigned to a single SM. All threads in a block can cooperate using that SM's shared memory.

  • A Thread is a single worker, executed by a CUDA core as part of a warp.

A diagram showing the hierarchy of Grid -> Block -> Thread, which maps to GPU -> SM -> Core.
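
To see that mapping in code, here is the canonical vector-addition example: the host picks the block size and grid size (how the workload is carved up), and each thread uses its built-in coordinates to claim exactly one element. This is a minimal illustrative sketch; the names vectorAdd, threadsPerBlock, and n are placeholders, not code from a specific library.

    #include <cuda_runtime.h>

    // Each thread performs one simple, repetitive task: add one pair of numbers.
    __global__ void vectorAdd(const float* a, const float* b, float* c, int n)
    {
        // The hardware mapping made visible: which block am I in (blockIdx),
        // how big is my block (blockDim), and which thread am I (threadIdx)?
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                         // guard against the final partial block
            c[i] = a[i] + b[i];
    }

    int main()
    {
        const int n = 1 << 20;             // about one million elements
        float *a, *b, *c;
        cudaMallocManaged(&a, n * sizeof(float));
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        int threadsPerBlock = 256;
        int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
        vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, n);  // the grid is the whole workload

        cudaDeviceSynchronize();           // wait for the factory to finish
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }

Each block of 256 threads is handed to one SM, and within the SM the block is carved into warps of 32 that the scheduler drives through the instruction stream.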

The Memory Superhighway

The difference in design philosophy is also starkly reflected in the memory system. A CPU's memory system is a network of backroads, optimized to get a single car to a specific address with minimal delay (low latency). A GPU's memory system is a massive, multi-lane superhighway, optimized to move an enormous volume of traffic all at once (high throughput).

This is why a GPU has its own dedicated, high-speed Global Memory (DRAM). While accessing any single piece of data from it is relatively slow, its wide bus allows it to deliver a torrent of data to all the SMs simultaneously, keeping the thousands of cores fed and busy.
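
In practice, feeding that superhighway well means having the 32 threads of a warp touch neighboring addresses, so the hardware can satisfy the whole warp with a few wide transactions; this is what CUDA calls coalesced access. The two toy kernels below (illustrative names only) each do one read and one write per thread, but in the strided version the warp's reads are scattered, so the bus carries far more traffic per useful byte:

    // Coalesced: thread i reads element i, so a warp covers one contiguous
    // stretch of global memory and uses the wide bus efficiently.
    __global__ void copyCoalesced(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Strided: adjacent threads read addresses far apart, so the same warp
    // needs many separate memory transactions for the same amount of data.
    __global__ void copyStrided(const float* in, float* out, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n)
            out[i] = in[i];
    }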

Your success as a CUDA programmer depends on your ability to structure problems for this assembly line: break your data into thousands of small, independent pieces, and define a simple, repetitive task that the army of cores can execute in parallel.


Funny Comment: "A CPU is a sports car: fast, agile, and perfect for a quick trip to the store. A GPU is a freight train: it takes a while to get going, but it can move a mountain of cargo all at once."
