Matrix multiplication

Essentials

A kernel is the piece of code executed on the CUDA device by a single CUDA thread.

Blocks and grids may ve 1D, 2D or 3D.

Each thread has access inside the kernel to certain variables that define its position:

gridDim, blockIdx, blockDim, threadIdx

Example: Square matrix multiplicaiton

A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs:

  • local, register usage

  • thread ID usage

  • memory data transfer API between host and device

  • assume square matrix for simplicity

  • leave shared memory usage until later

P = M * N of size WIDTH * WIDTH

Without tiling:

  • one thread calculates one element of P

  • M and N are loaded WIDTH times from global memory

Memory layout on matrix in C

memory layout in C

Step 1: Simple host version in C

matrix multiplication - host

Step 2: Input data transfer

circle-info

Lines 6,9,13 - this is double pointer

cudaMemcpyHostToDevice means move from CPU (global memory) to GPU mamory

After line 13 - there is no copy as this is just allocation of output memory (see next)

Step 3: Output data transfer

circle-info

line 5 - as this is "move back" to CPU code - P and Pd here are on different paces

Step 4: Kernel function

circle-info

in kernel code you cannot see i,j as they are "removed" by parallel threads (done automagically)

elementM, elementN are local registers

line 12 => Single Program Multiple Data

matrix multiplication - device

Step 5: Kernel Invocation

circle-info

line 2 - only one block will be used

line 3 - number of threads in block equals our outpput matrix

What is going on?

matrix multiplication - one block of threads

One block of threads to compute matrix Pd = each thread computes one element of Pd.

Each thread:

  • loads a row of matrix Md

  • loads a column of matrix Md

  • perform one multiply and addition for each pair Md and Nd elements

  • compute to off-chip memory access ration close to 1:1 (not very high)

Size of matrix limited by the number of threads allowed in a thread block.

Last updated