15-442/15-642: Machine Learning Systems

GEMM on Modern GPUs

Spring 2026
Tianqi Chen and Zhihao Jia
Carnegie Mellon University

Outline

  • Swizzle Deeper Dive
  • Tensor Cores
  • TMA (Tensor Memory Accelerator)
  • Putting It Together

Swizzle Deeper Dive

Recap: Simple 8x8 Swizzle
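
A minimal sketch of the XOR swizzle this recap refers to, under a simplified model that is an assumption here and not taken from the slides: an 8x8 tile where each physical column lives in its own bank. The helper name `swizzle_8x8` is hypothetical.

```python
def swizzle_8x8(row: int, col: int) -> int:
    """Physical column for logical (row, col): XOR the column with the row index."""
    return col ^ (row & 7)


# All 8 rows read the same logical column 3.
# Unswizzled, every access would land in physical column (bank) 3.
naive_banks = [3 for _ in range(8)]

# Swizzled, each row is redirected to a different physical column.
swizzled_banks = [swizzle_8x8(row, 3) for row in range(8)]

print(len(set(naive_banks)))     # 1 distinct bank: 8-way conflict
print(len(set(swizzled_banks)))  # 8 distinct banks: conflict-free
```

Because XOR with a fixed row value is a bijection on 0..7, each row still holds all eight columns, only permuted, so row-wise access stays conflict-free as well.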

Real Example: SWIZZLE_128B

SWIZZLE_128B: Bank Sector View

Swizzle Atoms: All Formats
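
A hedged sketch of how the 32B, 64B, and 128B swizzle modes can be modeled on byte offsets. It assumes the CuTe-style parameterization `Swizzle<B, 4, 3>` with B = 1, 2, 3, and an atom of 8 rows; the function and table names below are hypothetical and the slides may present the formats differently.

```python
# Number of XORed bits per mode (assumed CuTe-style Swizzle<B, 4, 3>).
SWIZZLE_BITS = {"32B": 1, "64B": 2, "128B": 3}


def swizzle_bytes(offset: int, bits: int, base: int = 4, shift: int = 3) -> int:
    """XOR `bits` address bits starting at bit (base + shift) into bits starting at `base`.

    With base=4 the unit that moves is a 16-byte chunk; with shift=3 the
    selector bits start at bit 7, i.e. they change every 128 bytes.
    """
    mask = ((1 << bits) - 1) << (base + shift)
    return offset ^ ((offset & mask) >> shift)


def atom_bytes(mode: str) -> int:
    """Bytes covered by one swizzle atom, assuming 8 rows of the mode's row width."""
    return 8 * int(mode[:-1])


# 128B mode: the first 16-byte chunk of each of 8 rows (128 bytes per row)
# lands in 8 different 16-byte chunk positions.
chunks_128 = [(swizzle_bytes(row * 128, SWIZZLE_BITS["128B"]) >> 4) & 7
              for row in range(8)]
print(chunks_128)
```

Applying `swizzle_bytes` twice returns the original offset, since the selector bits themselves are never modified; the mapping is therefore a permutation of the 16-byte chunks inside each atom.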

Swizzle Imposes Tiling Constraint

Tensor Cores

Tensor Core

Tensor Cores Benefit From Swizzling

TMA (Tensor Memory Accelerator)

TMA: Global to Shared Memory

3D TMA for Tiling and Swizzling

Putting It Together

Why Prefer 128B Swizzle?

GEMM Optimization

End-to-End GEMM Example