15-442/15-642: Machine Learning Systems
GEMM on Modern GPUs
Spring 2026
Tianqi Chen and Zhihao Jia
Carnegie Mellon University
Outline
- Swizzle Deeper Dive
- Tensor Cores
- TMA (Tensor Memory Accelerator)
- Putting It Together
Recap: Simple 8x8 Swizzle
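As a rough illustration of the recap, the sketch below assumes the simple 8x8 swizzle is the common XOR scheme, where logical element (row, col) is stored at physical column col XOR row. The function names are hypothetical and not taken from the slides; the point is only that one logical column, read across all 8 rows, spreads over 8 distinct physical columns.

```python
def swizzle_8x8(row, col):
    """Physical column for logical (row, col) in an 8x8 tile.

    Assumption: the swizzle is a plain XOR of the column index with the row index.
    """
    return col ^ row


def columns_touched(logical_col, swizzled):
    """Sorted physical columns hit when reading one logical column over 8 rows."""
    return sorted(
        {swizzle_8x8(r, logical_col) if swizzled else logical_col for r in range(8)}
    )


if __name__ == "__main__":
    print(columns_touched(3, swizzled=False))  # a single physical column
    print(columns_touched(3, swizzled=True))   # eight distinct physical columns
```

Because XOR with a fixed row is a bijection on 0..7, each row still holds every logical column exactly once, so row-wise reads are unaffected.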
Real Example: SWIZZLE_128B
SWIZZLE_128B: Bank Sector View
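The bank view can be checked with a small sketch. It assumes the usual description of SWIZZLE_128B: within a 1024-byte atom (8 rows of 128 bytes), the 16-byte chunk index (byte-offset bits 4 to 6) is XORed with the row index (bits 7 to 9). It also assumes 32 shared-memory banks of 4 bytes each, so one 128-byte row spans all banks and each 16-byte chunk covers a group of 4 banks. The helper names are hypothetical.

```python
def swizzle_128b(offset):
    """Apply the assumed 128B swizzle to a byte offset in shared memory.

    Bits 7..9 (row index within the atom) are XORed into bits 4..6
    (16-byte chunk index within the 128-byte row).
    """
    return offset ^ ((offset & 0x380) >> 3)


def bank_groups(chunk, swizzled):
    """16-byte bank group (0..7) hit by the same logical chunk in rows 0..7."""
    groups = []
    for row in range(8):
        offset = row * 128 + chunk * 16
        if swizzled:
            offset = swizzle_128b(offset)
        groups.append((offset % 128) // 16)
    return groups


if __name__ == "__main__":
    print(bank_groups(0, swizzled=False))  # every row lands in the same bank group
    print(bank_groups(0, swizzled=True))   # each row lands in a different bank group
```

Without the swizzle, a column-wise access to chunk 0 hits the same four banks in every row; with it, the eight rows cover all eight bank groups, which is the conflict-free pattern the bank view is meant to show.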
Swizzle Atoms: All Formats
Swizzle Imposes Tiling Constraint
Tensor Core Benefit From Swizzling
TMA: Global to Shared Memory
3D TMA for Tiling and Swizzling