GEMM Data Pipeline on Blackwell
D = A × B
GMEM (A, B)
  → TMA load → SMEM
  → tcgen05.mma → TMEM (accum)
  → tcgen05.ld → Reg
  → cast, store → SMEM
  → TMA store → GMEM (D)
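To make the order of the stages concrete, here is a minimal host-side sketch in plain Python that walks one output tile through the same path. It is an emulation only, not GPU code: the names `gemm_pipeline`, `to_fp16`, and `tile_k` are hypothetical, nested lists stand in for GMEM, SMEM, TMEM, and registers, and the FP16 output type is an assumption (the diagram says only "cast"). Python floats are double precision, so the accumulator here is wider than a real one would be.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def gemm_pipeline(A, B, tile_k=2):
    """Emulate the data path for one output tile of D = A x B.

    A is M x K and B is K x N, given as lists of lists (standing in for GMEM).
    """
    M, K, N = len(A), len(B), len(B[0])

    # "TMEM": the accumulator, kept at higher precision than the output.
    tmem = [[0.0] * N for _ in range(M)]

    for k0 in range(0, K, tile_k):
        k1 = min(k0 + tile_k, K)

        # "TMA load": copy one K-slice of A and B from GMEM into SMEM.
        smem_a = [row[k0:k1] for row in A]
        smem_b = [row[:] for row in B[k0:k1]]

        # "tcgen05.mma": multiply the SMEM operands, accumulate into TMEM.
        for i in range(M):
            for j in range(N):
                tmem[i][j] += sum(
                    smem_a[i][k] * smem_b[k][j] for k in range(k1 - k0)
                )

    # "tcgen05.ld": move the accumulator from TMEM into registers.
    regs = [row[:] for row in tmem]

    # "cast" and "store": narrow to FP16 (assumed) and stage the tile in SMEM.
    smem_d = [[to_fp16(v) for v in row] for row in regs]

    # "TMA store": write the staged tile back to GMEM.
    return [row[:] for row in smem_d]


A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 0], [0, 1], [1, 1]]
print(gemm_pipeline(A, B))  # [[4.0, 5.0], [10.0, 11.0]]
```

The sketch runs the stages one after another for clarity. It does not model the overlap between loads, MMA, and the epilogue that the hardware pipeline relies on.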
Blackwell Architecture
  Streaming Multiprocessor (SM) × N
    Tensor Core (tcgen05): 5th-gen MMA
    CUDA Core: FP/INT units
    Shared Memory (SMEM): 228 KB per SM
    Tensor Memory (TMEM): 128 lanes
    Register File
    TMA Engine: Data Mover
  Global Memory (GMEM)
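The 228 KB SMEM figure limits how many A/B tiles can be staged at once, because SMEM holds both the incoming operand tiles and the outgoing D tile. The following is a rough back-of-the-envelope sketch of that budget. The function `max_stages`, the tile shapes, and the 2-byte element sizes are illustrative assumptions, not values from the diagram, and the calculation ignores barriers, alignment padding, and any per-block limit below the per-SM figure.

```python
SMEM_BYTES = 228 * 1024  # per-SM shared memory figure from the diagram


def max_stages(bm, bn, bk, in_bytes, out_bytes, smem_bytes=SMEM_BYTES):
    """Estimate how many A/B load stages fit in SMEM next to one D tile.

    bm, bn, bk: hypothetical tile dimensions (A tile is bm x bk, B tile is bk x bn).
    in_bytes, out_bytes: assumed element sizes of the inputs and the output.
    """
    stage = (bm * bk + bk * bn) * in_bytes  # one A tile plus one B tile
    epilogue = bm * bn * out_bytes          # D tile staged before the TMA store
    return (smem_bytes - epilogue) // stage


print(max_stages(128, 128, 64, 2, 2))  # 6
print(max_stages(128, 256, 64, 2, 2))  # 3
```

With these assumed shapes, each 128 × 128 × 64 stage takes 32 KB and the D tile another 32 KB, which leaves room for six stages. Doubling the N dimension cuts that to three.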