15-442/15-642: Machine Learning Systems
Blackwell GPU and TIRx
Spring 2026
Tianqi Chen and Zhihao Jia
Carnegie Mellon University
Outline
- Blackwell Programming Model
- TIRx Programming Model
- Step 1 — Sync GEMM
- Step 4 — TMA Async
- Step 7 — Warp Specialization
- Step 9 — 2-CTA Cluster
- Summary
Blackwell SM Architecture
Example GEMM Data Pipeline on Blackwell
TMA: Tensor Memory Accelerator
Async Coordination: Barriers
MBarrier: Data Structure & APIs
MBarrier for Cross-Thread Synchronization
MBarrier to Signal TMA Completion
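As a rough illustration of the barrier slides above, here is a toy, single-threaded Python model of an arrival-count barrier with a phase bit and a pending-byte counter. It is a sketch under assumptions, not the hardware or TIRx API: the class name `ToyMBarrier` and its methods (`arrive`, `expect_tx`, `complete_tx`, `try_wait`) are hypothetical names chosen to mirror the concepts, and real waits block or spin instead of returning immediately.

```python
class ToyMBarrier:
    """Toy model: a phase completes when all expected arrivals have
    happened AND all expected async-copy bytes have landed."""

    def __init__(self, expected_arrivals):
        self.expected = expected_arrivals  # arrivals needed per phase
        self.pending = expected_arrivals   # arrivals still outstanding
        self.tx_bytes = 0                  # async-copy bytes still in flight
        self.phase = 0                     # parity of the phase in progress

    def expect_tx(self, nbytes):
        # A thread announces that an async copy of nbytes will signal here.
        self.tx_bytes += nbytes

    def complete_tx(self, nbytes):
        # The (simulated) copy engine reports nbytes as delivered.
        self.tx_bytes -= nbytes
        self._maybe_flip()

    def arrive(self):
        # One participant arrives at the barrier.
        self.pending -= 1
        self._maybe_flip()

    def _maybe_flip(self):
        # Advance to the next phase and re-arm the arrival count.
        if self.pending == 0 and self.tx_bytes == 0:
            self.phase ^= 1
            self.pending = self.expected

    def try_wait(self, parity):
        # True once the phase with this parity has completed, i.e. the
        # barrier has moved on to the opposite parity.
        return self.phase != parity


bar = ToyMBarrier(expected_arrivals=1)
bar.expect_tx(128)            # issuing thread expects a 128-byte copy
bar.arrive()                  # issuing thread arrives; bytes still pending
print(bar.try_wait(0))        # False: copy has not finished
bar.complete_tx(128)          # copy lands
print(bar.try_wait(0))        # True: phase 0 is complete
```

The parity argument is what lets the same barrier object be reused across loop iterations: a waiter alternates between waiting on parity 0 and parity 1.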
MBarrier to Signal tcgen05 Completion
What is TIRx: Native Level Access
Minimal Tile-Level Access
TIRx: Deterministic Op Dispatch
Step 1: Sync Load + MMA + Sync Store
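The load, multiply-accumulate, store structure named in this step can be sketched in plain Python. This is only a host-side analogy under assumptions: the function name `tiled_gemm` and the `tile` parameter are hypothetical, the local lists stand in for on-chip staging buffers, and nothing here is TIRx or GPU code.

```python
def tiled_gemm(A, B, tile=2):
    """Compute C = A @ B one output tile at a time, fully synchronously."""
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, tile):
        for n0 in range(0, N, tile):
            m1, n1 = min(m0 + tile, M), min(n0 + tile, N)
            # Accumulator for one output tile.
            acc = [[0.0] * (n1 - n0) for _ in range(m1 - m0)]
            for k0 in range(0, K, tile):
                k1 = min(k0 + tile, K)
                # "Sync load": copy the A and B tiles into local buffers.
                a_tile = [row[k0:k1] for row in A[m0:m1]]
                b_tile = [row[n0:n1] for row in B[k0:k1]]
                # "MMA": multiply-accumulate the two tiles.
                for i in range(m1 - m0):
                    for j in range(n1 - n0):
                        for k in range(k1 - k0):
                            acc[i][j] += a_tile[i][k] * b_tile[k][j]
            # "Sync store": write the finished tile back to C.
            for i in range(m1 - m0):
                C[m0 + i][n0:n1] = acc[i]
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(tiled_gemm(A, B))  # [[58.0, 64.0], [139.0, 154.0]]
```

Each phase finishes before the next begins, so the loader and the multiplier never work at the same time. That idle time is what asynchronous copies and warp specialization aim to remove.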
Step 4: TMA Async Store (Writeback)
Pipeline: Overlap Load, Compute, Writeback
Warp Specialization: Overlap Everything
Step 7: Warp Specialization — Code
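The producer/consumer structure behind warp specialization can be imitated on the host with two Python threads and a small ring of staging slots. This is a sketch by analogy only: the name `run_pipeline`, the `stages` parameter, and the use of `threading.Semaphore` as stand-ins for "full" and "empty" barriers are all assumptions for illustration, and `sum` stands in for the real compute step.

```python
import threading


def run_pipeline(tiles, stages=2):
    """One thread 'loads' tiles into a ring buffer while another 'computes'."""
    slots = [None] * stages
    # full[s] is signaled when slot s holds data; empty[s] when it is free.
    full = [threading.Semaphore(0) for _ in range(stages)]
    empty = [threading.Semaphore(1) for _ in range(stages)]
    results = []

    def producer():
        # Plays the role of the load warp.
        for i, tile in enumerate(tiles):
            s = i % stages
            empty[s].acquire()      # wait until the consumer released slot s
            slots[s] = tile         # "load" the tile into the staging slot
            full[s].release()       # signal that slot s is ready

    def consumer():
        # Plays the role of the compute warp.
        for i in range(len(tiles)):
            s = i % stages
            full[s].acquire()       # wait until slot s holds data
            results.append(sum(slots[s]))  # stand-in for the real compute
            empty[s].release()      # hand the slot back to the producer

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results


print(run_pipeline([[1, 2], [3, 4], [5, 6]], stages=2))  # [3, 7, 11]
```

With `stages` greater than one, the producer can fill the next slot while the consumer is still working on the current one, which is the overlap the pipeline slides describe.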
Distributed Shared Memory
Asymmetric Roles: CTA-0 vs CTA-1
Step 9: 2-CTA Cluster — Code