15-442/15-642: Machine Learning Systems

Blackwell GPU and TIRx

Spring 2026
Tianqi Chen and Zhihao Jia
Carnegie Mellon University

Outline

  • Blackwell Programming Model
  • TIRx Programming Model
  • Step 1 — Sync GEMM
  • Step 4 — TMA Async
  • Step 7 — Warp Specialization
  • Step 9 — 2-CTA Cluster
  • Summary

Blackwell SM Architecture

Example GEMM Data Pipeline on Blackwell

tcgen05 & Tensor Memory

TMA: Tensor Memory Accelerator

Async Coordination: Barriers

MBarrier: Data Structure & APIs

MBarrier for Cross-Thread Synchronization

MBarrier to Signal TMA Completion

MBarrier to Signal tcgen05 Completion

Phase Tracking
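
The barrier slides above can be illustrated with a toy, single-threaded Python model of how an mbarrier's phase advances. This is only a sketch under stated assumptions: `ToyMBarrier` and its method names are hypothetical and are not the TIRx or CUDA API. It models one expected-arrival counter, one pending transaction-byte counter (as used for TMA completion), and a one-bit phase parity that waiters compare against.

```python
class ToyMBarrier:
    """Toy model of mbarrier phase bookkeeping (hypothetical, not a real API)."""

    def __init__(self, expected_arrivals):
        self.expected = expected_arrivals  # arrivals required per phase
        self.pending = expected_arrivals   # arrivals still missing in this phase
        self.tx_bytes = 0                  # async bytes still in flight
        self.phase = 0                     # parity of the phase in progress

    def _maybe_complete(self):
        # A phase completes only when all arrivals and all bytes are accounted for.
        if self.pending == 0 and self.tx_bytes == 0:
            self.phase ^= 1                # flip parity: the next phase begins
            self.pending = self.expected   # re-arm for the next phase

    def expect_tx(self, nbytes):
        # Announce that an async copy will deliver nbytes to this barrier.
        self.tx_bytes += nbytes

    def arrive(self):
        # One thread-side arrival.
        self.pending -= 1
        self._maybe_complete()

    def complete_tx(self, nbytes):
        # The async engine reports nbytes as delivered.
        self.tx_bytes -= nbytes
        self._maybe_complete()

    def try_wait(self, parity):
        # True once the phase with the given parity has completed.
        return self.phase != parity


bar = ToyMBarrier(expected_arrivals=1)
parity = 0                        # the consumer tracks which phase it waits on
bar.expect_tx(1024)               # producer announces a 1024-byte async load
bar.arrive()                      # producer arrives; bytes are still in flight
ready_before = bar.try_wait(parity)
bar.complete_tx(1024)             # the load lands, so phase 0 completes
ready_after = bar.try_wait(parity)
parity ^= 1                       # consumer flips its parity for the next round
ready_next = bar.try_wait(parity)
print(ready_before, ready_after, ready_next)
```

In this model the consumer flips its local parity after every successful wait, so a barrier reused across pipeline iterations is not mistaken for one that has already completed. The one-bit comparison assumes a waiter never falls more than one phase behind.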

What Is TIRx: Native-Level Access

Native-Level Access

Minimal Tile-Level Access

TIRx: Deterministic Op Dispatch

Step 1: Sync Load + MMA + Sync Store

Step 1: Full Code

Step 4: TMA Async Load

Step 4: TMA Async Store (Writeback)

Step 4: Full Code

Pipeline: Overlap Load, Compute, Writeback

Warp Specialization: Overlap Everything

Step 7: Warp Specialization — Code

Why Cluster?

Distributed Shared Memory

Asymmetric Roles: CTA-0 vs CTA-1

Step 9: 2-CTA Cluster — Code

The Optimization Journey