The Optimization Journey

From 0.02 to 1322 TFLOP/s in 10 steps of TIRx
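
As a quick sanity check on the headline numbers, the short Python sketch below works out the overall speedup and an average per-step gain. The start and end figures come from the line above; treating the ten steps as contributing evenly (a geometric mean) is an assumption made purely for illustration, and individual steps will differ.

```python
# Headline figures taken from the subtitle above.
start_tflops = 0.02
final_tflops = 1322.0
steps = 10

total_speedup = final_tflops / start_tflops

# Assumption for illustration only: every step contributes the same factor.
per_step = total_speedup ** (1.0 / steps)

print(f"total speedup: {total_speedup:,.0f}x")
print(f"average per-step gain: {per_step:.2f}x")
```

That works out to roughly 66,100× overall, or about 3× per step on average.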

Performance Journey

🔄
TMA: Hardware Data Mover
Let hardware move the data: one thread issues the copy, freeing the other 127
tcgen05.mma: Tensor Memory
Accumulating in TMEM avoids register pressure; a single thread issues the MMA
🔀
Warp Specialization
Dedicated roles: load / compute / writeback overlap
🤝
2SM Cluster
2× compute per SMEM tile, matching cuBLAS
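
The last two cards can be made concrete with a toy model. This is a minimal Python sketch, not TIRx code: the per-tile stage costs, the tile shape, and the function names are all hypothetical, and the 2SM calculation reflects one reading of "2× compute per SMEM tile" (each SM keeps staging the same-sized A and B tiles in shared memory, while the pair multiplies an operand twice as tall by one twice as wide).

```python
def serial_time(n_tiles, load, compute, store):
    # One group of warps performs every stage in turn for each tile.
    return n_tiles * (load + compute + store)


def specialized_time(n_tiles, load, compute, store):
    # Idealized pipeline: dedicated load / compute / writeback warps run
    # concurrently, so once the first tile has filled the pipeline each
    # further tile costs only the slowest stage. Assumes enough buffering.
    return (load + compute + store) + (n_tiles - 1) * max(load, compute, store)


def flops_per_sm(m, n, k, sms_in_cluster):
    # Each SM stages an m x k tile of A and an n x k tile of B in SMEM.
    # A cluster of c SMs multiplies (c*m x k) by the transpose of (c*n x k).
    c = sms_in_cluster
    total_flops = 2 * (c * m) * (c * n) * k
    return total_flops // c


# Hypothetical per-tile costs in arbitrary time units.
t_serial = serial_time(64, load=3, compute=4, store=1)
t_overlap = specialized_time(64, load=3, compute=4, store=1)

# Hypothetical tile shape; the ratio does not depend on it.
ratio_2sm = flops_per_sm(128, 128, 64, 2) / flops_per_sm(128, 128, 64, 1)

print(t_serial, t_overlap, ratio_2sm)
```

With these made-up costs the overlapped schedule approaches the cost of the compute stage alone, and under the stated reading the 2SM ratio comes out to 2 for any tile shape.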