🔄
TMA: Hardware Data Mover
One thread issues the copy and the hardware moves the data, freeing the other 127 threads
⚡
tcgen05.mma: Tensor Memory
Accumulators live in TMEM, relieving register pressure; a single thread issues the MMA
🔀
Warp Specialization
Dedicated warps for load, compute, and writeback, so the three stages overlap
🤝
2SM Cluster
2× compute per SMEM tile, to match cuBLAS
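To make the warp-specialization card concrete, here is a minimal sketch of the load / compute / writeback split in plain CUDA. It is an illustration under simplifying assumptions, not the kernel these slides describe: it uses ordinary global loads and `__syncthreads()` where a real Blackwell kernel would use TMA, mbarriers, and `tcgen05.mma`, and the names `warp_specialized_scale`, `run_scale`, `TILE`, and `NUM_ROLES` are hypothetical. The "compute" stage is just a scalar multiply standing in for the MMA.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 32;      // one element per lane (assumed tile size)
constexpr int NUM_ROLES = 3;  // warp 0 = load, warp 1 = compute, warp 2 = writeback

// Each block walks its tiles as a 3-stage software pipeline with double buffers:
// at step s the loader fetches tile s, the compute warp works on tile s-1,
// and the writeback warp stores tile s-2.
__global__ void warp_specialized_scale(const float* in, float* out, int n, float alpha) {
    __shared__ float in_buf[2][TILE];
    __shared__ float out_buf[2][TILE];

    const int role = threadIdx.x / 32;
    const int lane = threadIdx.x % 32;
    const int bid  = static_cast<int>(blockIdx.x);
    const int nblk = static_cast<int>(gridDim.x);
    const int num_tiles = (n + TILE - 1) / TILE;
    // Tiles owned by this block: bid, bid + nblk, bid + 2*nblk, ...
    const int my_tiles = (num_tiles > bid) ? (num_tiles - bid + nblk - 1) / nblk : 0;

    for (int step = 0; step < my_tiles + 2; ++step) {
        if (role == 0 && step < my_tiles) {
            // Load stage: global -> shared (stand-in for a TMA copy).
            const int idx = (bid + step * nblk) * TILE + lane;
            in_buf[step % 2][lane] = (idx < n) ? in[idx] : 0.0f;
        } else if (role == 1 && step >= 1 && step - 1 < my_tiles) {
            // Compute stage: shared -> shared (stand-in for the MMA).
            const int s = step - 1;
            out_buf[s % 2][lane] = alpha * in_buf[s % 2][lane];
        } else if (role == 2 && step >= 2) {
            // Writeback stage: shared -> global.
            const int s = step - 2;
            const int idx = (bid + s * nblk) * TILE + lane;
            if (idx < n) out[idx] = out_buf[s % 2][lane];
        }
        // One block-wide barrier per step hands buffers to the next stage.
        __syncthreads();
    }
}

// Host helper (hypothetical); CUDA error checking omitted for brevity.
std::vector<float> run_scale(const std::vector<float>& h_in, float alpha) {
    const int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n, 0.0f);
    if (n == 0) return h_out;
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    warp_specialized_scale<<<4, NUM_ROLES * 32>>>(d_in, d_out, n, alpha);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Because each stage touches a different half of the double buffers in any given step, the three warps can run concurrently between barriers. A production kernel would likely replace the block-wide barrier with per-buffer full/empty barriers so that a slow stage does not stall the others.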