Code Regions
Overview
Step 4 uses TMA (Tensor Memory Accelerator) for asynchronous global→SMEM loads, replacing the
synchronous Tx.copy. A single elected thread issues all TMA loads and MMA instructions. Two barriers coordinate the work: tma_bar (load→compute) signals that a load has landed and compute may start, and mma_bar signals that compute is done.
Key change from Step 1:
Tx.copy → Tx.copy_async(dispatch="tma")
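The two-barrier handshake described above can be modeled on the CPU to make the ordering concrete. The sketch below is only an analogy in plain Python: it does not use the Tx API or real TMA hardware. The function name tiled_dot, the smem dictionary, and the use of semaphores in place of hardware barriers are all assumptions made for illustration. A loader thread stands in for the asynchronous copy engine, and the main thread stands in for the elected thread that waits on tma_bar before computing and signals mma_bar afterwards.

```python
import threading


def tiled_dot(a, b, tile=4):
    """Dot product computed tile by tile through a single staging buffer.

    Hypothetical CPU analogy of the tma_bar / mma_bar handshake;
    not Tx code and not a model of real TMA timing.
    """
    assert len(a) == len(b)
    smem = {"a": [], "b": []}         # stand-in for the shared-memory staging buffer
    tma_bar = threading.Semaphore(0)  # released when a tile has landed (load -> compute)
    mma_bar = threading.Semaphore(1)  # released when the buffer may be reused (compute done)
    starts = range(0, len(a), tile)

    def loader():
        # Plays the role of the asynchronous copy engine.
        for k in starts:
            mma_bar.acquire()         # wait until the previous tile has been consumed
            smem["a"] = a[k:k + tile]
            smem["b"] = b[k:k + tile]
            tma_bar.release()         # signal: data is ready

    worker = threading.Thread(target=loader)
    worker.start()

    acc = 0
    for _ in starts:
        tma_bar.acquire()             # wait for the load to complete
        acc += sum(x * y for x, y in zip(smem["a"], smem["b"]))
        mma_bar.release()             # signal: compute done, buffer is free
    worker.join()
    return acc


if __name__ == "__main__":
    print(tiled_dot([1, 2, 3, 4, 5], [1, 1, 1, 1, 1], tile=2))
```

Because mma_bar starts in the released state, the first load can proceed immediately; after that, each load waits for the previous compute and each compute waits for its load, which is the same dependency chain the two barriers enforce in the kernel.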