Step 4: TMA Async — Full Kernel Code



  

Code Regions

Overview
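Step 4 uses the TMA (Tensor Memory Accelerator) for asynchronous global→SMEM loads, replacing the synchronous Tx.copy. A single elected thread issues all TMA loads and all MMA instructions. Two barriers coordinate the pipeline: tma_bar (load→compute) signals that the loaded tiles have arrived in SMEM, and mma_bar (compute done) signals that the MMA has finished.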
Key change from Step 1: Tx.copy → Tx.copy_async(dispatch="tma")
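
The two-barrier handshake described in the overview can be illustrated with a CPU-side analogy. The sketch below is plain Python, not the Tx API and not GPU code: threading.Event objects stand in for tma_bar and mma_bar, worker threads stand in for the asynchronous TMA and MMA units, and the names run_pipeline, matmul_tile, smem, and TILE_K are hypothetical. It assumes one barrier pair per K tile and no multi-stage buffering.

```python
import threading

TILE_K = 4  # hypothetical tile depth along K


def matmul_tile(a, b, acc):
    """Accumulate acc += a @ b for small list-of-lists tiles."""
    for i in range(len(a)):
        for j in range(len(b[0])):
            acc[i][j] += sum(a[i][k] * b[k][j] for k in range(len(b)))


def run_pipeline(A, B):
    """Multiply A (M x K) by B (K x N), one K tile at a time."""
    M, K, N = len(A), len(B), len(B[0])
    smem = {}  # stands in for the shared-memory staging buffers
    acc = [[0] * N for _ in range(M)]

    for k0 in range(0, K, TILE_K):
        tma_bar = threading.Event()  # load -> compute
        mma_bar = threading.Event()  # compute done

        def async_load():
            # Analogy for the async global -> SMEM copy.
            smem["A"] = [row[k0:k0 + TILE_K] for row in A]
            smem["B"] = B[k0:k0 + TILE_K]
            tma_bar.set()  # tiles have landed

        def async_mma():
            # Analogy for the asynchronously executing MMA.
            matmul_tile(smem["A"], smem["B"], acc)
            mma_bar.set()  # accumulation for this tile is complete

        # The "elected thread" issues the load; the call returns immediately.
        threading.Thread(target=async_load).start()
        tma_bar.wait()  # do not compute on tiles that have not arrived
        # The same thread then issues the compute step.
        threading.Thread(target=async_mma).start()
        mma_bar.wait()  # staging buffers may be overwritten only after this

    return acc
```

The point of the analogy is ordering: issuing the copy does not block, so the consumer must wait on tma_bar before reading the staged tiles, and the next load must wait on mma_bar before reusing the staging buffers.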