Step 4: Full Code

Code Regions

Overview

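Step 4 uses the TMA (Tensor Memory Accelerator) to load data from global memory into shared memory (SMEM) asynchronously, replacing the synchronous Tx.copy. A single elected thread issues all TMA loads and MMA instructions. Two barriers coordinate the pipeline: tma_bar signals that a load has landed in SMEM and compute may begin (load→compute), and mma_bar signals that the MMA work has completed (compute done).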

Key change from Step 1: Tx.copy → Tx.copy_async(dispatch="tma")
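
The two-barrier handoff described in the overview can be pictured with a small CPU-side analogy. This is only a sketch under stated assumptions, not the kernel itself and not the Tx API: threading.Event stands in for tma_bar and mma_bar, a helper thread stands in for the TMA engine, and the names run_tile_pipeline, tma_load, elected_thread, and epilogue are hypothetical. Real hardware barriers typically track phases rather than being cleared by hand, so treat the clear() call as a simplification.

```python
import threading


def run_tile_pipeline(a_tiles, b_tiles):
    """CPU analogy of the load -> compute -> done handoff (illustrative only)."""
    tma_bar = threading.Event()   # set when a tile has landed in "SMEM"
    mma_bar = threading.Event()   # set when all compute has finished
    smem = {}                     # stand-in for shared memory
    acc = [0]                     # stand-in for the accumulator
    result = []

    def tma_load(k):
        # Stand-in for the asynchronous copy engine: fill "SMEM", then signal.
        smem["a"] = a_tiles[k]
        smem["b"] = b_tiles[k]
        tma_bar.set()

    def elected_thread():
        # One thread issues every load and every multiply-accumulate.
        for k in range(len(a_tiles)):
            tma_bar.clear()                      # simplification of a phase flip
            loader = threading.Thread(target=tma_load, args=(k,))
            loader.start()                       # issue the load; returns at once
            tma_bar.wait()                       # load -> compute
            acc[0] += sum(x * y for x, y in zip(smem["a"], smem["b"]))
            loader.join()
        mma_bar.set()                            # compute done

    def epilogue():
        # Every other consumer waits for mma_bar before reading the result.
        mma_bar.wait()
        result.append(acc[0])

    threads = [threading.Thread(target=epilogue),
               threading.Thread(target=elected_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


if __name__ == "__main__":
    print(run_tile_pipeline([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The point of the sketch is the ordering: the issuing thread never reads the staged tile before tma_bar fires, and no consumer reads the accumulator before mma_bar fires.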