Step 4: TMA Async Load

Key change: all-thread sync copy → single-thread TMA dispatch with barrier tracking

Step 3 — Sync Load (all 128 threads)
Step 4 — TMA Async (single thread)
HW DMA
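copy_async(dispatch="tma") offloads the copy to the TMA engine, a hardware DMA unit. One thread issues the transfer, the other 127 threads are freed from copy work, and the engine can keep the memory bus saturated.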
Barrier Sync
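arrive.expect_tx(bytes) tells the barrier how many bytes the TMA will write. try_wait(phase) waits until that phase completes, i.e. until the arrival is counted and all expected bytes have landed. The phase bit flips each iteration, so the same barrier is reused across iterations.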
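The barrier protocol can be illustrated with a small host-side model. This is a sketch in plain Python, not GPU code and not the API of any real library: the class `MBarrierModel`, the method `complete_tx` (standing in for the TMA engine reporting written bytes), the function `run_pipeline`, and the tile and chunk sizes are all hypothetical names and values chosen for illustration. It assumes a barrier with a single expected arrival and a one-bit phase.

```python
class MBarrierModel:
    """Toy model of a transaction-counting barrier with a one-bit phase."""

    def __init__(self, arrival_count=1):
        self.arrival_count = arrival_count
        self.pending_arrivals = arrival_count
        self.pending_tx = 0   # bytes still expected in the current phase
        self.phase = 0        # parity of the phase currently in progress

    def arrive_expect_tx(self, nbytes):
        # Register the expected bytes first, then count the arrival,
        # so the phase cannot complete before the bytes are accounted for.
        self.pending_tx += nbytes
        self.pending_arrivals -= 1
        self._maybe_complete()

    def complete_tx(self, nbytes):
        # Hypothetical stand-in for the copy engine reporting written bytes.
        self.pending_tx -= nbytes
        self._maybe_complete()

    def _maybe_complete(self):
        if self.pending_arrivals == 0 and self.pending_tx == 0:
            self.phase ^= 1                        # phase flips on completion
            self.pending_arrivals = self.arrival_count

    def try_wait(self, phase):
        # True once the phase with the given parity has completed.
        return self.phase != phase


def run_pipeline(num_tiles, tile_bytes, chunk_bytes):
    """Single issuing thread: expect bytes, wait for them, flip the phase."""
    bar = MBarrierModel()
    phase = 0
    log = []
    for k in range(num_tiles):
        bar.arrive_expect_tx(tile_bytes)
        assert not bar.try_wait(phase)             # data has not landed yet
        sent = 0
        while sent < tile_bytes:                   # engine writes in chunks
            n = min(chunk_bytes, tile_bytes - sent)
            bar.complete_tx(n)
            sent += n
        assert bar.try_wait(phase)                 # all bytes landed
        log.append((k, phase))
        phase ^= 1                                 # wait on the other parity next
    return log


if __name__ == "__main__":
    print(run_pipeline(num_tiles=3, tile_bytes=4096, chunk_bytes=1024))
```

The model shows why the waiter tracks its own phase value: after a phase completes, the barrier resets itself for the next iteration, so the waiter has to flip the parity it waits on to tell "tile k has landed" apart from "tile k-1 had already landed".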
Single Thread
tid == 0 issues both the TMA load and the MMA, one after the other, so there is no thread-divergence overhead.