Step 4: TMA Async Load

Key change: all-thread sync copy → single-thread TMA dispatch with barrier tracking

HW DMA

copy_async(dispatch="tma") offloads to the TMA engine — 127 threads freed, memory bus saturated

Barrier Sync

arrive.expect_tx(bytes) tells barrier how many bytes TMA will write. try_wait(phase) blocks until done. Phase flips each iteration.

Single Thread

tid == 0 issues both TMA load and MMA sequentially — no thread divergence overhead