Key change: all-thread sync copy → single-thread TMA dispatch with barrier tracking
copy_async(dispatch="tma") offloads the copy to the TMA engine: one thread issues it, the other 127 threads are freed, and the hardware keeps the memory bus saturated
arrive.expect_tx(bytes) tells the barrier how many bytes the TMA will write.
try_wait(phase) blocks until those bytes have arrived and the given phase completes. The phase flips each iteration.
tid == 0 issues both TMA load and MMA sequentially — no thread divergence overhead
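
A toy Python model of the barrier bookkeeping described above may help make the phase flip concrete. It is only a sketch of the semantics, not the DSL's API: TxBarrier, complete_tx, run_pipeline, and the 128x64 fp16 tile are hypothetical names and values chosen for illustration, and the "TMA engine" is simulated by two plain function calls.

```python
class TxBarrier:
    """Toy model of a byte-counting barrier with a one-bit phase (hypothetical)."""

    def __init__(self):
        self.phase = 0          # parity of the phase currently in flight
        self.pending_bytes = 0  # bytes still expected in this phase
        self.armed = False      # True once expect_tx has been called for this phase

    def arrive_expect_tx(self, nbytes):
        # The issuing thread announces how many bytes the copy will write.
        self.pending_bytes += nbytes
        self.armed = True

    def complete_tx(self, nbytes):
        # Simulated copy engine reporting that nbytes have landed.
        self.pending_bytes -= nbytes
        if self.armed and self.pending_bytes == 0:
            self.phase ^= 1     # phase completes, parity flips
            self.armed = False

    def try_wait(self, phase):
        # True once the phase with the given parity has completed.
        return self.phase != phase


def run_pipeline(num_tiles, tile_bytes):
    """One issuer drives every tile: arm the barrier, 'copy', wait, flip phase."""
    bar = TxBarrier()
    phase = 0
    log = []
    for k in range(num_tiles):
        bar.arrive_expect_tx(tile_bytes)
        ready_before = bar.try_wait(phase)   # copy not finished yet
        # The simulated engine delivers the tile in two chunks.
        bar.complete_tx(tile_bytes // 2)
        bar.complete_tx(tile_bytes - tile_bytes // 2)
        ready_after = bar.try_wait(phase)    # all expected bytes have landed
        log.append((k, phase, ready_before, ready_after))
        phase ^= 1                           # the waiter flips its phase each iteration
    return log


# Assumed tile: 128 x 64 elements of 2-byte fp16.
TILE_BYTES = 128 * 64 * 2
for entry in run_pipeline(3, TILE_BYTES):
    print(entry)
```

Each log entry is (iteration, phase waited on, ready before the copy, ready after the copy). The waiter's phase alternates 0, 1, 0, which is why the same barrier can be reused across iterations without being reinitialized.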