Async Coordination: Barriers

Hardware units signal an mbarrier on completion; warp groups wait on it before proceeding.
GMEM (A, B) -> TMA load -> SMEM -> tcgen05.mma -> TMEM (accum) -> tcgen05.ld -> Reg (cast) -> store -> SMEM -> TMA store -> GMEM (D)
Warp 0 (producer): TMA dispatch
Warp 1 (consumer): tcgen05 dispatch
Warp 0 (next phase): TMA dispatch
Coordination: TMA Engine -> complete-tx -> mbar (TMA ready); Tensor Core -> mbar (SMEM free)
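The two-barrier handshake on this slide can be sketched as a small host-side C++ model. It is not device code. A real kernel would keep mbarrier objects in shared memory and let the TMA engine and Tensor Core perform the arrivals. The names `PhaseBarrier`, `PipelineResult`, and `run_pipeline` are hypothetical. The model assumes one shared stage and an arrival count of 1, and it tracks only the phase parity, not transaction byte counts. It also assumes the producer's first wait uses inverted parity so that it passes on a freshly initialized barrier.

```cpp
#include <cassert>
#include <cstdint>

// Host-side stand-in for an mbarrier with an arrival count of 1 (assumption:
// only the phase bit is modelled, not transaction counts or real hardware).
struct PhaseBarrier {
    std::uint32_t phase = 0;

    // Complete the current phase, as a hardware unit does when its work is done.
    void arrive() { ++phase; }

    // Non-blocking poll: true once the phase with this parity has completed.
    bool try_wait(std::uint32_t parity) const { return (phase & 1u) != parity; }
};

struct PipelineResult {
    long long accum;      // sum of every tile the consumer read
    int producer_stalls;  // polls where the producer found the stage still busy
};

// Hypothetical helper: single-threaded walk through `tiles` iterations over one
// shared stage. The producer is polled up to twice per round, so its second
// poll hits a stage that the consumer has not released yet.
PipelineResult run_pipeline(int tiles) {
    assert(tiles >= 0);
    PhaseBarrier tma_ready;  // "TMA ready": arrival stands in for complete-tx
    PhaseBarrier smem_free;  // "SMEM free": arrival stands in for MMA completion
    int smem = 0;            // the single shared stage
    PipelineResult r{0, 0};
    int p = 0;               // tiles produced so far
    int c = 0;               // tiles consumed so far

    while (c < tiles) {
        // Producer role (Warp 0): refill the stage only once it has been freed.
        for (int attempt = 0; attempt < 2 && p < tiles; ++attempt) {
            const std::uint32_t parity = (static_cast<std::uint32_t>(p) & 1u) ^ 1u;
            if (smem_free.try_wait(parity)) {
                smem = p + 1;        // "TMA load" writes tile p
                tma_ready.arrive();  // data has landed
                ++p;
            } else {
                ++r.producer_stalls; // stage still in use, try again later
            }
        }
        // Consumer role (Warp 1): read the stage only once its data has landed.
        if (tma_ready.try_wait(static_cast<std::uint32_t>(c) & 1u)) {
            r.accum += smem;         // "MMA" reads tile c
            smem_free.arrive();      // stage may now be overwritten
            ++c;
        }
    }
    return r;
}
```

Calling `run_pipeline(4)` walks four tiles through the single stage. Every extra producer poll is rejected until the consumer has arrived on `smem_free`, which is the property that keeps the next-phase TMA dispatch from overwriting a tile that is still being read.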