CTA-0 is the coordinator: CTA-1 only loads data and writes back its own results
| | TMA Load | Barrier Signal | MMA | Writeback |
|---|---|---|---|---|
| CTA-0 | Load A₀, B₀: TMA into own SMEM | `arrive()`: accumulate both CTAs' bytes into own `tma2mma` barrier | `tcgen05.mma`: reads both SMEMs via `cta_group=2` | Writeback: TMEM → SMEM → GMEM |
| CTA-1 | Load A₁, B₁: TMA into own SMEM | ✗ (CTA-0 handles arrive) | ✗ (CTA-0 drives MMA) | Writeback: TMEM → SMEM → GMEM |

    tma2mma_cta0 = tma2mma.remote_view(0)
    if cbx == 0:
        tma2mma_cta0.arrive(stage, CTA_GROUP * byte_count)

    if cbx == 0:
        Tx.gemm_async(..., cta_group=CTA_GROUP)
        mma2tma.arrive(stage, cta_group=2, cta_mask=3)
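
The byte accounting behind `CTA_GROUP * byte_count` can be illustrated with a small host-side model in plain Python. This is only a sketch under stated assumptions: `TxBarrier`, `ctas_in_mask`, and `run_stage` are hypothetical names invented for illustration, not part of the DSL above or of any GPU API. The model assumes that the barrier completes a phase once its single arrival has happened and every expected byte has landed, and that `cta_mask` is a bitmask with one bit per CTA rank.

```python
from dataclasses import dataclass
from typing import List

CTA_GROUP = 2  # two CTAs cooperate on one MMA, as in the snippet above


@dataclass
class TxBarrier:
    """Toy model of a transaction-counting barrier (hypothetical, not a real API)."""

    pending_arrivals: int = 1  # arrivals still required in the current phase
    pending_bytes: int = 0     # bytes still expected in the current phase
    phase: int = 0             # flips each time the barrier completes

    def arrive_expect_bytes(self, nbytes: int) -> None:
        # One arrival that also registers how many bytes must still land.
        self.pending_arrivals -= 1
        self.pending_bytes += nbytes
        self._maybe_complete()

    def complete_bytes(self, nbytes: int) -> None:
        # Models a finished TMA copy reporting its byte count to the barrier.
        self.pending_bytes -= nbytes
        self._maybe_complete()

    def _maybe_complete(self) -> None:
        if self.pending_arrivals == 0 and self.pending_bytes == 0:
            self.phase ^= 1
            self.pending_arrivals = 1  # re-arm for the next phase


def ctas_in_mask(cta_mask: int, group_size: int = CTA_GROUP) -> List[int]:
    # Assumption: bit i of cta_mask selects CTA rank i.
    return [rank for rank in range(group_size) if (cta_mask >> rank) & 1]


def run_stage(byte_count: int) -> TxBarrier:
    # Only CTA-0's barrier is used; both CTAs' loads report to it.
    tma2mma_cta0 = TxBarrier()
    # CTA-1's load may land before CTA-0 has arrived; the count goes negative.
    tma2mma_cta0.complete_bytes(byte_count)
    # CTA-0 arrives once and expects the bytes of both CTAs.
    tma2mma_cta0.arrive_expect_bytes(CTA_GROUP * byte_count)
    assert tma2mma_cta0.phase == 0  # CTA-0's own load is still outstanding
    tma2mma_cta0.complete_bytes(byte_count)
    return tma2mma_cta0


barrier = run_stage(byte_count=4096)
print(barrier.phase)       # 1: the phase flipped after both loads landed
print(ctas_in_mask(3))     # [0, 1]: both CTAs are signaled
```

In this model, if CTA-0 expected only `byte_count`, the barrier would flip after the first load, and the MMA could start reading the other CTA's shared memory before its data had arrived. That appears to be the reason the expected byte count is scaled by `CTA_GROUP`.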