Asymmetric Roles

CTA-0 is the coordinator; CTA-1 only loads data and writes back its own results

Stages: TMA Load → Barrier Signal → MMA → Writeback

CTA-0
  TMA Load: Load A₀, B₀ (TMA into own SMEM)
  Barrier Signal: arrive() (accumulate both CTAs' bytes into own tma2mma barrier)
  MMA: tcgen05.mma (reads both SMEMs via cta_group=2)
  Writeback: TMEM → SMEM → GMEM

CTA-1
  TMA Load: Load A₁, B₁ (TMA into own SMEM)
  Barrier Signal: none (CTA-0 handles arrive)
  MMA: none (CTA-0 drives MMA)
  Writeback: TMEM → SMEM → GMEM

remote_view
tma2mma_cta0 = tma2mma.remote_view(0)
Both CTAs' TMA loads target CTA-0's barrier address. CTA-1 does not have its own tma2mma barrier.
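
The redirection described above can be pictured with a small host-side model. This is only an illustrative sketch in plain Python, not the kernel DSL: ToyBarrier, ToyCluster and the byte_count value are hypothetical names and numbers invented for the example, and the model assumes remote_view(rank) simply hands back the barrier owned by the given CTA.

```python
class ToyBarrier:
    """Hypothetical stand-in for one CTA's shared-memory barrier."""

    def __init__(self, owner_rank):
        self.owner_rank = owner_rank
        self.bytes_received = 0


class ToyCluster:
    """Hypothetical 2-CTA cluster: one barrier slot per CTA rank."""

    def __init__(self, num_ctas):
        self.barriers = [ToyBarrier(rank) for rank in range(num_ctas)]

    def remote_view(self, target_rank):
        # Assumption: a remote view resolves to the target CTA's barrier,
        # no matter which CTA asks for it.
        return self.barriers[target_rank]


cluster = ToyCluster(num_ctas=2)
byte_count = 4096  # illustrative per-CTA load size, not from the source

for cbx in range(2):                       # cbx = CTA rank within the pair
    tma2mma_cta0 = cluster.remote_view(0)  # both ranks resolve to CTA-0
    tma2mma_cta0.bytes_received += byte_count

# All bytes land on CTA-0's barrier; CTA-1's slot is never touched.
assert cluster.barriers[0].bytes_received == 2 * byte_count
assert cluster.barriers[1].bytes_received == 0
```

In this model the barrier slot belonging to rank 1 exists only as an artifact of the list; nothing ever signals it, which mirrors the statement that CTA-1 has no tma2mma of its own.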
cbx == 0 guard on arrive
if cbx == 0:
    tma2mma_cta0.arrive(stage, CTA_GROUP * byte_count)
Only CTA-0 signals arrive. It reports the TOTAL bytes from both CTAs.
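
Why the total matters can be seen in a toy byte-counting barrier. The sketch below is a simplified plain-Python model, assuming the barrier opens once its single expected arrival has happened and the outstanding byte count has dropped to zero; ToyTxBarrier, ctas_loaded_when_barrier_opens and the byte_count value are hypothetical, and real hardware barrier semantics are richer than this.

```python
CTA_GROUP = 2
byte_count = 16384  # illustrative bytes loaded per CTA, not from the source


class ToyTxBarrier:
    """Hypothetical barrier: one expected arrival plus a byte counter."""

    def __init__(self):
        self.pending_arrivals = 1
        self.pending_bytes = 0

    def arrive(self, expected_bytes):
        # The single arrival also announces how many bytes to wait for.
        self.pending_arrivals -= 1
        self.pending_bytes += expected_bytes

    def complete_bytes(self, nbytes):
        # Called as each CTA's load finishes writing its data.
        self.pending_bytes -= nbytes

    def is_open(self):
        return self.pending_arrivals == 0 and self.pending_bytes == 0


def ctas_loaded_when_barrier_opens(expected_bytes):
    """Return how many CTAs had finished loading when the barrier opened."""
    barrier = ToyTxBarrier()
    barrier.arrive(expected_bytes)  # only CTA-0 arrives
    for loaded in range(1, CTA_GROUP + 1):
        barrier.complete_bytes(byte_count)
        if barrier.is_open():
            return loaded
    return None


# Reporting the total: the barrier opens only after both CTAs' data landed.
assert ctas_loaded_when_barrier_opens(CTA_GROUP * byte_count) == 2
# Reporting one CTA's share: the barrier would open one load too early.
assert ctas_loaded_when_barrier_opens(byte_count) == 1
```

The second assertion is the failure case the guard avoids: with only one CTA's share reported, the consumer could be released before the other CTA's tile has arrived.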
cbx == 0 guard on MMA
if cbx == 0:
    Tx.gemm_async(..., cta_group=CTA_GROUP)
    mma2tma.arrive(stage, cta_group=2, cta_mask=3)
Only CTA-0 issues MMA. cta_mask=3 (0b11) means the signal reaches BOTH CTAs' mma2tma barriers.
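
The mask value can be decoded with a few lines of plain Python. This is a minimal sketch, assuming bit i of cta_mask selects the CTA with rank i in the pair, which is consistent with 0b11 reaching both CTAs; ctas_in_mask is a hypothetical helper written for this example, not part of any library.

```python
def ctas_in_mask(cta_mask, cluster_size=2):
    """List the CTA ranks whose bit is set in cta_mask.

    Assumption: bit i of the mask corresponds to CTA rank i.
    """
    return [rank for rank in range(cluster_size) if (cta_mask >> rank) & 1]


assert ctas_in_mask(0b11) == [0, 1]  # cta_mask=3: both CTAs are signalled
assert ctas_in_mask(0b01) == [0]     # only CTA-0 would be signalled
assert ctas_in_mask(0b10) == [1]     # only CTA-1 would be signalled
```

With a mask of 0b01, CTA-1 would never learn that its SMEM stage had been consumed and could not safely reuse it for the next load, which is why the signal is sent to both.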