Distributed Shared Memory

Two SMs in the same cluster can read each other's shared memory (SMEM) through the SM-to-SM interconnect.

SM-0 (CTA-0)

Shared Memory: A₀ (A tile, 128×64), B₀ (B tile, 128×64)
Tensor Memory: accumulator, 128×256
MMA Engine

SM-1 (CTA-1)

Shared Memory: A₁ (A tile, 128×64), B₁ (B tile, 128×64)
Tensor Memory: accumulator, 128×256
MMA Engine: idle (CTA-0 drives the MMA for both CTAs)
tcgen05.mma with cta_group::2 reads its SMEM operands from both CTAs in the cluster.
The result is distributed: 128 rows in CTA-0's TMEM plus 128 rows in CTA-1's TMEM, for a 256×256 accumulator in total.
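
The tile arithmetic above can be checked with a small host-side model. This is only a sketch in plain Python, not GPU code: it assumes each B tile is stored as N×K (so the pair computes D = A·Bᵀ) and that each CTA's accumulator holds its own 128 rows of D across all 256 columns. The names `smem`, `tmem`, `matmul_nt`, and `make_tile` are hypothetical and exist only for this illustration.

```python
import random

# Tile shapes taken from the diagram: each CTA holds a 128x64 A tile
# and a 128x64 B tile; each accumulator is 128x256.
M_PER_CTA, N_PER_CTA, K = 128, 128, 64


def make_tile(rows, cols, rng):
    """Build a rows x cols tile of small integers (exact arithmetic)."""
    return [[rng.randint(-3, 3) for _ in range(cols)] for _ in range(rows)]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def matmul_nt(a, b):
    """Return a . b^T for row-major a (m x k) and b (n x k)."""
    return [[dot(row_a, row_b) for row_b in b] for row_a in a]


rng = random.Random(0)

# Model of each CTA's shared memory (assumed layout, see lead-in).
smem = {
    cta: {
        "A": make_tile(M_PER_CTA, K, rng),
        "B": make_tile(N_PER_CTA, K, rng),
    }
    for cta in (0, 1)
}

# The B operand seen by the pair spans both CTAs' SMEM: 256 x 64.
b_pair = smem[0]["B"] + smem[1]["B"]

# Each CTA's TMEM accumulates its own 128 rows against the full B operand.
tmem = {cta: matmul_nt(smem[cta]["A"], b_pair) for cta in (0, 1)}

# Stacking the two accumulators gives the full result.
combined = tmem[0] + tmem[1]
print(len(tmem[0]), len(tmem[0][0]))    # shape of one accumulator
print(len(combined), len(combined[0]))  # shape of the full result

# Spot check: global element (200, 10) lives in CTA-1's TMEM at local
# row 72, but its B row (10) comes from CTA-0's SMEM.
row, col = 200, 10
owner, local_row = divmod(row, M_PER_CTA)
assert owner == 1 and local_row == 72
assert tmem[owner][local_row][col] == dot(smem[1]["A"][local_row], smem[0]["B"][col])
```

The spot check is the point of the model: an output element owned by one CTA can depend on a B row that sits in the other CTA's shared memory, which is why the paired MMA needs to read SMEM from both CTAs.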