Distributed Shared Memory
Two adjacent SMs can read each other's SMEM through the interconnect
SM-0 (CTA-0)
Shared Memory: A₀ (A tile, 128×64) and B₀ (B tile, 128×64)
Tensor Memory: accumulator, 128×256
MMA Engine
SM-1 (CTA-1)
Shared Memory: A₁ (A tile, 128×64) and B₁ (B tile, 128×64)
Tensor Memory: accumulator, 128×256
MMA Engine: idle; CTA-0 drives MMA for both
tcgen05.mma with cta_group=2 reads SMEM from both CTAs in the cluster.
Result is distributed: 128 rows in CTA-0's TMEM + 128 rows in CTA-1's TMEM → 256×256 total.
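The shape bookkeeping in the figure can be checked with a small CPU model. This is only a sketch of the arithmetic, not GPU code: the helper names `make_tile` and `cta_pair_mma` are hypothetical, and it assumes each B tile is stored as N×K, so that the full B operand is the two CTAs' 128×64 tiles stacked into 256×64 and the product is D = A·Bᵀ. Under that assumption, each CTA multiplies its own 128 rows of A against B tiles gathered from both CTAs and keeps a 128×256 accumulator.

```python
# CPU model of the tile arithmetic only; this is not GPU code.
M_PER_CTA, N_PER_CTA, K = 128, 128, 64  # tile sizes taken from the figure


def make_tile(rows, cols, seed):
    """Deterministic small-integer tile, stored as a list of rows."""
    return [[(seed + 3 * r + 5 * c) % 7 - 3 for c in range(cols)]
            for r in range(rows)]


def cta_pair_mma(a_tiles, b_tiles):
    """Return one accumulator per CTA for D = A @ B^T.

    Assumption: B tiles are stored N x K, so the full B operand is the
    row-wise concatenation of both CTAs' B tiles.
    """
    b_full = b_tiles[0] + b_tiles[1]  # 256 x 64, gathered from both CTAs
    accumulators = []
    for a in a_tiles:  # CTA i contributes its own 128 rows of A
        acc = [[sum(x * y for x, y in zip(a_row, b_row)) for b_row in b_full]
               for a_row in a]
        accumulators.append(acc)  # 128 x 256, modelling CTA i's TMEM
    return accumulators


a_tiles = [make_tile(M_PER_CTA, K, seed=1), make_tile(M_PER_CTA, K, seed=2)]
b_tiles = [make_tile(N_PER_CTA, K, seed=3), make_tile(N_PER_CTA, K, seed=4)]
acc0, acc1 = cta_pair_mma(a_tiles, b_tiles)

total_rows = len(acc0) + len(acc1)
total_cols = len(acc0[0])
print(f"per-CTA accumulator: {len(acc0)}x{len(acc0[0])}, "
      f"combined: {total_rows}x{total_cols}")
```

In this model, columns 128 to 255 of each accumulator depend on the other CTA's B tile, which is why the instruction has to read shared memory from both CTAs in the cluster.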