Why Cluster?

Double the arithmetic intensity with the same per-CTA memory traffic

1 CTA (Step 8)

SMEM: A_smem (128×64) and B_smem (128×64) feed the MMA
Output: 128 × 128 = 16,384 results
Data loaded: (128×64 + 128×64) × 2B = 32 KB
Arithmetic Intensity: 16,384 results / 32 KB = 512 results/KB (32,768 FMA/KB, since each result accumulates K = 64 products)

2 CTAs (Step 9)

CTA-0: A₀ (128×64), B₀ (128×64)
CTA-1: A₁ (128×64), B₁ (128×64)
Cooperative MMA across both CTAs
Output: 256 × 256 = 65,536 results
Data loaded: 2 × (128×64 + 128×64) × 2B = 64 KB
Arithmetic Intensity: 65,536 results / 64 KB = 1024 results/KB (65,536 FMA/KB, since each result accumulates K = 64 products)
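
The sketch below illustrates why two CTAs that each load one A tile and one B tile can together produce four times the output of a single CTA. It is plain Python, not GPU code, and it rests on assumptions the slide does not spell out: that every CTA in the cluster can read every B tile, and that CTA i produces the output rows belonging to its own A tile. The helper names (matmul_nt, cooperative_mma, make_tile) are hypothetical, and the tiles are scaled down from 128×64 to 4×2 so the example runs quickly.

```python
def matmul_nt(a, b):
    """Return a @ b^T for lists of rows (a: M x K, b: N x K)."""
    return [[sum(x * y for x, y in zip(a_row, b_row)) for b_row in b] for a_row in a]


def cooperative_mma(a_tiles, b_tiles):
    """Combine per-CTA tiles into one output matrix.

    Assumption: every CTA can read every B tile, and CTA i produces the
    output rows that correspond to its own A tile.
    """
    # Stack the B tiles along N so each CTA sees the full B operand.
    b_shared = [row for tile in b_tiles for row in tile]
    out = []
    for a in a_tiles:  # one iteration per CTA
        out.extend(matmul_nt(a, b_shared))
    return out


def make_tile(rows, cols, seed):
    """Deterministic integer tile, used only as stand-in data."""
    return [[(seed + 3 * r + 5 * c) % 7 - 3 for c in range(cols)] for r in range(rows)]


# Scaled-down stand-ins for the 128x64 tiles on the slide.
TILE_ROWS, TILE_K = 4, 2
a_tiles = [make_tile(TILE_ROWS, TILE_K, seed) for seed in (0, 1)]
b_tiles = [make_tile(TILE_ROWS, TILE_K, seed) for seed in (2, 3)]

single = matmul_nt(a_tiles[0], b_tiles[0])  # what one CTA produces alone
out = cooperative_mma(a_tiles, b_tiles)     # what the two CTAs produce together

print(len(single), "x", len(single[0]))  # 4 x 4
print(len(out), "x", len(out[0]))        # 8 x 8
```

With the slide's real tile sizes the same bookkeeping gives 128×128 for one CTA and 256×256 for the pair, while each CTA still loads only its own two tiles.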
Same bandwidth: each CTA still loads one A tile + one B tile
4× computation: MMA output grows from 128² to 256²
2× arithmetic intensity: more compute per byte transferred
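
The figures above can be reproduced with a few lines of arithmetic. The function below is a minimal sketch: tile_stats and its parameters are hypothetical names, and it assumes each CTA loads one A tile and one B tile of 128×64 two-byte elements, with the cluster MMA covering a square output whose side is the number of CTAs times 128. Only the 1-CTA and 2-CTA cases come from the slide. It reports intensity both as output results per KB (the 512 and 1024 figures) and as FMAs per KB, which is 64 times larger because each result accumulates K = 64 products.

```python
def tile_stats(num_ctas, tile_rows=128, tile_k=64, bytes_per_elem=2):
    """Output size, data loaded, and intensity for a cluster of num_ctas CTAs.

    Assumption: each CTA loads one A tile and one B tile of
    tile_rows x tile_k elements, and the cluster MMA produces a square
    output of side num_ctas * tile_rows.
    """
    side = num_ctas * tile_rows
    results = side * side
    kb_loaded = num_ctas * 2 * tile_rows * tile_k * bytes_per_elem / 1024
    fmas = results * tile_k  # each output element accumulates tile_k products
    return {
        "results": results,
        "kb_loaded": kb_loaded,
        "results_per_kb": results / kb_loaded,
        "fma_per_kb": fmas / kb_loaded,
    }


one = tile_stats(1)  # 1 CTA (Step 8)
two = tile_stats(2)  # 2 CTAs (Step 9)

print(one["results"], one["kb_loaded"], one["results_per_kb"])  # 16384 32.0 512.0
print(two["results"], two["kb_loaded"], two["results_per_kb"])  # 65536 64.0 1024.0
print(two["results"] / one["results"])                          # 4.0
print(two["results_per_kb"] / one["results_per_kb"])            # 2.0
```

The ratios are what matter: output grows 4× while the data loaded grows only 2×, so intensity doubles regardless of which unit is used.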