Why Cluster?
Double the arithmetic intensity with the same per-CTA memory bandwidth
1 CTA (Step 8)
SMEM: A tile 128×64, B tile 128×64
MMA output: 128 × 128 = 16,384 results
Data loaded: (128×64 + 128×64) × 2 B = 32 KB
Arithmetic intensity: 512 FMA/KB (16,384 / 32 KB, counting one FMA per result)
2 CTAs (Step 9)
CTA-0: A₀ tile 128×64, B₀ tile 128×64
CTA-1: A₁ tile 128×64, B₁ tile 128×64
Cooperative MMA output: 256 × 256 = 65,536 results
Data loaded: 2 × (128×64 + 128×64) × 2 B = 64 KB
Arithmetic intensity: 1024 FMA/KB (65,536 / 64 KB, counting one FMA per result), 2× the single-CTA figure
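The figures for both configurations can be reproduced with a short calculation. This is a minimal sketch: the helper name `tile_stats` and its parameters are hypothetical, and only the 1-CTA and 2-CTA cases come from the slide. It follows the slide's convention of dividing the number of output results by the kilobytes loaded.

```python
BYTES_PER_ELEMENT = 2  # the "2 B" in the data-loaded lines (16-bit operands)


def tile_stats(num_ctas, tile_m=128, tile_n=128, tile_k=64):
    """Return (KB loaded, output results, results per KB) for a group of CTAs.

    Assumption: each CTA loads one tile_m x tile_k A tile and one
    tile_n x tile_k B tile, and the group's MMA output covers
    (num_ctas * tile_m) x (num_ctas * tile_n) results.
    """
    elements_loaded = num_ctas * (tile_m * tile_k + tile_n * tile_k)
    kb_loaded = elements_loaded * BYTES_PER_ELEMENT / 1024
    results = (num_ctas * tile_m) * (num_ctas * tile_n)
    return kb_loaded, results, results / kb_loaded


for n in (1, 2):
    kb, results, intensity = tile_stats(n)
    print(f"{n} CTA(s): {kb:.0f} KB loaded, {results:,} results, {intensity:.0f} per KB")
# 1 CTA(s): 32 KB loaded, 16,384 results, 512 per KB
# 2 CTA(s): 64 KB loaded, 65,536 results, 1024 per KB
```

The loaded data grows linearly with the number of CTAs while the output area grows quadratically, which is where the 2× ratio comes from.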
Same per-CTA bandwidth: each CTA still loads one A tile + one B tile
4× computation: MMA output grows from 128² to 256²
2× arithmetic intensity: more compute per byte transferred
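To see why two A tiles and two B tiles yield four times the output, the sketch below checks that the product of the stacked operands equals four block products, one per (A tile, B tile) pair. It rests on assumptions not stated on the slide: the B tiles are stored as N×K so the product is A·Bᵀ, the tile sizes are scaled down (4×3 instead of 128×64) to keep it fast, and the way the hardware divides the work between the two CTAs is not modeled. The helper names are hypothetical.

```python
import random


def matmul_nt(a, b):
    """Return a @ b^T for row-major nested lists: a is M x K, b is N x K."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


# Scaled-down stand-ins for the 128x64 tiles (hypothetical sizes).
TILE, K = 4, 3
rng = random.Random(0)


def make_tile():
    return [[rng.randint(-3, 3) for _ in range(K)] for _ in range(TILE)]


a0, a1, b0, b1 = make_tile(), make_tile(), make_tile(), make_tile()

# Product of the stacked operands: (2*TILE) x (2*TILE) results.
full = matmul_nt(a0 + a1, b0 + b1)

# The same output assembled from four TILE x TILE block products.
top = [r0 + r1 for r0, r1 in zip(matmul_nt(a0, b0), matmul_nt(a0, b1))]
bottom = [r0 + r1 for r0, r1 in zip(matmul_nt(a1, b0), matmul_nt(a1, b1))]
blocks = top + bottom

assert blocks == full  # integer inputs, so the comparison is exact
print(len(full) * len(full[0]), "results from 4 loaded tiles")
print(TILE * TILE, "results from 2 loaded tiles (single pair)")
```

Every A tile is paired with every B tile, so doubling the loaded tiles quadruples the number of block products.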