GEMM Optimization

From shared memory tiling to TMA data movement


Design the shared memory layout first, then configure the Tensor Memory Accelerator (TMA) to fill it efficiently.

Tiling shape → Best swizzle → Swizzle-constrained tiling → TMA with swizzle
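
The four steps above can be sketched on the host side in a few lines of Python. This is only an illustration under stated assumptions, not the page's own implementation: it assumes the 128-byte swizzle mode XORs the 16-byte chunk index (address bits 4-6) with the row index (bits 7-9), that shared memory has 32 banks of 4 bytes each, and that a TMA box's inner extent may not exceed the swizzle span. The names swizzle_128b, start_banks, and tma_box_split are hypothetical helpers introduced here.

```python
# Illustrative model of a 128-byte XOR swizzle and its effect on tiling.
# All helper names are hypothetical; the constants are assumptions stated above.
SWIZZLE_BYTES = 128  # swizzle span: bytes in one shared-memory tile row
BANK_BYTES = 4       # width of one shared-memory bank
NUM_BANKS = 32       # number of shared-memory banks


def swizzle_128b(byte_offset: int) -> int:
    """XOR the 16-byte chunk index (bits 4-6) with the row index (bits 7-9)."""
    row = (byte_offset >> 7) & 0x7
    return byte_offset ^ (row << 4)


def start_banks(byte_offsets):
    """Set of banks in which the given accesses begin."""
    return {(off // BANK_BYTES) % NUM_BANKS for off in byte_offsets}


def tma_box_split(tile_inner_elems: int, elem_bytes: int,
                  swizzle_bytes: int = SWIZZLE_BYTES):
    """Cap the box inner extent at the swizzle span.

    Returns (box_inner_elems, boxes_needed) for one tile row.
    """
    box_inner = min(tile_inner_elems, swizzle_bytes // elem_bytes)
    boxes = -(-tile_inner_elems // box_inner)  # ceiling division
    return box_inner, boxes


# Column access: chunk 0 of eight consecutive 128-byte rows.
column = [row * SWIZZLE_BYTES for row in range(8)]
plain_banks = start_banks(column)
swizzled_banks = start_banks(swizzle_128b(off) for off in column)

print(len(plain_banks), len(swizzled_banks))
print(tma_box_split(128, 2))  # 128 two-byte elements along the inner dimension
```

In this model the unswizzled column access starts every row in the same bank, while the swizzled layout spreads the eight rows over eight different starting banks. The last call shows the tiling constraint: a 256-byte inner dimension of 2-byte elements no longer fits one swizzle span, so it is split into two boxes of 64 elements each.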