Tiling and Swizzling with 3D TMA
Use 3D TMA to copy into tiled shared memory
[Figure controls: Swizzle (None or 128B); Col offset]
[Figure: Global Memory, 16×256 fp16 (contiguous, row = 512 B) → 3D TMA → Shared Memory (tiled and swizzled)]
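The address mapping behind this picture can be sketched on the host. This is a minimal model, not the hardware's definition. It assumes the 512 B row is split into four tiles of 64 fp16 (128 B per tile row), so the 2D matrix is viewed as a 3D tensor with dims (64, 16, 4) and byte strides (2, 512, 128). It also assumes the shared-memory destination is 1024 B aligned and that the 128B swizzle XORs the 16-byte chunk index (address bits 4-6) with the row index modulo 8 (bits 7-9), which is how the pattern is commonly described. All function and constant names are hypothetical.

```python
# Host-side model of 3D-TMA tiling plus 128B swizzle (illustrative only).
ROWS, COLS, ELEM_BYTES = 16, 256, 2        # from the slide: 16x256 fp16
ROW_BYTES = COLS * ELEM_BYTES              # 512 B per global-memory row
TILE_COLS = 64                             # assumption: 64 fp16 = 128 B per tile row
TILE_ROW_BYTES = TILE_COLS * ELEM_BYTES    # 128 B
NUM_TILES = COLS // TILE_COLS              # 4 tiles side by side
TILE_BYTES = ROWS * TILE_ROW_BYTES         # 2048 B per tile in shared memory


def global_offset(row, col):
    """Byte offset of element (row, col) in the contiguous global matrix."""
    return row * ROW_BYTES + col * ELEM_BYTES


def tma_coords(row, col):
    """3D view of the 2D matrix: (x, y, z) = (col in tile, row, tile index)."""
    return col % TILE_COLS, row, col // TILE_COLS


def swizzle_128b(offset):
    """XOR the 16-byte chunk index (bits 4-6) with the row-in-8 index (bits 7-9)."""
    return offset ^ (((offset >> 7) & 0x7) << 4)


def smem_offset(row, col, swizzle=True):
    """Byte offset of element (row, col) in tiled shared memory."""
    x, y, z = tma_coords(row, col)
    # x is fastest, then y, then z: each tile is 16 contiguous 128 B rows.
    linear = z * TILE_BYTES + y * TILE_ROW_BYTES + x * ELEM_BYTES
    return swizzle_128b(linear) if swizzle else linear
```

With swizzle off, the first element of every row in a tile lands in the same 16-byte chunk column, so a column-wise read hits the same banks repeatedly. With the 128B swizzle, rows 0 to 7 land in eight distinct chunks. For example, element (row 1, col 0) sits at linear offset 128 and moves to offset 144 under this model.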