Tiling and Swizzling with 3D TMA

Use 3D TMA to copy into tiled shared memory
Figure: a 16×256 fp16 matrix in global memory (contiguous, 512 B per row) is copied by 3D TMA into tiled and swizzled shared memory. Diagram labels: Swizzle, Col offset.
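
As a rough illustration of the addressing involved, the C++ sketch below maps element (row, col) of the 16×256 fp16 matrix to its byte offset in global memory and in a tiled, swizzled shared-memory buffer. The slide does not state the tile width or the swizzle mode, so the sketch assumes 64-column tiles (128 B per tile row) stored back to back, and the 128-byte swizzle pattern, in which the 16-byte chunk index within a 128 B row is XORed with the row index modulo 8. It also assumes the shared buffer is 1024 B aligned. All names are hypothetical.

```cpp
#include <cstdint>

// Matrix shape from the slide: 16 x 256 fp16, so one row is 512 bytes.
constexpr uint32_t kRows = 16;
constexpr uint32_t kCols = 256;
constexpr uint32_t kElemBytes = 2;
constexpr uint32_t kRowBytes = kCols * kElemBytes;          // 512

// ASSUMPTION (not on the slide): tiles are 64 columns wide, so one
// tile row is 128 bytes, matching the span of the 128-byte swizzle.
constexpr uint32_t kTileCols = 64;
constexpr uint32_t kTileRowBytes = kTileCols * kElemBytes;  // 128
constexpr uint32_t kNumTiles = kCols / kTileCols;           // 4
constexpr uint32_t kTileBytes = kRows * kTileRowBytes;      // 2048

// 3D view of the 2D matrix under the assumptions above:
//   dim0 = column within a tile (64 elements, contiguous)
//   dim1 = row                  (16, stride 512 B)
//   dim2 = tile index           (4,  stride 128 B) <- the "col offset"
constexpr uint32_t kViewDims[3] = {kTileCols, kRows, kNumTiles};
constexpr uint32_t kViewStridesBytes[2] = {kRowBytes, kTileRowBytes};

// Byte offset of (row, col) in the contiguous global-memory matrix.
constexpr uint32_t global_offset(uint32_t row, uint32_t col) {
    return row * kRowBytes + col * kElemBytes;
}

// Byte offset of (row, col) in the tiled shared layout before swizzling:
// tile-major, then row within the tile, then column within the tile.
constexpr uint32_t tiled_offset(uint32_t row, uint32_t col) {
    return (col / kTileCols) * kTileBytes
         + row * kTileRowBytes
         + (col % kTileCols) * kElemBytes;
}

// 128-byte swizzle: XOR byte-offset bits [4,7) (the 16 B chunk index)
// with bits [7,10) (the 128 B row index modulo 8).
constexpr uint32_t swizzle_128b(uint32_t offset) {
    return offset ^ (((offset >> 7) & 7u) << 4);
}

// Final shared-memory byte offset of (row, col).
constexpr uint32_t shared_offset(uint32_t row, uint32_t col) {
    return swizzle_128b(tiled_offset(row, col));
}
```

Under these assumptions, element (1, 0) sits at byte 512 in global memory and at byte 144 in shared memory: its unswizzled tiled offset is 128, and row 1 moves it one 16 B chunk over. Rows 0 and 8 of each tile are left unchanged by the swizzle.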