TMA: Global to Shared Memory

Hardware 2D copy with optional swizzle: one instruction issued by a single thread, no per-thread address math
Labels: swizzle, row offset, col offset

Global Memory

16×128 fp16 (each cell = 16B = 1×8 fp16)
TMA: cp.async.bulk.tensor.2d + SWIZZLE_128B

Shared Memory

8×8 sectors of 16 B each (swizzled)
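
The copy shown above can be emulated on the host to see where each 16 B cell lands. This is a minimal sketch, not the hardware's implementation. It assumes the commonly described 128-byte swizzle, where the cell index within a 128 B row is XORed with the row index modulo 8, and it assumes the shared-memory tile starts at a 1024-byte-aligned address. The names `swizzle_128b` and `copy_box` are hypothetical helpers, and the offsets are given in cells rather than in fp16 elements.

```python
CELLS_PER_ROW = 8   # 8 cells x 16 B = 128 B per shared-memory row
SWIZZLE_ROWS = 8    # the XOR pattern repeats every 8 rows (1024 B)


def swizzle_128b(row, cell):
    """Physical cell index within `row` for logical `cell` (assumed XOR pattern)."""
    return cell ^ (row % SWIZZLE_ROWS)


def copy_box(global_cells, row_off, col_off, rows=8, cols=CELLS_PER_ROW):
    """Emulate copying a rows x cols box of cells into swizzled shared memory."""
    shared = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Read the logical cell from the global tile, store it at its swizzled slot.
            shared[r][swizzle_128b(r, c)] = global_cells[row_off + r][col_off + c]
    return shared


# Global tile: 16 rows x 16 cells (16 x 128 fp16); each cell is labeled (row, col).
global_cells = [[(r, c) for c in range(16)] for r in range(16)]

# Copy the 8x8 box whose top-left cell is at row 4, cell column 8.
shared = copy_box(global_cells, row_off=4, col_off=8)

# Where does logical column 0 of the box end up in each shared-memory row?
column0_slots = [swizzle_128b(r, 0) for r in range(8)]
```

Under these assumptions, logical column 0 of the box occupies a different 16 B slot in every one of the 8 rows, which is why a column-wise read of the swizzled tile spreads across banks instead of hitting the same ones repeatedly.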