TMA: Global to Shared Memory
Hardware 2D copy with optional swizzle: one instruction issued by a single thread, no per-thread copy loop
Controls: Swizzle (None or 128B), Row offset, Col offset
Global Memory
16×128 fp16 (each cell = 16B = 1×8 fp16)
→ TMA (cp.async.bulk.tensor.2d + SWIZZLE_128B) →
Shared Memory
8×8 sectors (swizzled)
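
The 8×8 swizzled layout can be sketched in a few lines. This is a minimal illustration, assuming the commonly described 128B swizzle pattern in which the 16-byte chunk index within a 128-byte row is XORed with the row index modulo 8, and assuming the shared memory tile starts at a 1024-byte-aligned address. The names `swizzle_128b`, `smem_byte_offset`, and `layout` are hypothetical helpers for this sketch, not part of any CUDA API.

```python
ROWS = 8          # rows in one 128B swizzle atom (8 x 128 B = 1024 B)
CHUNKS = 8        # 16-byte chunks ("sectors") per 128-byte row
CHUNK_BYTES = 16  # one chunk = 8 fp16 values


def swizzle_128b(row, chunk):
    """Physical chunk column for a logical (row, chunk) pair.

    Assumption: chunk index XOR (row mod 8), the usual 128B swizzle pattern.
    """
    return chunk ^ (row % ROWS)


def smem_byte_offset(row, chunk, swizzle=True):
    """Byte offset of a logical chunk inside the shared memory tile."""
    col = swizzle_128b(row, chunk) if swizzle else chunk
    return (row * CHUNKS + col) * CHUNK_BYTES


def layout(swizzle=True):
    """Grid where grid[row][physical_col] holds the logical chunk index."""
    grid = [[None] * CHUNKS for _ in range(ROWS)]
    for r in range(ROWS):
        for c in range(CHUNKS):
            col = swizzle_128b(r, c) if swizzle else c
            grid[r][col] = c
    return grid


if __name__ == "__main__":
    # Print the swizzled tile: each row is a permutation of chunks 0..7.
    for row in layout(swizzle=True):
        print(" ".join(str(c) for c in row))
```

Under this assumption, logical chunk 0 of rows 0 through 7 lands in eight different physical columns, so a column-wise read of the tile touches eight distinct 16-byte slots rather than the same one in every row. With swizzle set to None, the mapping is the identity and the tile is stored row-major.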