TMA: Global to Shared Memory
Hardware 2D copy with optional swizzle — one instruction, no threads
Swizzle
None
128B
Row offset
Col offset
Global Memory
16×128 fp16 (each cell = 16B = 1×8 fp16)
TMA Engine
Data Mover
→
cp.async.bulk.tensor.2d
+ SWIZZLE_128B
Shared Memory
8×8 sectors (swizzled)
▸
Single-thread dispatch
(non-blocking)
▸
Automatic swizzling
— data arrives bank-conflict-free
▸
Bidirectional:
load (GMEM→SMEM) and store (SMEM→GMEM)