Why Prefer 128B Swizzle?

Each row of an atom is contiguous in global memory — wider rows = better locality
Highlight row
From tensor core side: any swizzle format works (16×16 tile decomposes into any atom size).
From global memory side: each atom row is a contiguous segment. Wider row = fewer, larger reads = better bandwidth.
SWIZZLE_128B gives 128 contiguous bytes per row — the best default unless tile width < 64 fp16.