Tensor Core Benefits From Swizzling
The tensor core reads an N×16 tile from shared memory and benefits from the conflict-free 8×8 reads that swizzling guarantees.
Transpose A: No | Yes
Transpose B: No | Yes
N (output rows): 16 | 32 | 64 | 128
M (output cols): 16 | 32 | 64 | 128
K: 16
Row group
Col group
A tile atoms: 2×2 of 8×8
Each 8×8 atom is an 8×16B cell (8 rows of 16 bytes), conflict-free if swizzled.
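As a rough illustration of why one 8×16B cell reads conflict-free once swizzled, the sketch below counts bank hits when 8 rows each supply one 16-byte chunk. It assumes 32 four-byte banks, a 128-byte row stride (eight 16-byte chunks per row), and an XOR swizzle on the chunk index. These specifics and the helper names are illustrative assumptions, not details stated here.

```python
BANKS = 32           # assumed: 32 shared-memory banks
BANK_BYTES = 4       # assumed: each bank is 4 bytes wide
CHUNK_BYTES = 16     # one row of an 8x16B cell
CHUNKS_PER_ROW = 8   # assumed: 128-byte row stride in shared memory


def chunk_address(row, chunk, swizzle):
    """Byte address of a 16-byte chunk; hypothetical XOR swizzle on the chunk index."""
    physical = chunk ^ (row % CHUNKS_PER_ROW) if swizzle else chunk
    return (row * CHUNKS_PER_ROW + physical) * CHUNK_BYTES


def max_bank_hits(rows, chunk, swizzle):
    """Largest number of accesses landing on any single bank for one 8-row read."""
    hits = [0] * BANKS
    for row in rows:
        base = chunk_address(row, chunk, swizzle)
        for offset in range(0, CHUNK_BYTES, BANK_BYTES):
            hits[((base + offset) // BANK_BYTES) % BANKS] += 1
    return max(hits)


rows = range(8)  # the 8 rows of one 8x8 atom
unswizzled = max_bank_hits(rows, chunk=0, swizzle=False)
swizzled = max_bank_hits(rows, chunk=0, swizzle=True)
print(f"worst bank load without swizzle: {unswizzled}, with swizzle: {swizzled}")
```

Under these assumptions, the unswizzled layout sends all 8 rows to the same 4 banks, while the XOR spreads the 8 chunks across all 32 banks so each bank is touched once.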
A (128×128) × B (128×128) → C
K iteration: