Tensor Core Benefits From Swizzling

Tensor core reads N×16 tile from shared memory — benefits from 8×8 conflict-free read that swizzle guarantees
Transpose A
Transpose B
N (output rows)
M (output cols)
K
Row group
Col group

A tile atoms

2×2 of 8×8
each 8×8 is a 8×16B cell, conflict-free if swizzled

A (128×128)

×

B (128×128)

C

K iteration: