Tensor Core Benefits From Swizzling

Tensor core reads M×16 tile from shared memory — benefits from 8×8 conflict-free read that swizzle guarantees
Transpose A
Transpose B
M (output rows)
N (output cols)
K
Row group
Col group

A tile atoms

2×2 of 8×8
each 8×8 is a 8×16B cell, conflict-free if swizzled

A (128×128)

×

B (128×128)

C

K iteration: