Warp Specialization: Overlap Everything

Dedicated roles for load / compute / writeback — maximize hardware utilization

Before (Step 4): Sequential TIME →

Single warp

L k=0

M k=0

L k=1

M k=1

L k=2

M k=2

L k=3

M k=3

L k=4

M k=4

Load and compute are serialized — hardware idle 50% of time

After (Step 7): Pipelined TIME →

TMAWG1 / warp3

L k=0

L k=1

L k=2

L k=3

L k=4

MMAWG1 / warp0

idle

M k=0

M k=1

M k=2

M k=3

M k=4

WritebackWG0

idle

WB 0

WB 1

All three actors run concurrently — pipeline stays full

Barrier Roles

tma2mma[PIPE_DEPTH] →

TMA signals MMA: SMEM data ready for stage k

Type: TMABar (TMA arrive with byte count)

mma2tma[PIPE_DEPTH] →

MMA signals TMA: SMEM buffer free, can reuse stage k

Type: TCGen05Bar (tcgen05.commit)

mma2ld →

MMA signals Writeback: TMEM accumulation complete

Type: TCGen05Bar (tcgen05.commit)

ld2mma →

Writeback signals MMA: TMEM read complete, safe to overwrite

Type: MBarrier (128 threads arrive)