Warp Specialization: Overlap Everything

Dedicated roles for load / compute / writeback — maximize hardware utilization

Before (Step 4): Sequential TIME →
Single warp
L k=0
M k=0
L k=1
M k=1
L k=2
M k=2
L k=3
M k=3
L k=4
M k=4
WB
Load and compute are serialized — hardware idle 50% of time
After (Step 7): Pipelined TIME →
TMAWG1 / warp3
L k=0
L k=1
L k=2
L k=3
L k=4
MMAWG1 / warp0
idle
M k=0
M k=1
M k=2
M k=3
M k=4
WritebackWG0
idle
WB 0
WB 1
All three actors run concurrently — pipeline stays full
Barrier Roles
tma2mma[PIPE_DEPTH]
TMA signals MMA: SMEM data ready for stage k
Type: TMABar (TMA arrive with byte count)
mma2tma[PIPE_DEPTH]
MMA signals TMA: SMEM buffer free, can reuse stage k
Type: TCGen05Bar (tcgen05.commit)
mma2ld
MMA signals Writeback: TMEM accumulation complete
Type: TCGen05Bar (tcgen05.commit)
ld2mma
Writeback signals MMA: TMEM read complete, safe to overwrite
Type: MBarrier (128 threads arrive)