Pipeline: Overlap Load, Compute, Writeback

With warp specialization, all three stages run concurrently

Execution Timeline (lanes: TMA Load, MMA Compute, Writeback)
Why pipeline? Without pipelining, each tile must finish Load → Compute → Writeback before the next tile starts, so only one stage is active at a time. With 3+ pipeline stages, TMA loads tile i+2 while MMA computes tile i+1 and Writeback stores tile i, hiding latency and keeping all three hardware units busy at once.
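The overlap described above can be sketched as a small schedule simulation. This is an illustrative model only, not GPU code: it assumes every stage takes exactly one time step per tile, and the names `pipelined_schedule` and `serial_steps` are hypothetical helpers introduced for this example. A real kernel would coordinate the stages with barriers and multiple shared-memory buffers rather than a fixed clock.

```python
# Toy model of a 3-stage software pipeline (assumption: each stage takes
# one time step per tile; real stage latencies differ and vary).
STAGES = ("TMA Load", "MMA Compute", "Writeback")


def pipelined_schedule(num_tiles):
    """Return one dict per time step mapping stage name -> tile index.

    A stage maps to None when it is idle (pipeline fill or drain).
    """
    depth = len(STAGES)
    schedule = []
    for step in range(num_tiles + depth - 1):
        row = {}
        for lag, stage in enumerate(STAGES):
            # Each later stage trails the one before it by one tile.
            tile = step - lag
            row[stage] = tile if 0 <= tile < num_tiles else None
        schedule.append(row)
    return schedule


def serial_steps(num_tiles):
    """Time steps needed when each tile runs all stages before the next starts."""
    return num_tiles * len(STAGES)


# Print the timeline for four tiles, one row per time step.
for step, row in enumerate(pipelined_schedule(4)):
    cells = ", ".join(
        f"{stage}: {'idle' if tile is None else f'tile {tile}'}"
        for stage, tile in row.items()
    )
    print(f"step {step}: {cells}")
```

Under this unit-time assumption, N tiles take N + 2 steps when pipelined versus 3N steps when run serially, and in the steady state every row has all three stages busy on consecutive tiles.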