Step 4: TMA Async Store (Writeback)

TMEM cannot write to GMEM directly — data flows through Registers and SMEM

Writeback Data Path

TMEM
fp32 accum
8-col chunk
Register
fp32
cast
Register
fp16
all threads
SMEM
Dsmem (128×64)
TMA store
GMEM
D[M, N]
Phase 1: TMEM → Registers (chunked 8-col)
Phase 2: Registers → SMEM → TMA Store
Why 8-col chunks?
tcgen05.ld reads max 8 cols per warpgroup copy. Loop 128/8=16 times for full row.
Why go through SMEM?
TMA only accesses shared memory. Path: TMEM→Reg→cast fp16→SMEM→TMA store→GMEM.
EPI_N=64 tiling
Dsmem is 128×64, not 128×128 — halves SMEM. Loop 128/64=2 times, each TMA-storing a 128×64 tile.