Step 4: TMA Async Store (Writeback)

TMEM cannot write to GMEM directly — data flows through Registers and SMEM

Writeback Data Path

TMEM

fp32 accum

8-col chunk

fp32

cast

fp16

all threads

SMEM

Dsmem (128×64)

TMA store

GMEM

D[M, N]

Phase 1: TMEM → Registers (chunked 8-col)

Phase 2: Registers → SMEM → TMA Store

Why 8-col chunks?

tcgen05.ld reads max 8 cols per warpgroup copy. Loop 128/8=16 times for full row.

Why go through SMEM?

TMA only accesses shared memory. Path: TMEM→Reg→cast fp16→SMEM→TMA store→GMEM.

EPI_N=64 tiling

Dsmem is 128×64, not 128×128 — halves SMEM. Loop 128/64=2 times, each TMA-storing a 128×64 tile.