Step 1: Sync Load + MMA + Sync Store

The simplest working GEMM (128×128×64): 0.1 ms, 0.021 TFLOP/s
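The throughput figure can be checked from the problem size and the runtime. The sketch below assumes the usual convention of counting 2 FLOPs (one multiply, one add) per inner-product term; the helper name `gemm_tflops` is hypothetical and not part of TIRx.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOP/s for an m x n x k GEMM, counting 2 FLOPs per multiply-add."""
    flops = 2 * m * n * k          # assumption: standard 2*M*N*K FLOP count
    return flops / seconds / 1e12  # FLOP/s -> TFLOP/s


# Figures quoted in the text: 128x128x64 in 0.1 ms.
print(f"{gemm_tflops(128, 128, 64, 0.1e-3):.3f} TFLOP/s")  # prints "0.021 TFLOP/s"
```

At this size the kernel performs only about 2.1 million FLOPs, so the measured time is likely dominated by fixed overheads rather than by the MMA itself.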

TIRx Kernel (~40 lines)

    
Dataflow
Global Mem → [Tx.copy] → SMEM → [gemm_async] → TMEM → [tcgen05.ld] → Registers → [cast+store] → Global Mem
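The dataflow above can be mimicked on the CPU to make the four stages concrete. This is a plain-Python stand-in, not TIRx code and not the kernel from this step. It assumes A is M×K, B is K×N, the accumulator is wider than the output, and the final cast is to fp16; none of these details are stated in the text, and the names `to_fp16` and `gemm_step1_reference` are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def gemm_step1_reference(a, b):
    """Emulate the four stages for one tile; a is M x K, b is K x N (lists of lists)."""
    m, k, n = len(a), len(b), len(b[0])

    # Stage 1 (Global Mem -> SMEM): synchronous copy of both operand tiles.
    smem_a = [row[:] for row in a]
    smem_b = [row[:] for row in b]

    # Stage 2 (SMEM -> TMEM): multiply-accumulate into a wide accumulator.
    tmem = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += smem_a[i][p] * smem_b[p][j]
            tmem[i][j] = acc

    # Stage 3 (TMEM -> Registers): read the accumulator back.
    regs = [row[:] for row in tmem]

    # Stage 4 (Registers -> Global Mem): cast (assumed fp16) and store.
    return [[to_fp16(v) for v in row] for row in regs]


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    print(gemm_step1_reference(a, b))  # [[19.0, 22.0], [43.0, 50.0]]
```

Each stage here finishes before the next one starts, which mirrors the fully synchronous structure named in the step title.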