Step 1: Sync Load + MMA + Sync Store
The simplest working GEMM (128×128×64): 0.1 ms, 0.021 TFLOP/s
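The throughput figure follows from the problem size and the runtime. Below is a minimal sketch of that arithmetic, assuming the usual convention of 2 floating-point operations (one multiply, one add) per multiply-accumulate. The helper name `gemm_tflops` is hypothetical and used only for illustration.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOP/s for an m x n x k GEMM, counting 2 FLOPs per MAC."""
    flops = 2 * m * n * k          # one multiply + one add per (i, j, k) triple
    return flops / seconds / 1e12  # FLOP/s -> TFLOP/s


# 128x128x64 GEMM finishing in 0.1 ms
rate = gemm_tflops(128, 128, 64, 0.1e-3)
print(f"{rate:.3f} TFLOP/s")  # 0.021 TFLOP/s
```

At this size the kernel performs only about 2.1 million FLOPs, so the measured time is likely dominated by fixed overheads rather than by the math itself.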
TIRx Kernel (~40 lines)
Dataflow
Global Mem --Tx.copy--> SMEM --gemm_async--> TMEM --tcgen05.ld--> Registers --cast+store--> Global Mem
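The four hops in the dataflow can be mirrored by a plain NumPy reference, which is handy for checking a kernel's output. This is only a sketch of the data movement, not TIRx code. The function name `gemm_reference`, the float16 inputs and output, the float32 accumulator, and the K×N layout of the second operand are all assumptions that the slide does not state.

```python
import numpy as np

M, N, K = 128, 128, 64  # problem size from the slide


def gemm_reference(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Host-side stand-in for the four dataflow stages (illustrative only)."""
    # Global Mem -> SMEM: stage the operand tiles (modelled as plain copies).
    a_smem = a.copy()
    b_smem = b.copy()
    # SMEM -> TMEM: multiply-accumulate into a float32 accumulator (assumed dtype).
    acc_tmem = a_smem.astype(np.float32) @ b_smem.astype(np.float32)
    # TMEM -> Registers: read the accumulator back out.
    acc_regs = acc_tmem.copy()
    # Registers -> Global Mem: cast to the output dtype (assumed float16) and store.
    return acc_regs.astype(np.float16)


rng = np.random.default_rng(0)
a = rng.standard_normal((M, K)).astype(np.float16)
b = rng.standard_normal((K, N)).astype(np.float16)
c = gemm_reference(a, b)
print(c.shape, c.dtype)
```

Comparing a kernel's result against such a reference with a loose tolerance accounts for the rounding introduced by the final cast.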