Code Regions
Overview
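The simplest GEMM computes a single 128×128×64 (M×N×K) tile, fully synchronously. All 128 threads load A and B from global memory (GMEM) into shared memory (SMEM) with synchronous copies, one elected thread issues the MMA, and then all threads write the result back.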
This is the skeleton: every later step builds on this structure.
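The three phases described above (load, MMA, writeback) can be sketched in plain CUDA. This is a minimal illustration under stated assumptions, not the kernel this walkthrough develops: the names `gemm_single_tile` and `max_abs_error_vs_cpu` are hypothetical, the fp16-input/fp32-output types and row-major layouts are assumed, and the single-thread MMA issue is replaced by a scalar loop in which each thread accumulates one output row, because the real MMA instruction is hardware-specific.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

constexpr int TILE_M = 128, TILE_N = 128, TILE_K = 64;
constexpr int NUM_THREADS = 128;

// Assumed layouts: A is TILE_M x TILE_K, B is TILE_K x TILE_N, C is TILE_M x TILE_N,
// all row-major. Launch with exactly one block of NUM_THREADS threads.
__global__ void gemm_single_tile(const __half* A, const __half* B, float* C) {
    __shared__ __half sA[TILE_M * TILE_K];  // 16 KiB
    __shared__ __half sB[TILE_K * TILE_N];  // 16 KiB

    // Phase 1: all threads cooperatively copy A and B from GMEM to SMEM.
    for (int i = threadIdx.x; i < TILE_M * TILE_K; i += NUM_THREADS) sA[i] = A[i];
    for (int i = threadIdx.x; i < TILE_K * TILE_N; i += NUM_THREADS) sB[i] = B[i];
    __syncthreads();  // the whole tile must be in SMEM before any math starts

    // Phase 2: scalar stand-in for the MMA. In the kernel described above a single
    // elected thread would issue one MMA for the whole tile; here every thread
    // accumulates one row of the 128x128 result instead.
    const int row = threadIdx.x;
    float acc[TILE_N];
    for (int n = 0; n < TILE_N; ++n) acc[n] = 0.0f;
    for (int k = 0; k < TILE_K; ++k) {
        const float a = __half2float(sA[row * TILE_K + k]);
        for (int n = 0; n < TILE_N; ++n)
            acc[n] += a * __half2float(sB[k * TILE_N + n]);
    }
    // Not required for this per-thread stand-in; it marks the point where a real
    // kernel would wait for the MMA to finish before reading the accumulator.
    __syncthreads();

    // Phase 3: all threads write their row of the result back to GMEM.
    for (int n = 0; n < TILE_N; ++n) C[row * TILE_N + n] = acc[n];
}

// Hypothetical host helper: runs the kernel on small integer inputs and returns the
// largest absolute difference from a CPU reference, or -1 if a CUDA call failed.
float max_abs_error_vs_cpu() {
    std::vector<__half> hA(TILE_M * TILE_K), hB(TILE_K * TILE_N);
    std::vector<float> hC(TILE_M * TILE_N, 0.0f);
    for (int i = 0; i < TILE_M * TILE_K; ++i) hA[i] = __float2half(float(i % 7) - 3.0f);
    for (int i = 0; i < TILE_K * TILE_N; ++i) hB[i] = __float2half(float(i % 5) - 2.0f);

    __half *dA = nullptr, *dB = nullptr;
    float* dC = nullptr;
    const size_t bytesA = hA.size() * sizeof(__half);
    const size_t bytesB = hB.size() * sizeof(__half);
    const size_t bytesC = hC.size() * sizeof(float);

    bool ok = cudaMalloc(&dA, bytesA) == cudaSuccess &&
              cudaMalloc(&dB, bytesB) == cudaSuccess &&
              cudaMalloc(&dC, bytesC) == cudaSuccess &&
              cudaMemcpy(dA, hA.data(), bytesA, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(dB, hB.data(), bytesB, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        gemm_single_tile<<<1, NUM_THREADS>>>(dA, dB, dC);
        // This blocking copy also waits for the kernel and surfaces launch errors.
        ok = cudaMemcpy(hC.data(), dC, bytesC, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    if (!ok) return -1.0f;

    float max_err = 0.0f;
    for (int m = 0; m < TILE_M; ++m)
        for (int n = 0; n < TILE_N; ++n) {
            float ref = 0.0f;
            for (int k = 0; k < TILE_K; ++k)
                ref += __half2float(hA[m * TILE_K + k]) * __half2float(hB[k * TILE_N + n]);
            max_err = std::fmax(max_err, std::fabs(ref - hC[m * TILE_N + n]));
        }
    return max_err;
}
```

Calling `max_abs_error_vs_cpu()` from a `main` compiled with `nvcc` checks the sketch against the CPU reference. The inputs are small integers, so every product and partial sum is exactly representable in fp32 and the comparison does not depend on rounding order. The row-per-thread mapping is chosen only so that the writeback phase has one obvious owner per output row.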