Step 1: Sync GEMM — Full Kernel Code

Single tile 128×128×64

Code Regions

Overview
The simplest GEMM: one 128×128×64 tile, fully synchronous. All 128 threads synchronously load A and B from global memory (GMEM) into shared memory (SMEM), one elected thread issues the MMA, and then all threads write the result back.
This is the skeleton: every later step builds on this structure.
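
The load, compute, writeback structure described above can be sketched in portable CUDA. This is a minimal illustration, not the kernel from this step. It makes several assumptions: FP16 inputs with FP32 accumulation, row-major A (128×64), B (64×128) and C (128×128), and hypothetical names `sync_gemm_tile` and `run_single_tile_gemm`. Most importantly, the hardware MMA issued by one elected thread is replaced by a scalar loop in which each thread computes one row of the tile, since the tensor-core instruction has no portable equivalent.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

constexpr int TILE_M = 128, TILE_N = 128, TILE_K = 64;
constexpr int NUM_THREADS = 128;

// Assumed layout: A is TILE_M x TILE_K, B is TILE_K x TILE_N, C is TILE_M x TILE_N, row-major.
__global__ void sync_gemm_tile(const __half* A, const __half* B, float* C) {
    __shared__ __half sA[TILE_M * TILE_K];  // 16 KiB
    __shared__ __half sB[TILE_K * TILE_N];  // 16 KiB
    const int tid = threadIdx.x;

    // Phase 1: all 128 threads copy A and B from GMEM to SMEM.
    for (int i = tid; i < TILE_M * TILE_K; i += NUM_THREADS) sA[i] = A[i];
    for (int i = tid; i < TILE_K * TILE_N; i += NUM_THREADS) sB[i] = B[i];
    __syncthreads();  // both tiles must be fully loaded before any compute

    // Phase 2: stand-in for the MMA. The real kernel has one elected thread
    // issue a hardware MMA; here thread `tid` computes row `tid` with scalar FMAs.
    float acc[TILE_N];
    for (int n = 0; n < TILE_N; ++n) acc[n] = 0.0f;
    for (int k = 0; k < TILE_K; ++k) {
        const float a = __half2float(sA[tid * TILE_K + k]);
        for (int n = 0; n < TILE_N; ++n)
            acc[n] += a * __half2float(sB[k * TILE_N + n]);
    }
    __syncthreads();  // marks where the real kernel would wait for the MMA to finish

    // Phase 3: all threads write their accumulators back to GMEM.
    for (int n = 0; n < TILE_N; ++n) C[tid * TILE_N + n] = acc[n];
}

// Hypothetical host helper: runs one tile and returns the max abs error
// against a CPU reference (INFINITY if any CUDA call fails).
float run_single_tile_gemm() {
    std::vector<float> fA(TILE_M * TILE_K), fB(TILE_K * TILE_N), hC(TILE_M * TILE_N);
    std::vector<__half> hA(fA.size()), hB(fB.size());
    // Small integer values are exact in FP16 and keep FP32 sums exact.
    for (int m = 0; m < TILE_M; ++m)
        for (int k = 0; k < TILE_K; ++k) {
            fA[m * TILE_K + k] = float((m + k) % 3 - 1);
            hA[m * TILE_K + k] = __float2half(fA[m * TILE_K + k]);
        }
    for (int k = 0; k < TILE_K; ++k)
        for (int n = 0; n < TILE_N; ++n) {
            fB[k * TILE_N + n] = float((k + n) % 5 - 2);
            hB[k * TILE_N + n] = __float2half(fB[k * TILE_N + n]);
        }

    __half *dA = nullptr, *dB = nullptr;
    float* dC = nullptr;
    const size_t bytesA = hA.size() * sizeof(__half);
    const size_t bytesB = hB.size() * sizeof(__half);
    const size_t bytesC = hC.size() * sizeof(float);
    bool ok = cudaMalloc(&dA, bytesA) == cudaSuccess &&
              cudaMalloc(&dB, bytesB) == cudaSuccess &&
              cudaMalloc(&dC, bytesC) == cudaSuccess &&
              cudaMemcpy(dA, hA.data(), bytesA, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(dB, hB.data(), bytesB, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        sync_gemm_tile<<<1, NUM_THREADS>>>(dA, dB, dC);  // one block, one tile
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(hC.data(), dC, bytesC, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    if (!ok) return INFINITY;

    float max_err = 0.0f;
    for (int m = 0; m < TILE_M; ++m)
        for (int n = 0; n < TILE_N; ++n) {
            float ref = 0.0f;
            for (int k = 0; k < TILE_K; ++k) ref += fA[m * TILE_K + k] * fB[k * TILE_N + n];
            max_err = std::fmax(max_err, std::fabs(ref - hC[m * TILE_N + n]));
        }
    return max_err;
}
```

Calling `run_single_tile_gemm()` from a host `main` launches one block of 128 threads and compares the result with a CPU reference. The two `__syncthreads()` calls are what make the sketch fully synchronous: no thread starts computing until both tiles are in SMEM, and no thread writes back until compute is done. The row-per-thread loop is chosen for clarity and is not tuned for performance.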