TMEM cannot be written to GMEM directly: data flows TMEM → registers → SMEM → GMEM.
tcgen05.ld reads at most 8 columns per warpgroup copy, so a full 128-column row takes 128/8 = 16 loop iterations.
Dsmem is 128×64 rather than 128×128, which halves its SMEM footprint. The epilogue loops 128/64 = 2 times, each iteration TMA-storing one 128×64 tile.
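
The loop structure above can be sketched on the host. This is a plain C++ model, not device code: ordinary arrays stand in for TMEM, registers, SMEM and GMEM, and the names (run_epilogue, EpilogueStats, the k* constants) are hypothetical. It assumes a 128×128 accumulator and nests the 8-column loads inside the two 64-column tiles, so the 16 loads split into 8 per tile. A real kernel would also need the synchronization between stages, which is omitted here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model only: plain arrays stand in for TMEM, registers, SMEM, GMEM.
constexpr std::size_t kRows     = 128;  // accumulator rows (assumed)
constexpr std::size_t kCols     = 128;  // accumulator columns
constexpr std::size_t kLdCols   = 8;    // columns per modeled tcgen05.ld step
constexpr std::size_t kSmemCols = 64;   // Dsmem width (128x64 staging tile)

static_assert(kCols % kSmemCols == 0 && kSmemCols % kLdCols == 0,
              "tile sizes must divide evenly");

struct EpilogueStats {
    std::size_t tmem_loads = 0;  // number of 8-column load steps
    std::size_t tma_stores = 0;  // number of 128x64 tile stores
};

// Hypothetical helper: copies a row-major 128x128 "TMEM" buffer to "GMEM"
// through an 8-column register fragment and a 128x64 staging tile.
EpilogueStats run_epilogue(const std::vector<float>& tmem,
                           std::vector<float>& gmem) {
    assert(tmem.size() == kRows * kCols);
    gmem.assign(kRows * kCols, 0.0f);
    std::vector<float> dsmem(kRows * kSmemCols);  // half-width staging buffer
    EpilogueStats stats;

    for (std::size_t tile = 0; tile < kCols / kSmemCols; ++tile) {        // 2x
        const std::size_t tile_col0 = tile * kSmemCols;
        for (std::size_t step = 0; step < kSmemCols / kLdCols; ++step) {  // 8x
            const std::size_t col0 = step * kLdCols;
            for (std::size_t r = 0; r < kRows; ++r) {
                std::array<float, kLdCols> regs{};
                // TMEM -> registers: one 8-column fragment per row.
                for (std::size_t c = 0; c < kLdCols; ++c)
                    regs[c] = tmem[r * kCols + tile_col0 + col0 + c];
                // Registers -> SMEM staging tile.
                for (std::size_t c = 0; c < kLdCols; ++c)
                    dsmem[r * kSmemCols + col0 + c] = regs[c];
            }
            ++stats.tmem_loads;
        }
        // SMEM -> GMEM: models one TMA store of a 128x64 tile.
        for (std::size_t r = 0; r < kRows; ++r)
            for (std::size_t c = 0; c < kSmemCols; ++c)
                gmem[r * kCols + tile_col0 + c] = dsmem[r * kSmemCols + c];
        ++stats.tma_stores;
    }
    return stats;
}
```

Reusing the same 128×64 buffer for both halves is what gives the SMEM saving: the second tile overwrites the first only after the first has been stored.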