End-to-End GEMM: TMA + TensorCore

Data path: TMA loads tiles from global memory into shared memory; TensorCore reads them from shared memory for compute

[Figure: A (global memory) → TMA → A (shared memory); A (shared memory) × B (shared memory) → TensorCore → C. Loop labels: N block, outer K, inner K.]
B loading path (global → shared memory) omitted for simplicity
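
The data path above can be sketched on the CPU to make the loop structure concrete. This is a plain-Python stand-in, not GPU code. It assumes the figure labels "N block", "outer K" and "inner K" name a three-level loop nest: one strip of C columns per N block, one staged tile per outer-K step, and one MMA-sized slice per inner-K step. The names `tiled_gemm`, `reference_gemm`, `a_smem`, `b_smem` and the tile sizes are hypothetical, and B is staged the same way as A even though the slide omits that path.

```python
def tiled_gemm(A, B, block_n=2, block_k=4, inner_k=2):
    """Compute C = A @ B with the loop nest: N block -> outer K -> inner K."""
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for n0 in range(0, N, block_n):                  # N block: one column strip of C
        n1 = min(n0 + block_n, N)
        for k0 in range(0, K, block_k):              # outer K: one staged tile per step
            k1 = min(k0 + block_k, K)
            # Stand-in for the TMA copy: global -> "shared" staging buffers.
            a_smem = [row[k0:k1] for row in A]
            b_smem = [row[n0:n1] for row in B[k0:k1]]
            for kk0 in range(0, k1 - k0, inner_k):   # inner K: one MMA-sized slice
                kk1 = min(kk0 + inner_k, k1 - k0)
                # Stand-in for the TensorCore MMA: reads only the staged tiles.
                for i in range(M):
                    for j in range(n1 - n0):
                        acc = 0.0
                        for kk in range(kk0, kk1):
                            acc += a_smem[i][kk] * b_smem[kk][j]
                        C[i][n0 + j] += acc
    return C


def reference_gemm(A, B):
    """Untiled reference used to check the tiled result."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]
```

The compute loop never touches `A` or `B` directly, only the staged buffers, which mirrors the split between the copy engine and the math unit. On real hardware the copy and the compute would typically overlap rather than run back to back as they do here.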