End-to-End GEMM: TMA + TensorCore
Data path: TMA loads tiles from global memory into shared memory; the TensorCore reads them from shared memory for compute
A (global memory) → [TMA] → A (shared memory) × B (shared memory) → [TensorCore] → C
Loop indices: N block 0 to 1; Outer K 0 to 1; Inner K 0 to 7
B loading path (global → shared memory) omitted for simplicity
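The data path and loop indices above can be sketched on the CPU in plain Python. This is only an illustrative model, not the actual kernel: `tma_load` and `mma` are hypothetical stand-ins for a TMA bulk copy and a TensorCore matrix-multiply-accumulate, all tile sizes are made up, and the sketch assumes the three indices nest as N block, then Outer K, then Inner K. B is staged with the same helper purely so the example is complete.

```python
import random

# Illustrative sizes (assumptions; only the loop counts come from the figure).
M = 4            # rows of A and C
N_BLOCKS = 2     # "N block" 0..1
BLOCK_N = 4      # columns of C produced per N block
OUTER_K = 2      # "Outer K" 0..1: one staged tile per step
INNER_K = 8      # "Inner K" 0..7: one MMA step per index
MMA_K = 2        # K depth consumed by one inner step
TILE_K = INNER_K * MMA_K
K = OUTER_K * TILE_K
N = N_BLOCKS * BLOCK_N


def tma_load(src, row0, rows, col0, cols):
    """Hypothetical stand-in for a TMA copy: 'global' tile -> 'shared' buffer."""
    return [row[col0:col0 + cols] for row in src[row0:row0 + rows]]


def mma(acc, a_smem, b_smem, k0, depth):
    """Hypothetical stand-in for one TensorCore step:
    acc += a_smem[:, k0:k0+depth] @ b_smem[k0:k0+depth, :]."""
    for i in range(len(acc)):
        for j in range(len(acc[0])):
            for k in range(k0, k0 + depth):
                acc[i][j] += a_smem[i][k] * b_smem[k][j]


def tiled_gemm(A, B):
    C = [[0] * N for _ in range(M)]
    for n_blk in range(N_BLOCKS):
        acc = [[0] * BLOCK_N for _ in range(M)]  # accumulator for this N block
        for ko in range(OUTER_K):
            # Stage one K tile of A (and of B) in "shared memory".
            a_smem = tma_load(A, 0, M, ko * TILE_K, TILE_K)
            b_smem = tma_load(B, ko * TILE_K, TILE_K, n_blk * BLOCK_N, BLOCK_N)
            for ki in range(INNER_K):
                # Each inner step consumes a thin K slice of the staged tiles.
                mma(acc, a_smem, b_smem, ki * MMA_K, MMA_K)
        # Write the finished accumulator into this N block's columns of C.
        for i in range(M):
            C[i][n_blk * BLOCK_N:(n_blk + 1) * BLOCK_N] = acc[i]
    return C


def reference_gemm(A, B):
    """Untiled triple-loop GEMM used to check the tiled version."""
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]
```

Integer inputs keep the comparison against `reference_gemm` exact. The point of the structure is that each staged tile is loaded once per Outer K step and then reused by every Inner K step.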