Layout: S[(2,4,8) : (1@gpuid_y, 8@m, 1@m)] + R[2 : 1@gpuid_x]

R = replicated dimension

8×8 Matrix

GPU Mesh (2×2)

Address computation
addr(i, j) = (i//4) × 1@gpuid_y + (i%4) × 8@m + j × 1@m, for row i and column j with 0 ≤ i, j < 8
R[2 : 1@gpuid_x] → same data on both GPU_x=0 and GPU_x=1
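
The address computation above can be sketched in Python as a sanity check. This is a minimal illustration, not a reference implementation: the function names `shard_address` and `placements` are hypothetical, and it assumes that `m` denotes the per-GPU memory axis and that each `stride@axis` term accumulates independently per axis.

```python
# Sketch of the layout S[(2, 4, 8) : (1@gpuid_y, 8@m, 1@m)] + R[2 : 1@gpuid_x].
# Assumption: "m" is the local memory offset on a GPU; names are illustrative.

SHAPE = (2, 4, 8)
STRIDES = ((1, "gpuid_y"), (8, "m"), (1, "m"))
REPLICAS = 2  # R[2 : 1@gpuid_x]


def shard_address(i, j):
    """Map element (i, j) of the 8x8 matrix to per-axis offsets under S."""
    if not (0 <= i < 8 and 0 <= j < 8):
        raise IndexError("(i, j) must lie inside the 8x8 matrix")
    # The row index is split across the first two shape dims: (i // 4, i % 4).
    coords = (i // SHAPE[1], i % SHAPE[1], j)
    addr = {"gpuid_y": 0, "m": 0}
    for coord, (stride, axis) in zip(coords, STRIDES):
        addr[axis] += coord * stride
    return addr


def placements(i, j):
    """List every location of (i, j): one copy per gpuid_x value."""
    base = shard_address(i, j)
    return [{**base, "gpuid_x": x} for x in range(REPLICAS)]


if __name__ == "__main__":
    print(shard_address(5, 3))
    print(placements(5, 3))
```

Under these assumptions, rows 0 to 3 land on gpuid_y = 0 and rows 4 to 7 on gpuid_y = 1, each GPU holds a 4×8 tile in row-major order (32 elements), and every element appears once per gpuid_x value.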