Data Layout: Sharding
Same notation, different axis → distributed across devices
A[i, j], sharded by column across 2 GPUs
GPU 0 — cols 0-1
GPU 1 — cols 2-3
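The column split above can be sketched in plain Python. This is a minimal illustration under stated assumptions: a 3 x 4 example matrix stored as a list of rows, a column count that divides evenly by the device count, and list slots standing in for GPUs. `shard_by_col` and `locate` are hypothetical helpers, not a framework API.

```python
# Minimal sketch of column sharding for a 2-D array A[i, j].
# Assumptions (not from the slide): A is a list of equal-length rows,
# the column count divides evenly by num_devices, and each "device"
# is simply one entry in the returned list.

def shard_by_col(A, num_devices):
    """Split A into num_devices contiguous column blocks, one per device."""
    n_cols = len(A[0])
    assert n_cols % num_devices == 0, "columns must divide evenly across devices"
    block = n_cols // num_devices
    shards = []
    for d in range(num_devices):
        lo, hi = d * block, (d + 1) * block
        # Every row goes to every device; only the column range differs.
        shards.append([row[lo:hi] for row in A])
    return shards


def locate(j, n_cols, num_devices):
    """Map global column j to (device index, local column index)."""
    block = n_cols // num_devices
    return j // block, j % block


# Hypothetical 3 x 4 example: A[i][j] = 10*i + j
A = [[10 * i + j for j in range(4)] for i in range(3)]
shards = shard_by_col(A, num_devices=2)
# shards[0] holds cols 0-1 (GPU 0); shards[1] holds cols 2-3 (GPU 1)
```

In this sketch the row index i is the same on every shard and only the column index is offset by the device's first column, so A[i][j] lives at `shards[d][i][j_local]` with `(d, j_local) = locate(j, 4, 2)`. In JAX, the same layout would be expressed with a `NamedSharding` whose `PartitionSpec(None, 'x')` leaves axis 0 unsharded and splits axis 1 over a mesh axis named `'x'`.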