15-442/15-642: Machine Learning Systems
Data Layouts
Spring 2026
Tianqi Chen and Zhihao Jia
Carnegie Mellon University
Outline
- Tiled & Thread Layout
- Distributed Layout
- Swizzle Layout
What is Data Layout
Mapping from logical indices to the physical location of the data, in memory or across different devices
Shape–Stride Model
We will use this notation:
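As a concrete sketch (our own Python illustration, not course code), the shape–stride model maps a logical index tuple to a flat physical offset by a dot product with the strides:

```python
def offset(indices, strides):
    # Physical offset = sum over axes of index * stride.
    return sum(i * s for i, s in zip(indices, strides))

# Row-major 8x16, S[(8, 16) : (16, 1)]:
#   element (2, 3) lives at offset 2*16 + 3*1 = 35.
# Column-major 8x16, S[(8, 16) : (1, 8)]:
#   the same element lives at offset 2*1 + 3*8 = 26.
```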
Tile Layout
Representing tiling
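A tiled layout is just a shape–stride layout with split axes. As a hedged sketch (the concrete sizes are ours, chosen for illustration): an 8×8 matrix stored as a 2×2 grid of contiguous 4×4 tiles is S[(2, 2, 4, 4) : (32, 16, 4, 1)]:

```python
def offset(indices, strides):
    return sum(i * s for i, s in zip(indices, strides))

# 8x8 matrix stored as a 2x2 grid of contiguous 4x4 tiles:
# S[(2, 2, 4, 4) : (32, 16, 4, 1)]
def tiled_offset(row, col):
    tr, ri = divmod(row, 4)  # tile row, row within tile
    tc, ci = divmod(col, 4)  # tile col, col within tile
    return offset((tr, tc, ri, ci), (32, 16, 4, 1))
```

Each tile occupies 16 contiguous elements, and tiles are laid out row-major; the split axes cover all 64 offsets exactly once.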
Tile Layout: Interactive Example
Named Axes in Strides
We need to represent different hardware axes:
- Lane and column dimensions in tensor memory
- Thread indices
It is useful to name each axis; we use @m for normal memory
Row-major 8×16: S[(8, 16) : (16@m, 1@m)]
The @axis tag can also be other things like @warpid, or @tmemcol
Named Axes Example: Thread + Register
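One way to model named strides in code (a sketch of our own; the @tid and @reg tags follow the slides' @axis notation): pair each stride with its axis tag and accumulate a separate offset per axis:

```python
def locate(indices, named_strides):
    # Accumulate an offset per named hardware axis.
    loc = {}
    for i, (step, axis) in zip(indices, named_strides):
        loc[axis] = loc.get(axis, 0) + i * step
    return loc

# 8x16 tile: one row per thread, 16 elements in that thread's registers:
# S[(8, 16) : (1@tid, 1@reg)]
thread_reg_layout = [(1, "tid"), (1, "reg")]
```

Logical element (3, 5) then lives in register 5 of thread 3, since the row index maps entirely onto the thread axis.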
Outline
- Tiled & Thread Layout
- Distributed Layout
- Swizzle Layout
Distributed Axes: @gpuid_y, @gpuid_x
Representing broadcast / replication
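A sketch under assumed sizes (2 GPUs along gpuid_x; the layout values are ours): sharding rows across @gpuid_x while replicating along @gpuid_y simply means @gpuid_y never appears in the strides, so every device along that axis holds the same copy:

```python
def locate(indices, named_strides):
    loc = {}
    for i, (step, axis) in zip(indices, named_strides):
        loc[axis] = loc.get(axis, 0) + i * step
    return loc

# 8x16 matrix with rows sharded across 2 GPUs along gpuid_x and
# replicated along gpuid_y: S[(2, 4, 16) : (1@gpuid_x, 16@m, 1@m)].
# No stride mentions gpuid_y, so all gpuid_y devices hold identical data.
def owner_and_offset(row, col):
    gx, lr = divmod(row, 4)  # owning GPU along x, local row
    return locate((gx, lr, col), [(1, "gpuid_x"), (16, "m"), (1, "m")])
```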
Distributed Axes: Interactive
Outline
- Tiled & Thread Layout
- Distributed Layout
- Swizzle Layout
Background: Memory Banks
Different memory banks can be read in parallel; accesses that map to the same bank serialize (a bank conflict)
Simple Swizzle: 8×8 Interactive Example
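The 8×8 example can be written as an XOR swizzle (a standalone sketch assuming 8 banks, one element wide each): XORing the column with the row makes both row and column accesses spread across all 8 banks:

```python
BANKS = 8

def bank(row, col, swizzle=False):
    # Row-major 8x8; each element lands in bank (flat address mod 8).
    if swizzle:
        col = col ^ row  # XOR-swizzle the column by the row
    return (row * 8 + col) % BANKS

# Without swizzle, a column access hits one bank 8 times (8-way conflict);
# with swizzle, the 8 accesses of a column land in 8 distinct banks.
```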
Real Example: SWIZZLE_128B
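Our reading of the pattern (a sketch, not the authoritative hardware formula; consult the PTX/TMA docs): with SWIZZLE_128B, each 128-byte row splits into eight 16-byte chunks, and the chunk index (address bits 4–6) is XORed with the row index mod 8 (bits 7–9):

```python
def swizzle_128b(byte_addr):
    # XOR the 16-byte chunk index (bits 4-6) with the row index mod 8
    # (bits 7-9); bits 0-3, the offset within a chunk, are untouched.
    return byte_addr ^ (((byte_addr >> 7) & 7) << 4)
```

For fp16 (2 bytes), a 16-byte chunk holds 8 elements, which is where the "8 rows and 8 columns at a time" conflict-free claim on the next slide comes from. The mapping is its own inverse, so it is a bijection on addresses.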
Swizzle as a Memory Optimization
Swizzle is an implicit remapping applied by the hardware; you only need to enable it
- Pass pre-swizzled addresses and use a consistent swizzle mode (e.g., SWIZZLE_128B) across ops
- Don't compute the swizzle yourself; let the hardware handle it
- Benefit: SWIZZLE_128B gives conflict-free access to 8 rows and 8 columns at a time (fp16)
Swizzle is fundamentally a way to enable efficient column access — the idea is not specific to GPU shared memory!