15-442/15-642: Machine Learning Systems

Data Layouts

Spring 2026
Tianqi Chen and Zhihao Jia
Carnegie Mellon University

Outline

  • Tiled & Thread Layout
  • Distributed Layout
  • Swizzle Layout

What is Data Layout

Mapping from logical indices to the physical location of the data

or across different devices...

Shape–Stride Model

We will use this notation:

This notation can be viewed as a simplified version of CuTe, with the difference that we adopt row-major ordering.
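The shape-stride model above can be sketched in a few lines: a layout S[(s0, s1, ...) : (d0, d1, ...)] maps a logical index (i0, i1, ...) to the physical offset sum(ik * dk). A minimal sketch (the helper name `offset` is ours, not from the slides):

```python
# Shape-stride model: a layout S[shape : strides] maps a logical
# index tuple to a physical offset by an inner product with strides.

def offset(index, strides):
    """Map a logical index tuple to a physical offset."""
    return sum(i * d for i, d in zip(index, strides))

# Row-major 8x16 matrix: S[(8, 16) : (16, 1)]
print(offset((2, 3), (16, 1)))  # 2*16 + 3*1 = 35

# Column-major 8x16 matrix: S[(8, 16) : (1, 8)]
print(offset((2, 3), (1, 8)))   # 2*1 + 3*8 = 26
```

Note that the same logical index lands at different physical offsets purely because the strides differ; the shape only bounds the valid indices.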

Tile Layout

Representing tiling
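Tiling fits the same shape-stride model by splitting each logical axis into a tile index and an in-tile index. As an illustrative sketch (the 8x8-into-4x4 example and the helper name `tiled_offset` are ours): with tiles stored contiguously, row-major over tiles, element (i, j) splits into (tile_i, tile_j, in_i, in_j) under the layout S[(2, 2, 4, 4) : (32, 16, 4, 1)].

```python
# Tiling an 8x8 matrix into 4x4 tiles, tiles stored contiguously in
# row-major tile order: layout S[(2, 2, 4, 4) : (32, 16, 4, 1)].

def tiled_offset(i, j, tile=4, tiles_per_row=2):
    tile_i, in_i = divmod(i, tile)   # which tile row, row within tile
    tile_j, in_j = divmod(j, tile)   # which tile col, col within tile
    strides = (tiles_per_row * tile * tile, tile * tile, tile, 1)
    return sum(k * d for k, d in zip((tile_i, tile_j, in_i, in_j), strides))

# Element (5, 6) lives in tile (1, 1) at in-tile position (1, 2):
print(tiled_offset(5, 6))  # 1*32 + 1*16 + 1*4 + 2 = 54
```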

Tile Layout: Interactive Example

Named Axes in Strides

We need to represent different hardware axes:

  • Lane and column dimensions in tensor memory
  • Thread indices

It is useful to have named axes; we use @m for normal memory

  • Row-major 8×16: S[(8, 16) : (16@m, 1@m)]

The @axis tag can also name other axes, such as @warpid or @tmemcol

Named Axes Example: Thread + Register
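A hedged sketch of a thread + register layout (the concrete layout below is illustrative, not a specific hardware fragment layout): an 8x8 tile spread over the 32 threads of a warp with two register slots per thread. Splitting the column index j into (j_hi, j_lo) = (j // 2, j % 2), the layout is S[(8, 4, 2) : (4@thread, 1@thread, 1@reg)], i.e. thread = 4*i + j_hi and register = j_lo.

```python
# Named-axes sketch: strides on the @thread axis pick a thread id,
# strides on the @reg axis pick a register slot within that thread.

def locate(i, j):
    j_hi, j_lo = divmod(j, 2)
    thread = 4 * i + j_hi   # strides 4@thread and 1@thread
    reg = j_lo              # stride 1@reg
    return thread, reg

print(locate(0, 0))  # (0, 0)
print(locate(7, 7))  # (31, 1): last thread, second register slot
```

The physical "location" here is no longer a single memory offset but a coordinate per named axis, which is exactly what the @axis tags encode.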

Distributed Axes: @gpuid_y, @gpuid_x

Representing broadcast / replication
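Replication can be expressed with a stride of 0 on a distributed axis. As an illustrative sketch (the axis names and helper are ours): a length-8 vector replicated across 4 GPUs is S[(4, 8) : (0@gpuid, 1@m)], since a zero stride maps every gpuid to the same local offset, while S[(4, 8) : (8@gpuid, 1@m)] instead shards a length-32 vector, 8 elements per GPU.

```python
# Stride 0 on a distributed axis = broadcast; nonzero stride = shard.

def local_offset(gpuid, j, gpu_stride):
    return gpuid * gpu_stride + j

# Replicated (stride 0@gpuid): every GPU holds element j at offset j.
print([local_offset(g, 5, 0) for g in range(4)])  # [5, 5, 5, 5]
# Sharded (stride 8@gpuid): each GPU owns a disjoint slice.
print([local_offset(g, 5, 8) for g in range(4)])  # [5, 13, 21, 29]
```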

Distributed Axes: Interactive

Background: Memory Banks

Different memory banks can be read out in parallel
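Why bank layout matters can be sketched numerically. On recent NVIDIA GPUs, shared memory is split into 32 banks of 4-byte words, so bank(addr) = (addr // 4) % 32; reading a column of a row-major 32-wide fp32 array sends every thread to the same bank.

```python
# Shared-memory bank model: 32 banks, 4-byte words.
NUM_BANKS, WORD = 32, 4

def bank(byte_addr):
    return (byte_addr // WORD) % NUM_BANKS

cols = 32  # row-major 32x32 fp32 array

# 32 threads each read one element of column 0 (one per row):
col_banks = {bank((row * cols + 0) * WORD) for row in range(32)}
print(len(col_banks))  # 1: all accesses hit one bank (32-way conflict)

# Reading a row instead touches all 32 banks in parallel:
row_banks = {bank((0 * cols + col) * WORD) for col in range(32)}
print(len(row_banks))  # 32: conflict-free
```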

Simple Swizzle: 8×8 Interactive Example
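The 8x8 example can be sketched with a simple XOR swizzle (8 banks assumed purely for illustration): store element (i, j) at physical column j ^ i. Each row is still a permutation of the columns, but a logical column now spreads across all 8 banks.

```python
# Simple XOR swizzle on an 8x8 tile: physical column = j ^ i.

def swizzled_col(i, j):
    return j ^ i

# Logical column 3: which physical columns (banks) does it touch?
print(sorted(swizzled_col(i, 3) for i in range(8)))  # all of 0..7

# Each row remains a permutation of columns 0..7, so row access
# is still conflict-free as well:
print(sorted(swizzled_col(2, j) for j in range(8)))  # all of 0..7
```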

Real Example: SWIZZLE_128B
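A hedged sketch in the spirit of SWIZZLE_128B (this is the commonly described form, not an authoritative hardware spec; exact bit patterns vary): each 128-byte row is split into 8 chunks of 16 bytes (8 fp16 elements each), and the chunk index is XORed with the low 3 bits of the row index.

```python
# 128-byte swizzle sketch: XOR the 16-byte chunk index with row % 8.
CHUNK_ELEMS = 8  # fp16 elements per 16-byte chunk

def swizzle_128b(row, col):
    chunk, within = divmod(col, CHUNK_ELEMS)
    chunk ^= row % 8          # swizzle the chunk index
    return chunk * CHUNK_ELEMS + within

# Reading the same logical chunk from 8 consecutive rows touches
# 8 distinct physical chunks:
chunks = {swizzle_128b(r, 0) // CHUNK_ELEMS for r in range(8)}
print(len(chunks))  # 8: conflict-free across the 8 rows
```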

Swizzle as a Memory Optimization

Swizzle is an implicit remapping that the hardware applies: just enable it

  • Pass pre-swizzled addresses and use a consistent swizzle mode (e.g. SWIZZLE_128B) across ops
  • Don't compute the swizzle yourself; let the hardware handle it.
  • Benefit: SWIZZLE_128B → conflict-free access to 8 rows and 8 columns at a time (fp16)

Swizzle is fundamentally a way to enable efficient column access; the idea is not specific to GPU shared memory!