Q: Why didn’t pre-FlashAttention ML compilers generate FlashAttention-style attention kernels?

Because AI workloads and hardware evolve faster than DSLs can be adapted and maintained.

One recurring bottleneck:

  1. Users (human or agent) cannot express the best kernel optimizations in the DSL.
  2. The DSL's compiler passes cannot apply the latest optimizations or hardware features automatically, precisely because the DSL cannot express them.
  3. Users then have to open the box and patch compiler passes to inject the optimizations by hand, which can introduce new failures.
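Point 1 can be made concrete with FlashAttention's core algorithmic rewrite, the online (streaming) softmax: the naive attention expression materializes a full softmax row, while the streaming form carries a running max, normalizer, and accumulator across blocks. A compiler pass cannot derive this rewrite from the naive expression, and pre-FlashAttention DSLs had no way to state it directly. A minimal numpy sketch for a single query row (function names and block size are illustrative, not from any particular DSL):

```python
import numpy as np

def naive_softmax_weighted_sum(scores, values):
    # Reference semantics: materialize the full softmax row, then weight values.
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return p @ values

def online_softmax_weighted_sum(scores, values, block=2):
    # Streaming rewrite: process score/value blocks one at a time, carrying
    # a running max (m), running normalizer (l), and output accumulator (acc).
    m = -np.inf                       # max of scores seen so far
    l = 0.0                           # sum of exp(score - m) seen so far
    acc = np.zeros(values.shape[1])   # unnormalized weighted sum so far
    for i in range(0, len(scores), block):
        s, v = scores[i:i+block], values[i:i+block]
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)   # rescale earlier partial sums
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ v
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
scores = rng.normal(size=8)        # one query's attention scores
values = rng.normal(size=(8, 4))   # corresponding value vectors
ref = naive_softmax_weighted_sum(scores, values)
out = online_softmax_weighted_sum(scores, values)
print(np.allclose(ref, out))  # True: exact same result, no full row materialized
```

The rewrite is exact, not approximate, but it changes loop structure, adds correction terms, and reorders reductions; that is the kind of whole-algorithm transformation users needed to express natively rather than hope a pass would discover.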

Design takeaway: native-level access should always be part of the programming model, so that both agents and humans can reach state-of-the-art performance.

DSL references: [1] TileLang (tile-ai/tilelang), [2] Triton (triton-lang/triton), [3] ThunderKittens (HazyResearch/ThunderKittens), [4] Apache TVM, [5] Triton Gluon tutorial (01-intro.py), [6] NVIDIA CUTLASS CuTe DSL docs.
Sources (paraphrased): Horace He, Jane Street talk (from 34:00) and talk transcript; Matt Pharr, "The story of ispc: origins (part 1)" (quoting T. Foley).