Q: Why didn’t pre-FlashAttention ML compilers generate FlashAttention-style attention kernels?
A: AI workloads and hardware evolve faster than DSLs can be adapted and maintained.
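FlashAttention is a useful concrete case: its key algorithmic rewrite is the online (streaming) softmax, which lets attention be computed tile by tile without materializing the full score matrix. Below is a minimal Python sketch of that recurrence (illustrative only, not FlashAttention itself); the point is that this is an algebraic rewrite a compiler pass would have had to discover, not a local scheduling transformation.

```python
import math

def softmax_naive(xs):
    # Two-pass reference: find the max, then normalize.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def softmax_online(xs):
    # Single streaming pass: maintain a running max m and a running sum s
    # that is rescaled whenever the max grows. This is the recurrence that
    # makes tiled/fused attention possible.
    m, s = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / s for x in xs]
```

Both functions agree numerically; the online form is what allows the softmax normalization to be interleaved with the matmul tiles instead of requiring the whole row first.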
- TIRx is our ongoing effort to build an ML compiler DSL for agentic and frontier programming R&D needs (e.g. megakernels).
- We already have many DSLs (TileLang [1], Triton [2], ThunderKittens [3], TVM [4], ...), and history has seen many more come and go.
- In practice, the strongest kernels are still often written in CUDA, or built with CUDA-native libraries like CUTLASS.
- SOTA optimizations and GPU features evolve quickly from generation to generation.
- We are now also seeing native-first DSLs rising for these same reasons (Gluon [5], CuteDSL [6]).
One recurring bottleneck:
- Users (human or agent) cannot express the best kernel optimizations in the DSL.
- The DSL's compiler passes cannot apply the latest optimizations or hardware features automatically when the DSL cannot express them.
- Users (human or agent) then have to open the box and dig into the compiler passes themselves to inject optimizations, which can introduce new failures.
Design takeaway: native-level access should always be part of the programming model, so that agents and humans can reach state-of-the-art performance.
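The bottleneck above can be sketched as a toy compiler design. All names here are hypothetical, not any real DSL's API; the sketch only illustrates the takeaway: the normal compiler path lowers the ops it knows, and a native escape hatch lets the user inject hand-written kernel code for anything the DSL cannot express, without touching the compiler's internals.

```python
class ToyDSL:
    """Hypothetical DSL compiler with a native escape hatch (illustrative only)."""

    def __init__(self):
        # Ops the compiler knows how to lower through its own passes.
        self.known = {"add": "out[i] = a[i] + b[i];"}
        # User-supplied native code, keyed by op name.
        self.native_overrides = {}

    def register_native(self, op, source):
        # Escape hatch: the user injects native-level kernel text directly,
        # instead of patching compiler passes.
        self.native_overrides[op] = source

    def lower(self, op):
        if op in self.native_overrides:   # native access takes priority
            return self.native_overrides[op]
        if op in self.known:              # normal compiler path
            return self.known[op]
        raise NotImplementedError(f"DSL cannot express op: {op}")
```

Without `register_native`, an op like a fused attention kernel that the DSL cannot express is simply unreachable; with it, the user stays inside the programming model rather than forking the compiler.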
DSL references:
[1] TileLang (tile-ai/tilelang),
[2] Triton (triton-lang/triton),
[3] ThunderKittens (HazyResearch/ThunderKittens),
[4] Apache TVM,
[5] Triton Gluon tutorial (01-intro.py),
[6] NVIDIA CUTLASS CuTe DSL docs.
Sources (paraphrased): Horace He talk (from 34:00), Jane Street transcript; Matt Pharr, The story of ispc: origins (part 1) (quoting T. Foley).