Q: Why didn’t pre-FlashAttention ML compilers generate FlashAttention-style attention kernels?
A: AI workloads and hardware evolve faster than DSLs can be adapted and maintained.
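FlashAttention is a useful concrete case: its key algorithmic rewrite is the online (streaming) softmax, which lets attention be computed tile by tile without materializing the full score matrix. Below is a minimal Python sketch of that recurrence (illustrative only, not FlashAttention itself); the point is that this is an algebraic rewrite a compiler pass would have had to discover, not a local scheduling transformation.

```python
import math

def softmax_naive(xs):
    # Two-pass reference: find the max, then normalize.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def softmax_online(xs):
    # Single streaming pass: maintain a running max m and a running sum s
    # that is rescaled whenever the max grows. This is the recurrence that
    # makes tiled/fused attention possible.
    m, s = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / s for x in xs]
```

Both functions agree numerically; the online form is what allows the softmax normalization to be interleaved with the matmul tiles instead of requiring the whole row first.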
- TIRx is our ongoing effort to build an ML compiler DSL for agentic and frontier programming R&D needs (e.g. megakernels).
- We already have many DSLs (TileLang [1], Triton [2], ThunderKittens [3], TVM [4], ...), and history has seen many more come and go.
- In practice, the strongest kernels are still often written in CUDA, or built with CUDA-native libraries like CUTLASS.
- SOTA optimizations and GPU features evolve quickly from generation to generation.
- We are now also seeing native-first DSLs rising for these same reasons (Gluon [5], CuteDSL [6]).
One recurring bottleneck:
- Users (human or agent) cannot express the best kernel optimizations in the DSL.
- The DSL's compiler passes cannot apply the latest optimizations or hardware features automatically when the DSL cannot express them.
- Users (human or agent) then have to open the box and dig into the compiler passes themselves to inject optimizations, which can introduce new failures.
Design takeaway: native-level access should always be part of the programming model, so that agents and humans can reach state-of-the-art performance.
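The bottleneck above can be sketched as a toy compiler design. All names here are hypothetical, not any real DSL's API; the sketch only illustrates the takeaway: the normal compiler path lowers the ops it knows, and a native escape hatch lets the user inject hand-written kernel code for anything the DSL cannot express, without touching the compiler's internals.

```python
class ToyDSL:
    """Hypothetical DSL compiler with a native escape hatch (illustrative only)."""

    def __init__(self):
        # Ops the compiler knows how to lower through its own passes.
        self.known = {"add": "out[i] = a[i] + b[i];"}
        # User-supplied native code, keyed by op name.
        self.native_overrides = {}

    def register_native(self, op, source):
        # Escape hatch: the user injects native-level kernel text directly,
        # instead of patching compiler passes.
        self.native_overrides[op] = source

    def lower(self, op):
        if op in self.native_overrides:   # native access takes priority
            return self.native_overrides[op]
        if op in self.known:              # normal compiler path
            return self.known[op]
        raise NotImplementedError(f"DSL cannot express op: {op}")
```

Without `register_native`, an op like a fused attention kernel that the DSL cannot express is simply unreachable; with it, the user stays inside the programming model rather than forking the compiler.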
DSL references:
[1] TileLang (tile-ai/tilelang),
[2] Triton (triton-lang/triton),
[3] ThunderKittens (HazyResearch/ThunderKittens),
[4] Apache TVM,
[5] Triton Gluon tutorial (01-intro.py),
[6] NVIDIA CUTLASS CuTe DSL docs.
Sources (paraphrased): Horace He talk (from 34:00), Jane Street transcript; Matt Pharr, The story of ispc: origins (part 1) (quoting T. Foley).