CUDA C++ is painful

Programming directly at the lowest level can be painful, as you likely noticed in the previous example.

How Users Think

Human

Humans typically reason in terms of high-level operations over tensor slices (tiles), for example:

copy
GEMM
add
fill

Such subroutines can be reused.
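To make the reuse concrete, here is a minimal plain-Python sketch of three of the operations above (fill, copy, GEMM) composed into a blocked matrix multiply. The helper names, the `tile` descriptor, and the staging buffers are hypothetical and exist only for illustration; this is not the TIRx API, and nested lists stand in for real tensors.

```python
# Hypothetical tile helpers in plain Python; not the TIRx API.

def tile(rows, cols, row0=0, col0=0):
    """Describe a rectangular slice as (row range, column range)."""
    return range(row0, row0 + rows), range(col0, col0 + cols)

def fill(dst, dst_t, value):
    """Set every element of the destination tile to value."""
    rows, cols = dst_t
    for i in rows:
        for j in cols:
            dst[i][j] = value

def copy(dst, dst_t, src, src_t):
    """Copy the source tile into a destination tile of the same shape."""
    (dr, dc), (sr, sc) = dst_t, src_t
    for di, si in zip(dr, sr):
        for dj, sj in zip(dc, sc):
            dst[di][dj] = src[si][sj]

def gemm(c, c_t, a, a_t, b, b_t):
    """Accumulate c_tile += a_tile @ b_tile."""
    (cr, cc), (ar, ac), (br, bc) = c_t, a_t, b_t
    for ci, ai in zip(cr, ar):
        for cj, bj in zip(cc, bc):
            acc = c[ci][cj]
            for ak, bk in zip(ac, br):
                acc += a[ai][ak] * b[bk][bj]
            c[ci][cj] = acc

def blocked_matmul(a, b, n, t):
    """Multiply two n x n matrices using t x t tiles (t must divide n)."""
    c = [[None] * n for _ in range(n)]
    fill(c, tile(n, n), 0)
    # Staging buffers stand in for on-chip memory in this sketch.
    a_stage = [[None] * t for _ in range(t)]
    b_stage = [[None] * t for _ in range(t)]
    whole = tile(t, t)
    for i in range(0, n, t):
        for j in range(0, n, t):
            for k in range(0, n, t):
                copy(a_stage, whole, a, tile(t, t, i, k))
                copy(b_stage, whole, b, tile(t, t, k, j))
                gemm(c, tile(t, t, i, j), a_stage, whole, b_stage, whole)
    return c
```

The kernel body is three tile-level calls; all index arithmetic lives inside the reusable helpers.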

Agent

When agents handle instruction-level details directly, errors are easy to introduce. Reasoning in terms of higher-level kernel algorithms is less error-prone, keeps programs concise, makes changes easier (context is always finite), and makes optimizations easier to compose.

TIRx Mechanism

TIRx provides a first-class mechanism for this: deterministic operator dispatch.

Explicit layout on tensors

The layout (swizzle + tile) attached to each tensor determines its logical-to-physical mapping and provides efficient access to certain tile-slice patterns.
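As a rough illustration of why a swizzled layout helps, the sketch below compares a row-major mapping with a simple XOR swizzle under a deliberately simplified bank model (one element per bank, bank = physical offset modulo the bank count). The function names and the bank model are assumptions made for this example; they do not reflect how TIRx actually represents layouts.

```python
# Simplified model: not TIRx's layout representation.

def row_major(row, col, cols):
    """Logical (row, col) -> physical offset, no swizzle."""
    return row * cols + col

def xor_swizzled(row, col, cols):
    """Logical (row, col) -> physical offset with an XOR swizzle.

    cols must be a power of two so the XOR result stays within the row.
    """
    return row * cols + (col ^ (row % cols))

def banks_touched(layout, rows, cols, col, num_banks):
    """Banks hit when reading one logical column across all rows."""
    return {layout(r, col, cols) % num_banks for r in range(rows)}

ROWS = COLS = BANKS = 8

# Reading logical column 0 down all rows:
plain_banks = banks_touched(row_major, ROWS, COLS, 0, BANKS)
swizzled_banks = banks_touched(xor_swizzled, ROWS, COLS, 0, BANKS)

print(len(plain_banks))     # 1: every row lands in the same bank
print(len(swizzled_banks))  # 8: each row lands in a different bank
```

Both mappings are one-to-one over the 8 x 8 tile; the swizzle only permutes elements within each row, so row access stays conflict-free while column access spreads across banks.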

Execution scopes

Execution scopes are explicit (kernel/CTA/warpgroup/warp/thread), defining exactly which thread group executes each operator.
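The scope hierarchy can be sketched as plain index arithmetic, using the standard CUDA sizes of 32 threads per warp and 4 warps (128 threads) per warpgroup. The function names here are hypothetical and the model is only meant to show which threads a scope refers to; it is not TIRx syntax.

```python
# Hypothetical helpers; they only model which threads a scope covers.

WARP_SIZE = 32
WARPS_PER_WARPGROUP = 4
SCOPE_THREADS = {
    "thread": 1,
    "warp": WARP_SIZE,
    "warpgroup": WARP_SIZE * WARPS_PER_WARPGROUP,
}

def scope_ids(thread_idx):
    """Locate a CTA-local thread index within each enclosing scope."""
    warp = thread_idx // WARP_SIZE
    return {
        "lane": thread_idx % WARP_SIZE,
        "warp": warp,
        "warpgroup": warp // WARPS_PER_WARPGROUP,
    }

def threads_in(scope, index, cta_threads):
    """Return the CTA-local thread indices that make up one scope instance."""
    size = cta_threads if scope == "cta" else SCOPE_THREADS[scope]
    start = index * size
    if start + size > cta_threads:
        raise ValueError(f"{scope} {index} does not fit in a CTA of {cta_threads} threads")
    return range(start, start + size)
```

For example, in a 256-thread CTA, `threads_in("warpgroup", 1, 256)` covers threads 128 through 255, so an operator issued at warpgroup scope is executed cooperatively by exactly those threads.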

Deterministic operator dispatch over tiles

Tile-level operators (Tx.copy, Tx.gemm_async, Tx.cast) operate on tensor tile slices, matching high-level reasoning. They are deterministically dispatched to low-level PTX based on layout, slice, and execution scope.
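The idea of deterministic dispatch can be sketched as a pure table lookup: the same operator, scope, and operand placement always select the same lowering, and an unsupported combination fails loudly instead of falling back to a heuristic. The rule table below is illustrative only. The instruction names are real PTX instruction families, but the keys, the `dispatch` function, and the specific pairings are assumptions for this example, not TIRx's actual rules; a fuller version would also key on layout and slice shape.

```python
# Illustrative rule table; NOT the actual TIRx dispatch rules.
# Key: (operator, execution scope, source memory space, destination memory space)
RULES = {
    ("copy", "thread", "global", "shared"): "cp.async",
    ("copy", "warp", "shared", "register"): "ldmatrix",
    ("gemm", "warp", "register", "register"): "mma.sync",
    ("gemm", "warpgroup", "shared", "register"): "wgmma.mma_async",
}

def dispatch(op, scope, src, dst):
    """Select a lowering by exact match; there is no search and no fallback."""
    key = (op, scope, src, dst)
    if key not in RULES:
        raise LookupError(f"no lowering registered for {key}")
    return RULES[key]

print(dispatch("copy", "warp", "shared", "register"))       # ldmatrix
print(dispatch("gemm", "warpgroup", "shared", "register"))  # wgmma.mma_async
```

Because the choice depends only on explicit inputs, a reader (human or agent) can predict the generated instruction from the operator call site alone.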