We keep making transformers faster. What if we don't need them at all?

1 point, posted 9 hours ago
by anima-core

We keep spending enormous effort making transformers run faster.

Quantization. Pruning. Speculative decoding. Better kernels. Better hardware.

All of that assumes the same thing: that every request should run the model.

I’ve been working on a systems paper that asks a simpler question first: does this request need a transformer invocation at all?

The paper introduces Meaning-First Execution (MFEE), a control-plane layer that sits upstream of the model and routes each request into one of four actions:

RENDER – run the transformer
DIRECT – serve from deterministic logic or cached output
NO_OP – do nothing
ABSTAIN – refuse safely
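
For anyone wondering what a "control-plane layer" looks like concretely, here's a rough Python sketch of the routing step. This is not the paper's implementation: the Router class, its cache and blocklist heuristics, and the return shape are placeholder assumptions meant only to show the idea that the transformer runs on the RENDER path and nowhere else.

    from enum import Enum, auto
    from typing import Optional, Tuple

    class Action(Enum):
        RENDER = auto()   # run the transformer
        DIRECT = auto()   # serve from deterministic logic or cached output
        NO_OP = auto()    # do nothing
        ABSTAIN = auto()  # refuse safely

    class Router:
        """Toy upstream router: decide whether a request needs the model at all."""

        def __init__(self) -> None:
            self.cache: dict[str, str] = {}              # prompt -> previously rendered output
            self.blocklist = ("ignore previous instructions",)  # illustrative refusal trigger

        def route(self, prompt: str) -> Tuple[Action, Optional[str]]:
            text = prompt.strip().lower()
            if not text:
                return Action.NO_OP, None                # nothing to do
            if any(p in text for p in self.blocklist):
                return Action.ABSTAIN, "Request refused."
            if text in self.cache:
                return Action.DIRECT, self.cache[text]   # serve without invoking the model
            return Action.RENDER, None                   # only this path hits the transformer

In a real deployment the decision would come from the paper's actual classifiers rather than these toy string checks, but the structure is the same: the expensive path is opt-in, not the default.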

On a representative replay workload of 1,000 mixed prompts, this reduced transformer execution by 75.1% while preserving 100% output equivalence when the model was invoked.

The idea isn’t to replace existing optimizations like quantization or kernel fusion. MFEE sits before all of that and reduces how often those optimizations are even needed in the first place.

What surprised me while working on this is how much attention goes into squeezing marginal gains out of execution, while the larger question of whether execution is needed at all gets far less focus.

The evaluation harness is public and reproducible if you want to dig into the methodology.

Thoughts?