High inference latency and spiraling GPU costs have emerged as the primary bottlenecks for IT leaders deploying agentic AI systems. These workflows often generate thousands of tokens per query, creating a performance gap that current hardware struggles to bridge.
Now, researchers from the University of Maryland, Lawrence Livermore National Laboratory, Columbia University, and Together AI say they can triple inference speed on reasoning benchmarks by fine-tuning pretrained models so that acceleration is embedded into their weights, removing the need for speculative decoding or auxiliary draft models.
In a paper published this month, the team describes a multi-token prediction technique that converts standard next-token models into parallel decoders using a special added mask token and an online self-distillation objective.
In benchmark tests, the approach delivered more than 3x acceleration with minimal accuracy loss, a trade-off that could appeal to enterprises struggling to balance cost and model quality in production AI systems.
The final model reportedly retains the same implementation as the original pretrained checkpoint and can be deployed without an auxiliary verifier or any other specialized inference code.
How the technique works
Traditional LLMs generate one token per forward pass, a design that inherently caps throughput.
This serial bottleneck is especially problematic for reasoning models, which generate thousands of tokens during a “chain of thought,” even for short final responses. Producing multiple tokens in one pass reduces both latency and cost.
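The arithmetic behind that claim is straightforward. A minimal sketch (illustrative only, not the authors' code) of how emitting several tokens per forward pass cuts the pass count for a long reasoning trace:

```python
# Illustrative sketch: counting forward passes for serial next-token
# decoding vs. multi-token decoding of the same output length.

def passes_needed(total_tokens: int, tokens_per_pass: int) -> int:
    """Number of forward passes required to emit `total_tokens`."""
    return -(-total_tokens // tokens_per_pass)  # ceiling division

# A chain-of-thought trace of 4,000 tokens:
serial = passes_needed(4000, 1)    # 4000 passes, one token each
parallel = passes_needed(4000, 4)  # 1000 passes at 4 tokens each
print(serial, parallel)
```

Since each forward pass dominates latency, a model that reliably emits four tokens per pass needs roughly a quarter of the passes, which is where the reported speedups come from.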
To ensure coherence, the researchers rely on a student–teacher setup. Using a zookeeper analogy, they note that a model predicting multiple words independently might nonsensically output that a zookeeper fed “meat to a panda.” The teacher model evaluates these multi-token spans to ensure they make sense together.
“We propose an RL-inspired training paradigm in which a student model generates a span of simultaneous token predictions,” the researchers said in the paper. “To avoid the pitfalls of the standard offline objective, the student output is scored by an LM critic/teacher, rather than being scored against a known ground-truth token sequence.”
“By comparing the student’s predictions against the next-token suggestions made by the teacher, we produce an on-policy reward signal that enables the student to quickly improve the quality of its multi-token predictions,” they added.
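One way to picture that reward signal is as an agreement score between the student's proposed span and the teacher's own token-by-token choices. The sketch below is a hedged illustration of the idea under simplifying assumptions (greedy teacher, exact-match scoring); the function names are hypothetical, not from the paper:

```python
# Hedged sketch of an on-policy reward: score a student's k-token span
# by how often it matches what the teacher would have generated, rolling
# the teacher forward over the student's own tokens.

from typing import Callable, List

def span_reward(
    prefix: List[int],
    student_span: List[int],
    teacher_next_token: Callable[[List[int]], int],
) -> float:
    """Fraction of student tokens that agree with the teacher's
    next-token prediction, conditioned on the student's rollout."""
    context = list(prefix)
    matches = 0
    for tok in student_span:
        if teacher_next_token(context) == tok:
            matches += 1
        context.append(tok)  # on-policy: condition on the student's choice
    return matches / len(student_span)

# Toy teacher that always predicts previous token + 1:
teacher = lambda ctx: ctx[-1] + 1
print(span_reward([1], [2, 3, 9], teacher))  # 2 of 3 tokens agree
```

Because the score is computed against the teacher's suggestions rather than a fixed ground-truth sequence, the student gets credit for any coherent continuation the teacher endorses, which is the property the authors highlight.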
At inference time, the system uses a confidence-adaptive (ConfAdapt) decoding strategy that dynamically determines how many tokens to emit per pass. When the model is highly confident, it outputs larger chunks. When uncertainty rises, it falls back to smaller steps, preserving accuracy while maintaining speed gains.
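The acceptance rule can be sketched in a few lines. This is an assumed simplification of ConfAdapt's behavior (the actual thresholding and chunking logic in the paper may differ): accept the longest run of proposed tokens whose confidence stays high, and fall back to single-token decoding otherwise.

```python
# Assumed sketch of confidence-adaptive emission: the threshold value
# and prefix-acceptance rule are illustrative, not the paper's exact rule.

from typing import List

def accept_tokens(confidences: List[float], threshold: float = 0.9) -> int:
    """Accept the longest prefix of proposed tokens whose confidence
    stays above `threshold`; always emit at least one token per pass."""
    accepted = 0
    for c in confidences:
        if c < threshold:
            break
        accepted += 1
    return max(accepted, 1)

print(accept_tokens([0.97, 0.95, 0.91, 0.62]))  # accepts the confident prefix of 3
print(accept_tokens([0.40, 0.99]))              # low confidence: emits 1 token
```

The "at least one token" floor guarantees the method never decodes slower than the standard autoregressive baseline, while confident stretches are emitted in larger chunks.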
In experiments on the GSM8K math reasoning benchmark, an 8B-parameter model achieved more than 3x acceleration with less than a 3 percent drop in accuracy. A smaller 4B-parameter model reached similar speedups, though with a larger 7 percent accuracy drop. More aggressive configurations pushed acceleration to 5x, at steeper accuracy costs.
Unlike speculative decoding, which requires auxiliary speculator models and specialized inference pipelines, this approach trains a single model that retains the same implementation as the original checkpoint and requires no auxiliary verifier.
What this means for enterprise AI
Analysts say the bigger question is whether this approach meaningfully changes how inference stacks are designed in production.
“Speculative decoding attempts to break that constraint by introducing a draft model that proposes tokens and a target model that verifies them,” said Sanchit Vir Gogia, chief analyst at Greyhound Research. “In theory, this yields lossless acceleration. In practice, verification cost, batching interaction, and draft-target drift reduce realized gains.”
By contrast, he said, the multi-token approach retains the autoregressive backbone but shifts optimization into the training phase.
“The economic impact depends on entropy distribution across the output,” Gogia said. “In reasoning-heavy or structured tasks, predictable spans can be emitted in larger blocks with limited degradation. In higher-entropy, open-ended generation, acceleration shrinks. This is selective compression, not universal speed.”
That distinction matters for enterprise deployments.
“ConfAdapt is fundamentally entropy-sensitive,” Gogia said. “Its strategic advantage is maximized in workloads characterized by structured scaffolding, deterministic language segments, and advisory outputs subject to human oversight.”
Enterprises, Gogia said, should view the technique as a calibrated efficiency lever rather than a universal acceleration switch.