A few days ago, on February 12, 2026, OpenAI released GPT-5.3-Codex-Spark. Before that, on January 14, 2026, OpenAI announced its Cerebras partnership. On paper, this is a product launch plus an infrastructure deal. In practice, I think it is a bigger shift: speed is no longer a side optimization. Speed is now a product capability.
Data-center infrastructure context. Photo by Fogo Solutions via Wikimedia Commons (CC BY-SA 4.0).
Why does that matter? Because a lot of us have quietly trained ourselves to trust slow outputs more than fast outputs, even when both are good.
Why I am writing this
Since reasoning models started becoming mainstream, we have been moving toward longer-running model behavior. That trend started with o1-preview on September 12, 2024, where OpenAI explicitly framed the model as one that spends more time thinking before answering. Then on December 5, 2024, ChatGPT Pro introduced o1 pro mode, again emphasizing longer thinking for reliability.
That orientation made sense. In many real tasks, more deliberate reasoning does correlate with better outcomes. And over time, many users internalized this heuristic: if it took longer, it probably thought harder, so it is probably better.
Now we are entering a different phase. OpenAI is explicitly positioning Codex-Spark as real-time coding, and the release notes report more than 1,000 tokens per second on low-latency serving. The same post also says frontier models can run for hours, days, or weeks on long-horizon tasks. So we now have both extremes in one product family: very long-horizon work and near-instant iteration.
That combination is powerful and important, but it is also cognitively tricky for users.
The trust problem in real workflows
Imagine a developer using a coding model that usually takes 10 to 15 minutes for a substantial implementation. They get used to that pacing. They form trust around that pacing.
Then, with a faster serving tier, a similarly capable model returns comparable work in 2 minutes. Even if objective quality is unchanged, many users will feel that something is off. Not because the artifact is worse, but because the elapsed time no longer matches their mental model of “serious thinking.”
This is not only a UX curiosity. I would argue that the effect extends to product decisions, model routing, customer confidence, and evaluation behavior. Teams can end up over-valuing wall-clock duration as a proxy for quality, especially in workflows where humans review outcomes quickly and move on.
This matters especially now that companies are integrating generative technology more often into their customer-facing products.
A simple proposal: token minutes
I want to propose a lightweight metric: token minutes.
Definition: Token minutes represent how long a standardized frontier baseline would have needed to generate the same total token volume (reasoning + output) for a task.
In short, token minutes convert raw throughput differences into a normalized “equivalent work time” unit.
token_minutes = total_tokens / (baseline_tokens_per_second * 60)
Where:
- total_tokens includes reasoning and visible output tokens for the completed run.
- baseline_tokens_per_second is the arithmetic mean of throughput measured on a fixed benchmark harness across current frontier models.
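As a minimal sketch, the formula above can be written as a small Python function. The function name and the input checks are my own illustrative choices, not part of any existing library.

```python
def token_minutes(total_tokens: int, baseline_tokens_per_second: float) -> float:
    """Minutes a standardized baseline would need to generate `total_tokens`.

    `total_tokens` is assumed to count reasoning plus visible output tokens.
    """
    if total_tokens < 0:
        raise ValueError("total_tokens must be non-negative")
    if baseline_tokens_per_second <= 0:
        raise ValueError("baseline_tokens_per_second must be positive")
    # Baseline tokens per second times 60 gives baseline tokens per minute.
    return total_tokens / (baseline_tokens_per_second * 60)
```

Rejecting a non-positive baseline is a design choice: without a meaningful baseline the unit is undefined, so failing loudly seems safer than returning a misleading number.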
Why this helps
If a fast model completes a task in 90 real seconds, but that output corresponds to 12 token minutes against the baseline, we can report both truths at once:
- Real latency: 1.5 minutes
- Normalized work equivalent: 12 token minutes
This separates two things that are currently getting mixed together:
- User experience speed (how quickly you get results)
- Work-equivalent effort (how much token-level computation happened)
Both matter. They just answer different questions.
How I would standardize it
I would define a reproducible benchmark harness and refresh the baseline periodically (for example, monthly, quarterly, or potentially yearly). The frontier baseline pool would include one top-tier model each from OpenAI, Google, Anthropic, and xAI (or any other major lab in the future), measured under the same task suite and instrumentation method.
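One way the baseline could be computed from harness measurements is sketched below in Python. It averages each model's throughput samples first and then averages across models, so a model with more samples does not dominate the pool. That two-step averaging is my assumption, and the model keys and numbers are placeholders, not real measurements.

```python
from statistics import fmean


def frontier_baseline(samples_by_model: dict[str, list[float]]) -> float:
    """Arithmetic mean of per-model mean throughput, in tokens per second."""
    if not samples_by_model:
        raise ValueError("need at least one model in the baseline pool")
    per_model_means = []
    for model, samples in samples_by_model.items():
        if not samples:
            raise ValueError(f"no throughput samples for {model}")
        # Average within a model first so every model carries equal weight.
        per_model_means.append(fmean(samples))
    return fmean(per_model_means)


# Placeholder values for illustration only, not real measurements.
pool = {
    "lab_a_frontier": [118.0, 122.0],
    "lab_b_frontier": [130.0, 134.0],
    "lab_c_frontier": [109.0, 111.0],
    "lab_d_frontier": [137.0, 139.0],
}
baseline_tokens_per_second = frontier_baseline(pool)
```

Each refresh of the baseline would rerun the harness and recompute this single number, which should be published together with its measurement date so older reports stay interpretable.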
I would also maintain optional class baselines:
- Frontier class: SOTA models
- Mid class: strong mini-tier models
- Fast class: nano or low-latency models
Then I would publish, for each run:
- Real duration
- Total tokens
- Token minutes
- Class label (frontier / mid / fast)
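A per-run record with those four fields might look like the following Python sketch. The class name, field names, and baseline values are hypothetical. I am also assuming the class label names the baseline the run was normalized against.

```python
from dataclasses import dataclass

# Hypothetical class baselines in tokens per second (placeholder values).
CLASS_BASELINES = {"frontier": 125.0, "mid": 250.0, "fast": 1000.0}


@dataclass(frozen=True)
class RunReport:
    real_seconds: float
    total_tokens: int  # reasoning plus visible output tokens
    baseline_class: str = "frontier"

    @property
    def token_minutes(self) -> float:
        # Normalize against the baseline of the labeled class.
        baseline = CLASS_BASELINES[self.baseline_class]
        return self.total_tokens / (baseline * 60)

    def summary(self) -> str:
        return (
            f"real duration: {self.real_seconds / 60:.1f} min | "
            f"total tokens: {self.total_tokens:,} | "
            f"token minutes: {self.token_minutes:.1f} | "
            f"class: {self.baseline_class}"
        )


report = RunReport(real_seconds=108.0, total_tokens=90_000)
print(report.summary())
```

Publishing the class label next to the number matters because the same run yields different token minutes under different baselines.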
Once this is consistent, teams can compare runs across model families without forcing “longer wall-clock = better” as an implicit rule.
A quick example
Suppose your standardized frontier baseline is 125 tokens/sec.
A fast coding model produces 90,000 total tokens in 1.8 real minutes.
token_minutes = 90,000 / (125 * 60) = 12.0
So that run can be communicated as:
- Completed in 1.8 minutes real time
- Equivalent to 12.0 token minutes of standardized work
This framing preserves the speed win without collapsing perception of effort or depth.
What this is (and what it is not)
This is a hypothesis, not a final conclusion.
I hypothesize that many users attach quality to elapsed time, and that there is a psychological threshold where very fast responses can feel less trustworthy even when they are correct. My current guess is that around the 10-minute mark, trust dynamics shift for many users who are used to long-running coding agents.
I also hypothesize that a normalization metric like token minutes can reduce this bias in product design, model evaluation, and customer reporting.
I plan to test this directly with controlled comparisons and user feedback loops. For now, I am sharing the framework because the market is clearly shifting toward low-latency inference, and our evaluation language should catch up.
Final thought
Faster models are not “less serious” models. They are different interaction modes. As builders, we should measure them with metrics that separate speed from substance.
Real minutes tell you how fast the experience feels. Token minutes can tell you how much standardized model work you got.
We need both.