cjbarber
7 hours ago
It could be interesting to do the metric of intelligence per second.
ie intelligence per token, and then tokens per second
My current feel is that if Sonnet 4.6 was 5x faster than Opus 4.6, I'd be primarily using Sonnet 4.6. But that wasn't true for me with prior model generations, in those generations the Sonnet class models didn't feel good enough compared to the Opus class models. And it might shift again when I'm doing things that feel more intelligence bottlenecked.
But fast responses have an advantage of their own, they give you faster iteration. Kind of like how I used to like OpenAI Deep Research, but then switched to o3-thinking with web search enabled after that came out because it was 80% of the thoroughness with 20% of the time, which tended to be better overall.
estsauver
5 hours ago
I think there's clearly a "Speed is a quality of it's own" axis. When you use Cereberas (or Groq) to develop an API, the turn around speed of iterating on jobs is so much faster (and cheaper!) then using frontier high intelligence labs, it's almost a different product.
Also, I put together a little research paper recently--I think there's probably an underexplored option of "Use frontier AR model for a little bit of planning then switch to diffusion for generating the rest." You can get really good improvements with diffusion models! https://estsauver.com/think-first-diffuse-fast.pdf
refulgentis
5 hours ago
I'm very worried for both.
Cerebras requires a $3K/year membership to use APIs.
Groq's been dead for about 6 months, even pre-acquisition.
I hope Inception is going well, it's the only real democratic target at this. Gemini 2.5 Flash Lite was promising but it never really went anywhere, even by the standards of a Google preview
nl
5 hours ago
Taalas is interesting. 16,000 TPS for Llama on a chip.
replete
10 minutes ago
Its exciting to see, but look at the die size for only an 8b model
micw
2 hours ago
On a very old model, it's more like 16.000 garbage words/s
nl
2 hours ago
Llama 3.1 8B is pretty useful for some thing. I use it to generate SQL pretty reliably for example.
They are doing an updated model in a month or so anyway, then a frontier level one "by summer".
DeathArrow
2 hours ago
I wonder how many token per seconds can they get if they put Mercury 2 on a chip.
freeqaz
5 hours ago
You can call Cerebras APIs via OpenRouter if you specify them as the provider in your request fyi. It's a bit pricier but it exists!
andai
4 hours ago
I used their API normally (pay per token) a few weeks ago. Their Coding Plan appears to be permanently sold out though.
ainch
4 hours ago
I don't think it's a good comparison given Inception work on software and Cerebras/Groq work on hardware. If Inception demonstrate that diffusion LLMs work well at scale (at a reasonable price) then we can probably expect all the other frontier labs to copy them quickly, similarly to OpenAI's reasoning models.
refulgentis
4 hours ago
Definitely depends on what you're buying, maybe some of the audience here was buying Groq and Cerebras chips? I don't think they sold them but can't say for sure.
If you're a poor schmoke like me, you'd be thinking of them as API vendors of ~1000 token/s LLMs.
Especially because Inception v1's been out for a while and we haven't seen a follow-the-leader effect.
Coincidentally, that's one of my biggest questions: why not?
estsauver
4 hours ago
I am currently using their APIs on a paygo plan, I think it might just be a capacity issue for new sign ups.
behnamoh
3 hours ago
Once again, it's a tech that Google created but never turned into a product. AFAIK in their demo last year, Google showed a special version of Gemini that used diffusion. They were so excited about it (on the stage) and I thought that's what they'd use in Google search and Gmail.
7thpower
5 hours ago
What do you mean by Grow is dead since about 6 months ago? Not refuting your point, but I’m curious.
refulgentis
5 hours ago
No new model since GPT-OSS 120B, er maybe Kimi K2 not-thinking? Basically there were a couple models it normally obviously support, and it didn't.
Something about that Nvidia sale smelled funny to me because the # was yuge, yet, the software side shut down decently before the acquisition.
But that's 100% speculation, wouldn't be shocked if it was:
"We were never looking to become profitable just on API users, but we had to have it to stay visible. So, yeah, once it was clear an Nvidia sale was going through, we stopped working 16 hours a day, and now we're waiting to see what Nvidia wants to do with the API"
bigbuppo
5 hours ago
Maybe make that intelligence per token per relative unit of hardware per watt. If you're burning 30 tons of coal to be 0.0000000001% better than the 5 tons of coal option because you're throwing more hardware at it, well, it's not much of a real improvement.
estsauver
4 hours ago
I think the fast inference options have historically been only marginally more expensive then their slow cousins. There's a whole set of research about optimal efficiency, speed, and intelligence pareto curves. If you can deliver even an outdated low intelligence/old model at high efficiency, everyone will be interested. If you can deliver a model very fast, everyone will be interested. (If you can deliver a very smart model, everyone is obviously the most interested, but that's the free space.)
But to be clear, 1000 tokens/second is WAY better. Anthropic's Haiku serves at ~50 tokens per second.
volodia
5 hours ago
We agree! In fact, there is an emerging class of models aimed at fast agentic iteration (think of Composer, the Flash versions of proprietary and open models). We position Mercury 2 as a strong model in this category.
estsauver
4 hours ago
Do you guys all think you'll be able to convert open source models to diffusion models relatively cheaply ala the d1 // LLaDA series of papers? If so, that seems like an extremely powerful story where you get to retool the much, much larger capex of open models into high performance diffusion models.
(I can also see a world where it just doesn't make sense to share most of the layers/infra and you diverge, but curious how you all see the approach.)
jdthedisciple
an hour ago
Interesting suggestion.
Maybe we could use some sort of entropy-based metric as a proxy for that?
josephg
6 hours ago
Yeah I agree with this. We might be able to benchmark it soon (if we can’t already) but asking different agentic code models to produce some relatively simple pieces of software. Fast models can iterate faster. Big models will write better code on the first attempt, and need less loop debugging. Who will win?
At the moment I’m loving opus 4.6 but I have no idea if its extra intelligence makes it worth using over sonnet. Some data would be great!
estsauver
4 hours ago
For what it's worth, most people already are doing this! Some of the subagents in Claude Code (Explore, I think even compaction) default to Haiku and then you have to manually overwrite it with an env variable if you want to change it.
Imagine the quality of life upgrade of getting compaction down to a few second blip, or the "Explore" going 20 times faster! As these models get better, it will be super exciting!
nubg
6 hours ago
Interesting perspective. Perhaps also the user would adopt his queries knowing he can only to small (but very fast) steps. I wonder who would win!
dmichulke
3 hours ago
Useful for evaluating people as well