chillee
2 days ago
This article's math is wrong on many fundamental levels. One of the most obvious ones is that prefill is nowhere near bandwidth bound.
If you compute the MFU implied by the author's numbers, it's 1.44 million input tokens per second * 37 billion active params * 2 (FMA) / 8 [GPUs per instance] = 13 petaFLOPS per second per GPU. That's approximately 7x the absolute peak FLOPS of the hardware. Obviously, that's impossible.
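The arithmetic above can be checked in a few lines. This is a sketch under stated assumptions: the 1.44M tokens/s and 37B active params come from the thread, while the ~1.98 PFLOPS peak is an assumed H100 dense FP8 figure, not something stated in the comment.

```python
# Sanity check of the per-GPU FLOPs implied by the article's numbers,
# as reconstructed in the comment above.
input_tokens_per_s = 1.44e6      # implied prefill throughput (from the comment)
active_params = 37e9             # active parameter count (from the comment)
gpus_per_instance = 8            # GPUs per instance (from the comment)

# 2 FLOPs per parameter per token (fused multiply-add)
flops_per_gpu = input_tokens_per_s * active_params * 2 / gpus_per_instance

peak_fp8 = 1.98e15               # assumed: H100 dense FP8 peak, ~1979 TFLOPS

print(flops_per_gpu / 1e15)      # roughly 13.3 PFLOPS per GPU
print(flops_per_gpu / peak_fp8)  # roughly 6.7x over peak, hence impossible
```

Even taking the most generous peak number, the implied utilization is several multiples of 100% MFU, which is the core of the objection.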
There are many other issues with this article, such as assuming only 32 concurrent requests(?), assuming only 8 GPUs per instance as opposed to the more efficient/standard prefill-decode disaggregated setups, assuming that attention computation is the main thing that makes models compute-bound, etc. It's a bit of an indictment of HN's understanding of LLMs that most people are bringing up issues with the article that aren't any of these fundamental misunderstandings.
pama
2 days ago
Agree that the writeup is very wrong, especially for the output tokens. Here is how anyone with enough money to allocate a small cluster of powerful GPUs has been able to decode huge models at scale for nearly 4 months now, at a cost of 0.2 USD/million output tokens:
https://lmsys.org/blog/2025-05-05-large-scale-ep/
This has gotten significantly cheaper since then with additional code optimizations and the use of B200s.
ma2rten
2 days ago
You can also look at the prices of open-source models on openrouter, which are a fraction of the cost of closed-source models. This is a market that is heavily commoditized, so I would expect it to reflect the true cost plus a small margin.
pama
2 days ago
If you make careful calculations and estimate the theoretical margins for inference alone of most of the big open models on openrouter, the margins are typically very high if the openrouter providers serve at scale (north of 800% for most of the large models). The high prices probably reflect salaries, investments, and amortization of other expenses like free serving or occasional partial serving occupancy. It is sometimes hard to keep a uniformly high load because of user preferences that don't get covered at any price, e.g. maximal context length (which costs output performance), latency, and time to first token, but also things like privacy guarantees, or simply switching quickly to the next best model. I have always thought that centralized inference is the real goldmine of AI, because you get so much value at scale for hardly any cost.
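As an illustration of the kind of margin estimate described above: taking the ~$0.20/million output tokens serving cost cited earlier in the thread and a hypothetical list price of $2.00/million tokens (an assumed round number, not a quote from any specific provider), the implied markup is:

```python
# Illustrative margin sketch with hypothetical numbers.
cost_per_mtok = 0.20    # serving cost cited in the thread (USD/M output tokens)
price_per_mtok = 2.00   # assumed list price for illustration (USD/M tokens)

# Markup over marginal serving cost, as a percentage
markup_pct = (price_per_mtok - cost_per_mtok) / cost_per_mtok * 100
print(markup_pct)       # 900.0, i.e. in the "north of 800%" range claimed
```

The exact figure depends entirely on the assumed price and realized utilization; the point is only that plausible numbers land well above 800%.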
Aeolun
2 days ago
As much as I appreciate you saying the math is wrong, it doesn’t really help me adjust my expectations unless you provide correct numbers as well.
resonious
2 days ago
Right. Now I want to know if they're really losing money or not.
Den_VR
2 days ago
So, bottom line, do you think it’s probable that either OpenAI or Anthropic are “losing money on inference?”
chillee
2 days ago
No. In some sense, the article comes to the right conclusion haha. But it's probably >100x off on its central premise about output tokens costing more than input.
martinald
2 days ago
Thanks for the correction (author here). I'll update the article - very fair point on the compute for input tokens, which I messed up. Tbh I'm pleased my napkin math was only 7x off the laws of physics :).
Even rerunning the math on my use cases with way higher input token cost doesn't change much though.
chillee
2 days ago
The 32 parallel sequences figure is also arbitrary and significantly changes your conclusions. For example, if they run with 256 parallel sequences, that would make both prefill and decode 8x cheaper in your calculations.
The claim that long context lengths are required for attention to be compute-bound is also quite misleading.
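The batch-size point above can be sketched simply. Assuming decode is bandwidth-bound (weights streamed from memory once per forward pass, amortized across the whole batch), cost per token scales roughly as 1/batch until the GPU becomes compute-bound — this is a simplified model, not the provider's actual cost curve:

```python
# Simplified model: bandwidth-bound decode amortizes one weight read
# over all sequences in the batch, so relative cost per token ~ 1/batch.
def relative_cost_per_token(batch_size: int) -> float:
    return 1.0 / batch_size

# 256 parallel sequences vs. the article's 32 -> 8x cheaper per token
print(relative_cost_per_token(32) / relative_cost_per_token(256))  # 8.0
```

In practice the scaling flattens as KV-cache traffic grows and the kernel shifts toward compute-bound, but the 8x factor is the right first-order correction for the article's assumption.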
Barbing
2 days ago
Anyone up to publishing their own guess range?
doctorpangloss
2 days ago
I’m pretty sure input tokens are cheap because they want to ingest the data for training later no? They want huge contexts to slice up.
awwaiid
a day ago
Afaik all the large providers flipped the default to contractually NOT train on your data. So no, training data context size is not a factor.
diamond559
2 days ago
Even if it is, ignoring the biggest costs going into the product and then claiming they are profitable would be actual fraud.
johnnypangs
2 days ago
As one of those people who doesn’t really understand llms, does anyone have any recommendations to better my understanding of them?