I can get Kimi K2.5 inference on OpenRouter for about $0.50/MTok input + $2.50/MTok output, from six providers that have no moat besides efficiently selling GPU time. We can assume they are doing so at a profit (they have no incentive to do this at a loss), which gives us those numbers as an upper bound on the cost to serve a 1T-A32B model at scale.
Now, we don't know the true size of any of the proprietary models, but my educated guess is that Sonnet is in about the same parameter range, just with better training and much better fine-tuning and RLHF. Yet API pricing for Sonnet is $3/MTok input + $15/MTok output, exactly six times as expensive. Even Haiku is twice as expensive as Kimi K2.5.
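As a quick check of that ratio, here is a minimal sketch using only the prices quoted in this comment (USD per million tokens). The dictionary keys and the function name are just illustrative.

```python
# Prices in USD per million tokens, as quoted in the comment above.
PRICES = {
    "kimi_k2_5": {"input": 0.5, "output": 2.5},
    "sonnet": {"input": 3.0, "output": 15.0},
}

def price_ratio(model, baseline):
    """Return how many times more expensive `model` is than `baseline`."""
    return {
        kind: PRICES[model][kind] / PRICES[baseline][kind]
        for kind in ("input", "output")
    }

ratios = price_ratio("sonnet", "kimi_k2_5")
print(ratios)  # {'input': 6.0, 'output': 6.0}
```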
I find it difficult to believe in a world where those API prices aren't profitable. For subscription pricing it's harder to tell. We hear about those who get insane value out of their subscription, but there has to be a large mass of users who never reach their limits. With company-wide rollouts there might even be a lot of subscription users who consume virtually no tokens at all.
This is false. We may assume it's the most efficient way of generating revenue given their GPUs, but their overall profitability will just be a guess. They would still have incentives to run hardware at maximum, even when it's uncertain to eventually recoup costs.
> a world where those API prices aren't profitable
A lab with employees and models in training has other costs than the operating expenses of a GPU farm.
Why would a company sell inference on OpenRouter if they're not profitable? Except for Groq/Cerebras and a few other hardware companies looking to showcase their new chips.
If they're losing money and have no VC backing, they'd just turn off the lights.
This is like saying that innovative medical drugs could be sold at a profit if only there were no patent protection and the innovative companies would still invest in R&D. Yes, on a token level pure inference costs might be profitable, but the frontier AI labs will surely have to recoup their R&D investments at some point.
Yes. I would not consider Kimi a particularly good model relative to its size, and making a SotA model is a lot more expensive. But training costs are explicitly excluded when talking about the cost to serve tokens.
But that's the same as thinking "This bar is selling a cocktail for $15. I could make it at home for 30 cents. They're making $14.70 of profit per cocktail, the owner must be a millionaire now!"
The problem I have with this analysis is it's missing the multi-dimensional aspect of "is this profitable".
It's fair to say that if all these operators are competing for tokens, the OpenRouter providers (not sure of the exact term, but the people running the models) are accounting for some level of margin.
However, how many of these are running their own data centers and GPUs?
If they are running their own infrastructure, then it's not a simple question of whether each batch of tokens is profitable, since the price also needs to account for the cost of running the data center. It could be that they believe it is profitable in the long term by utilizing the long tail of asset depreciation, but that isn't guaranteed.
IF they aren't running their own infrastructure, then it's much easier to claim that it's profitable and has a margin (outside of running their servers to manage the rented infrastructure).
HOWEVER, a lot of data centers have some pretty crazy low prices for GPUs and may be vying for user base and revenue over profitability. In these cases, if data center growth starts slowing due to slower buildout, then it's very likely GPU prices go up and inference stops being profitable for the OpenRouter providers.
So long term it's not clear how profitable even these open models are.
OpenAI and Anthropic definitely fall into the latter category too. Their infrastructure requirements are much higher than the open models, and they are being given huge discounts so Microsoft/Amazon/Google can all claim revenue (since they have profitability coming from other parts). It's not clear if OpenAI and Anthropic models would be profitable at inference if they were paying rates that cloud hosts would make a profit from.
There are just way too many dimensions to this scenario to flat out state that OpenRouter proves inference is profitable at scale.
Check the token prices for open weight LLMs at various independent inference providers.
That gives you a very good estimate of "how cheaply can you serve the tokens of a model of size N while still making a profit".
Now, keep in mind: Kimi K2.5 is 1T MoE. Today's frontier LLMs are in the 1T to 5T range, also MoE. Make an estimate. Compare that estimate with the actual frontier lab prices.
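One way to make that estimate, as a rough sketch: take the Kimi K2.5 prices quoted upthread ($0.5/MTok input, $2.5/MTok output at about 1T total parameters) and scale them with model size. The linear scaling is purely an assumption for illustration; real serving cost depends on active parameters, batching, and hardware, and `scaled_price` is a made-up helper name.

```python
# Kimi K2.5 reference point from upthread: ~1T total parameters,
# $0.5/MTok input and $2.5/MTok output on OpenRouter.
REF_PARAMS_T = 1.0
REF_PRICE = {"input": 0.5, "output": 2.5}

def scaled_price(params_t):
    """Crude estimate: ASSUME price grows linearly with total parameters (in trillions)."""
    factor = params_t / REF_PARAMS_T
    return {kind: price * factor for kind, price in REF_PRICE.items()}

for size in (1.0, 5.0):
    est = scaled_price(size)
    print(f"{size:.0f}T: ${est['input']:.2f}/MTok in, ${est['output']:.2f}/MTok out")
# 1T: $0.50/MTok in, $2.50/MTok out
# 5T: $2.50/MTok in, $12.50/MTok out
```

Under that assumption, even the 5T estimate lands below the $3/MTok input and $15/MTok output quoted upthread for Sonnet.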
I don't think it's as easy as looking at open weight API prices. We don't know whether the operators are making a profit on all the hardware they bought. Maybe the prices we pay just cover electricity. And it's not even certain that running costs are covered by API prices: the operators may be siphoning content and subsidizing prices by selling it.
In the current volatile environment, the API prices are more of a floor: we can assume it can't be much cheaper than that to operate these models.
That doesn't make sense in this environment because everyone is compute constrained with huge backlogs they can't fulfill. If these inference providers aren't making any money, they'd simply sell their GPUs to those who are starved for compute.
Don’t confuse inference (API usage) with the consumer plan products. When people say inference is profitable, they are referring to the cost to serve a token via the API. The consumer products are absolutely a question mark on profitability, and as we see, most of the business and enterprise plans are going away in favor of pure on-demand (API cost) billing full time.
Profitability doesn't imply infinite ability to scale. Of course they will want to prioritize their most profitable customers when they hit capacity issues.
Those are subscription plans. They tweaked the limits/periods included in the subscription. Having higher limits for subscription plans didn't give them any more revenue.
They do it because their demand is higher than the compute they have available. Their GPUs must be melting during peak hours, so they're encouraging people to move their workloads to off-peak hours if possible.
Assuming an 80GB H100 running inference on an MoE model that nearly fills the 80GB of VRAM, you're going to see around 10k tokens/second fully batched and saturated. An example here might be Mixtral 8x7B.
You're generating about 36 million tokens/hour. The cost of Mixtral 8x7B on OpenRouter is $0.54/M input tokens and $0.54/M output tokens.
You're looking at potentially $38.88/hour return on that H100 GPU (that counts the 36M output tokens plus an equal volume of billed input tokens; output alone comes to $19.44/hour). This is probably the best case scenario.
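The arithmetic behind that figure can be reproduced with a short sketch. The throughput and price are the numbers from this comment. Billing an equal volume of input tokens on top of the generated ones is an assumption, and it is the only reading under which the total comes out to $38.88.

```python
TOKENS_PER_SECOND = 10_000   # claimed saturated throughput on one H100
PRICE_PER_MTOK = 0.54        # Mixtral 8x7B price quoted above (same for input and output)

tokens_per_hour = TOKENS_PER_SECOND * 3600
output_revenue = tokens_per_hour / 1e6 * PRICE_PER_MTOK

# ASSUMPTION: an equal volume of input tokens is billed alongside the output tokens.
combined_revenue = 2 * output_revenue

print(tokens_per_hour)             # 36000000
print(round(output_revenue, 2))    # 19.44
print(round(combined_revenue, 2))  # 38.88
```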
In reality, inference providers will use multiple GPUs together to run bigger, smarter models for a higher price.
3.99 at 8x instances, with a minimum 2 week commitment. Good luck getting 70% usage average during that time. Useful when you're running a training round and can properly gauge demand, not so great when you're offering an API.
It says the numbers are theoretically possible. Requiring 66% usage to break even, when 100% usage will piss off customers by invoking a queue, means it’s a balancing act.
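To make the balancing act concrete, here is a minimal sketch of the break-even relation: utilization times full-load revenue has to cover the hourly cost. It ASSUMES the 3.99 upthread is a rental price in USD per GPU-hour and takes the 66% break-even figure at face value; the function names are illustrative only.

```python
def implied_full_load_revenue(hourly_cost, break_even_utilization):
    """Revenue per hour at 100% load implied by a given break-even utilization."""
    return hourly_cost / break_even_utilization

def hourly_margin(hourly_cost, full_load_revenue, utilization):
    """Profit (or loss, if negative) per hour at a given average utilization."""
    return utilization * full_load_revenue - hourly_cost

HOURLY_COST = 3.99   # ASSUMPTION: USD per GPU-hour, read off the "3.99" upthread
BREAK_EVEN = 0.66    # break-even utilization quoted in this comment

full_load = implied_full_load_revenue(HOURLY_COST, BREAK_EVEN)
print(round(full_load, 2))
for utilization in (0.50, 0.66, 0.70):
    print(utilization, round(hourly_margin(HOURLY_COST, full_load, utilization), 2))
```

Under these assumptions the margin at 70% average utilization is only a few tens of cents per GPU-hour, and it turns negative below 66%.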
“Technically correct. The best kind of correct”. So inference may technically be _capable_ of being profitable, but I have questions about it being profitable in _practice_.
Most if not all private labs have claimed that inference is profitable. This was happening before the large push to scrap plans and charge folks the underlying API rates instead. Second, take a look at the pricing of open models. It's certainly not a direct 1:1 comparison, but we can use it as a baseline. Of course folks might not be telling the truth, but this is one of those situations where I see too many markers on the true side.
For supply, look at outages and growth rates at companies like OpenRouter. The demand is growing every week.
> For the data center build outs, demand for tokens is still exceeding supply.
Can you provide any numbers for this?