zurfer
14 hours ago
It makes me wonder if we'll see an explosion of purpose trained LLMs because we hit diminishing returns on investment with pre-training, or if it takes a couple of months to fold these advantages back into the frontier models.
Given the size of frontier models, I would assume that they can incorporate many specializations, and that the most lasting thing here is the training environment.
But there is probably already some tradeoff, as GPT 3.5 was awesome at chess and current models don't seem trained extensively on chess anymore.
criemen
7 hours ago
> or if it takes a couple of months to fold these advantages back into the frontier models.
Right now, I believe we're seeing that the big general-purpose models outperform approximately everything else. Special-purpose models (essentially fine-tunes of smaller models) make sense when you want to solve a specific task at lower cost and lower latency, and you transfer some or most of the abilities in that domain from a bigger model to a smaller one. Usually, people don't do that, because it's quite a costly process, and the frontier models develop so rapidly that you're perpetually behind them (so in fact, you're not providing the best possible abilities).
If/when frontier model development speed slows down, training smaller models will make more sense.
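The ability-transfer step above is typically some form of distillation; one common recipe is the soft-label loss sketched below (a generic sketch with a made-up temperature and purely illustrative names, not any particular lab's pipeline):

    # Minimal sketch of the usual soft-label distillation loss (a generic recipe,
    # not any specific lab's pipeline): the small "student" is trained to match
    # the big "teacher" model's token distribution on domain data.
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          temperature: float = 2.0) -> torch.Tensor:
        # KL divergence between temperature-softened teacher and student distributions
        log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
        p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2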
barrell
6 minutes ago
> If/when frontier model development speed slows down
You do not believe that this has already started? It seems to me that we’re well into a massive slowdown
nextos
2 hours ago
The advantage of small purpose-specific models is that they might be much more robust, i.e., unlikely to generate wrong sequences for your particular domain. That is at least my experience working on this topic during 2025. And, obviously, smaller models mean you can deploy them on cheaper hardware, latency is reduced, energy consumption is lower, etc. In some domains, like robotics, these two advantages might be very compelling, but it's obviously too early to draw any long-term conclusions.
fragmede
4 hours ago
Right, the Costco problem. A small boutique, e.g. a wine store, might be able to do better at picking a very specific wine for a specific occasion, but Costco is just so much bigger that they can make it up in volume, buying cases and cases of everything at a lower markup, so it ends up being cheaper to shop at Costco, no matter how much you want to support the local wine boutique.
Imustaskforhelp
2 hours ago
> But there is probably already some tradeoff, as GPT 3.5 was awesome at chess and current models don't seem trained extensively on chess anymore.
Wow, I am so curious. Can you provide a source?
I am so interested in a chess LLM benchmark, as someone who occasionally plays chess. I have thought about creating things like this, and it would be very interesting to find the best model at chess that isn't Stockfish/Leela but a general-purpose large language model.
I also agree that there might be an explosion of purpose-trained LLMs. I had this idea a year or so ago, when there was Llama / before DeepSeek: what if I want to write SvelteKit? There are models like DeepSeek that know about SvelteKit, but they are so big and bloated when all I want is a SvelteKit/Svelte model. Yes, there are arguments for why you might need the whole network to get better quality, but I genuinely feel that right now the quality edge is debatable, thanks to all this benchmarkmaxxing. I would happily take a model trained on SvelteKit at preferably 4B-8B parameters, but even if an extremely good SOTA-ish SvelteKit model were around 30-40B, I would be happy, since I could buy a GPU for my PC to run it or run it on my Mac.
I think my brother, who (unlike me) actually knows what he's talking about in the AI space, said the same thing to me a few months back.
In fact, it's funny, because I had asked him to create a website comparing benchmarks of LLMs playing chess, with an option to make two LLMs play against each other so we can watch, or to play against an LLM ourselves on an actual chess board on the web, and more. I gave him this idea a few months ago, right after that talk about small LLMs, and he said it was good but that he was busy at the time. I think he later forgot about it, and I had forgotten about it too until now.
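A rough sketch of such a head-to-head match, using python-chess to enforce the rules, might look like the following; query_model() is a hypothetical placeholder standing in for whatever LLM API would actually be called:

    # Rough sketch of an LLM-vs-LLM chess match. python-chess enforces the rules;
    # query_model() is a hypothetical placeholder for the actual LLM API call.
    import random
    import chess

    def query_model(model_name: str, board: chess.Board) -> str:
        # Placeholder: returns a random legal move in SAN. A real benchmark would
        # send the move history so far to the named LLM and parse its reply.
        return board.san(random.choice(list(board.legal_moves)))

    def play_game(white: str, black: str, max_plies: int = 200) -> str:
        board = chess.Board()
        while not board.is_game_over() and board.ply() < max_plies:
            model = white if board.turn == chess.WHITE else black
            move_san = query_model(model, board)
            try:
                board.push_san(move_san)  # illegal/hallucinated moves raise ValueError
            except ValueError:
                return f"{model} loses by illegal move: {move_san}"
        return board.result()  # "1-0", "0-1", "1/2-1/2", or "*" if cut off

    print(play_game("model-a", "model-b"))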
deepanwadhwa
13 hours ago
> GPT 3.5 was awesome at chess
I don't agree with this. I did try to play chess with GPT-3.5 and it was horrible, full of hallucinations.
miki123211
12 hours ago
It was GPT-3 I think.
As far as I remember, it's post-training that kills chess ability for some reason (GPT-3 wasn't post-trained).
Imustaskforhelp
2 hours ago
This is so interesting, and I am curious as to why. Can you (or anyone) please provide any resources or insightful comments about it? They would really help a ton, thanks!
AmbroseBierce
4 hours ago
It reminds me of a story I read somewhere: some guy, high on drugs, climbed to the top of some elevated campus lights, shouting about being a moth and loving lights. The security guys tried telling him to come down, but he paid no attention, and time went on until a janitor came, shut off the lights, then turned on one of those high-powered handheld ones and pointed it at him, and the guy quickly climbed down.
So yeah, I think there are different levels of thinking; maybe future models will have some sort of internal models once they recognize patterns of some level of thinking. I'm not that knowledgeable about the internal workings of LLMs, so maybe this is all nonsense.
onlyrealcuzzo
10 hours ago
Isn't the whole point of the MoE architecture exactly this?
That you can individually train and improve smaller segments as necessary?
ainch
9 hours ago
Generally, all the experts are trained simultaneously. The benefit of MoEs is that you get cheap inference, because you only use the active experts' parameters, which constitute a small fraction of the total parameter count. For example, DeepSeek R1 (which is especially sparse) only uses 1/18th of the total parameters per-query.
pama
4 hours ago
> only uses 1/18th of the total parameters per-query.
only uses 1/18th of the total parameters per token. It may use a large fraction of them in a single query.
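To make the per-token routing concrete, here is a minimal sketch of top-k MoE routing with made-up sizes (not DeepSeek's actual configuration): each token only runs k experts, but a whole multi-token query can still end up touching most of them.

    # Minimal sketch of per-token top-k routing, with made-up sizes (not any real
    # model's configuration). Each token only runs k of the n_experts FFNs, but a
    # single multi-token query can still touch most of them.
    import torch
    import torch.nn as nn

    class TinyMoE(nn.Module):
        def __init__(self, d_model=64, d_ff=128, n_experts=8, k=2):
            super().__init__()
            self.k = k
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x):                     # x: (tokens, d_model)
            # Gate weights taken from the full softmax; real models often
            # renormalize over just the top-k.
            weights, idx = self.router(x).softmax(-1).topk(self.k, dim=-1)
            out = torch.zeros_like(x)
            for t in range(x.size(0)):            # each token runs only k experts
                for w, e in zip(weights[t], idx[t]):
                    out[t] += w * self.experts[int(e)](x[t])
            return out, idx                       # idx records which experts fired

    with torch.no_grad():
        moe = TinyMoE()
        tokens = torch.randn(16, 64)              # one 16-token "query"
        _, idx = moe(tokens)
        print(f"active experts per token: {moe.k}, "
              f"distinct experts touched by the query: {idx.unique().numel()}")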
idiotsecant
9 hours ago
I think it's the exact opposite: you don't specifically train each 'expert' to be an SME at something. Each of the experts is a generalist but becomes better at portions of tasks in a distributed way. There is no 'best baker'; instead, things evolve toward 'best applier of flour', 'best kneader', etc. I think explicitly domain-trained experts are pretty uncommon in modern schemes.
viraptor
9 hours ago
That's not entirely correct. Most MoEs right now are fully load-balanced, but there is an idea of a domain-expert MoE, where training benefits from fewer switches between experts. https://arxiv.org/abs/2410.07490
alephnerd
12 hours ago
> if we'll see an explosion of purpose trained LLMs...
Domain-specific models have been on the roadmap for most companies for years now, from both a competitive perspective (why give up your moat to OpenAI or Anthropic?) and a financial one (why finance OpenAI's margins?).