libraryofbabel
5 hours ago
I read this article back when I was learning the basics of transformers; the visualizations were really helpful. Although in retrospect knowing how a transformer works wasn't very useful at all in my day job applying LLMs, except as a sort of deep background for reassurance that I had some idea of how the big black box producing the tokens was put together, and to give me the mathematical basis for things like context size limitations etc.
I would strongly caution anyone who thinks that they will be able to understand or explain LLM behavior better by studying the architecture closely. That is a trap. Big SotA models these days exhibit so many nontrivial emergent phenomena (in part due to the massive application of reinforcement learning techniques), giving them capabilities very few people expected to ever see when this architecture first arrived. Most of us confidently claimed even back in 2023 that, based on LLM architecture and training algorithms, LLMs would never be able to perform well on novel coding or mathematics tasks. We were wrong. That points towards some caution and humility about using network architecture alone to reason about how LLMs work and what they can do. You'd really need to be able to poke at the weights inside a big SotA model to even begin to answer those kinds of questions, but unfortunately that's only really possible if you're a "mechanistic interpretability" researcher at one of the major labs.
Regardless, this is a nice article, and this stuff is worth learning because it's interesting for its own sake! Right now I'm actually spending some vacation time implementing a transformer in PyTorch just to refresh my memory of it all. It's a lot of fun! If anyone else wants to get started with that, I would highly recommend Sebastian Raschka's book and YouTube videos as a way into the subject: https://github.com/rasbt/LLMs-from-scratch .
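For anyone doing the same exercise, the heart of it is only a few lines. Here's a minimal single-head self-attention sketch in PyTorch (no masking, no dropout, no multi-head splitting; dimensions are made up for illustration), just to show the shape of the thing you end up building:

```python
# Minimal single-head self-attention, the kind of block you'd write
# as a learning exercise. Not a production implementation.
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = d_model ** 0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # attention weights: each position attends over all positions
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return attn @ v  # (batch, seq_len, d_model)

x = torch.randn(2, 5, 16)
out = SelfAttention(16)(x)
print(out.shape)  # torch.Size([2, 5, 16])
```

A full transformer block then just wraps this with residual connections, layer norm, and a feed-forward layer.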
Has anyone read TFA author Jay Alammar's book (published Oct 2024) and would they recommend it for a more up-to-date picture?
ozgung
4 hours ago
I think the biggest problem is that most tutorials use words to illustrate how the attention mechanism works. In reality, there is nothing word-like inside a Transformer: token IDs are converted to embedding vectors at the input, and everything after that operates purely on vectors. An LLM does not perform language processing inside the Transformer blocks, and a Vision Transformer does not perform image processing; words and pixels are only relevant at the input. I think this misunderstanding was a root cause of underestimating their capabilities.
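You can see this concretely in a few lines of NumPy: the only place words (or word pieces) appear is the embedding lookup at the boundary; the attention arithmetic itself never touches them. (The vocabulary and sizes below are made up for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary: words exist only at this input boundary.
vocab = {"the": 0, "cat": 1, "sat": 2}
token_ids = [vocab[w] for w in ["the", "cat", "sat"]]

# Embedding lookup: after this line, "words" are gone; only vectors remain.
emb = rng.standard_normal((len(vocab), 8))
x = emb[token_ids]  # (seq_len=3, d=8)

# Scaled dot-product attention operates purely on those vectors.
scores = x @ x.T / np.sqrt(x.shape[1])
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
out = weights @ x  # still just vectors, shape (3, 8)
print(out.shape)
```

Everything from here to the output logits is the same vector arithmetic, whether the inputs were word pieces or image patches.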
lugu
2 hours ago
Nice video on mechanistic interpretability from Welch Labs:
miki123211
4 hours ago
> Most of us confidently claimed even back in 2023 that, based on LLM architecture and training algorithms, LLMs would never be able to perform well on novel coding or mathematics tasks.
I feel like there are three groups of people:
1. Those who think that LLMs are stupid slop-generating machines which couldn't ever possibly be of any use to anybody, because there's some problem that is simple for humans but hard for LLMs, which makes them unintelligent by definition.
2. Those who think we have already achieved AGI and don't need human programmers any more.
3. Those who believe LLMs will destroy the world in the next 5 years.
I feel like the composition of these three groups has been pretty much constant since the release of ChatGPT, and as with most political fights, evidence doesn't convince people either way.
libraryofbabel
3 hours ago
Those three positions are all extreme viewpoints. There are certainly people who hold them, and they tend to be loud and confident and have an outsize presence in HN and other places online.
But a lot of us have a more nuanced take! It's perfectly possible to believe simultaneously that 1) LLMs are more than stochastic parrots 2) LLMs are useful for software development 3) LLMs have all sorts of limitations and risks (you can produce unmaintainable slop with them, and many people will, there are massive security issues, I can go on and on...) 4) We're not getting AGI or world-destroying super-intelligence anytime soon, if ever 5) We're in a bubble and it's going to pop and cause a big mess 6) This tech is still going to be transformative long term, on a similar level to the web and smartphones.
Don't let the noise from the extreme people who formed their opinions back when ChatGPT came out drown out serious discussion! A lot of us try and walk a middle course with this and have been and still are open to changing our minds.
brcmthrowaway
3 hours ago
How was reinforcement learning used as a gamechanger?
What happens to an LLM without reinforcement learning?
malaya_zemlya
9 minutes ago
You can download a base model (aka foundation, aka pretrain-only) from huggingface and test it out. These were produced without any RL.
However, most modern LLMs, even base models, are not trained just on raw internet text. Most of them were also fed a huge amount of synthetic data. You can often see the exact details in their model cards. As a result, if you sample from them, you will notice that they love to output text that looks like:
6. **You will win millions playing bingo.**
- **Sentiment Classification: Positive**
- **Reasoning:** This statement is positive as it suggests a highly favorable outcome for the person playing bingo.
This is not your typical internet page.
libraryofbabel
2 hours ago
The essence of it is that after the "read the whole internet and predict the next token" pre-training step (and the chat fine-tuning), SotA LLMs now have a training step where they solve huge numbers of tasks that have verifiable answers (especially programming and math). The model therefore gets the very broad general knowledge and natural language abilities from pre-training and gets good at solving actual problems (problems that can't be bullshitted or hallucinated through because they have some verifiable right answer) from the RL step. In ways that still aren't really understood, it develops internal models of mathematics and coding that allow it to generalize to solve things it hasn't seen before. That is why LLMs got so much better at coding in 2025; the success of tools like Claude Code (to pick just one example) is built upon it. Of course, the LLMs still have a lot of limitations (the internal models are not perfect and aren't like how humans think at all), but RL has taken us pretty far.
Unfortunately the really interesting details of this are mostly secret sauce stuff locked up inside the big AI labs. But there are still people who know far more than I do who do post about it, e.g. Andrej Karpathy discusses RL a bit in his 2025 LLMs Year in Review: https://karpathy.bearblog.dev/year-in-review-2025/
brcmthrowaway
an hour ago
Do you have the answer to the second question? Is an LLM trained on the internet just GPT-3?
libraryofbabel
31 minutes ago
I don't know - perhaps someone who's more of an expert or who's worked a lot with open source models that haven't been RL-ed can weigh in here!
But certainly without the RL step, the LLM would be much worse at coding and would hallucinate more.
nrhrjrjrjtntbt
4 hours ago
It is almost like understanding wood at a molecular level and being a carpenter. It may also help the carpentry, but you can be a great carpenter without it. And a bad one with the knowledge.