libraryofbabel
5 hours ago
I read this article back when I was learning the basics of transformers; the visualizations were really helpful. Although in retrospect knowing how a transformer works wasn't very useful at all in my day job applying LLMs, except as a sort of deep background for reassurance that I had some idea of how the big black box producing the tokens was put together, and to give me the mathematical basis for things like context size limitations etc.
I would strongly caution anyone who thinks that they will be able to understand or explain LLM behavior better by studying the architecture closely. That is a trap. Big SotA models these days exhibit so many nontrivial emergent phenomena (in part due to the massive application of reinforcement learning techniques), giving them capabilities very few people expected to ever see when this architecture first arrived. Most of us confidently claimed even back in 2023 that, based on LLM architecture and training algorithms, LLMs would never be able to perform well on novel coding or mathematics tasks. We were wrong. That points towards some caution and humility about using network architecture alone to reason about how LLMs work and what they can do. You'd really need to be able to poke at the weights inside a big SotA model to even begin to answer those kinds of questions, but unfortunately that's only really possible if you're a "mechanistic interpretability" researcher at one of the major labs.
Regardless, this is a nice article, and this stuff is worth learning because it's interesting for its own sake! Right now I'm actually spending some vacation time implementing a transformer in PyTorch just to refresh my memory of it all. It's a lot of fun! If anyone else wants to get started with that, I would highly recommend Sebastian Raschka's book and YouTube videos as a way into the subject: https://github.com/rasbt/LLMs-from-scratch .
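For anyone doing the same exercise, the heart of it is only a few lines. Here's a minimal single-head self-attention sketch in PyTorch (no masking, no dropout, no multi-head splitting; dimensions are made up for illustration), just to show the shape of the thing you end up building:

```python
# Minimal single-head self-attention, the kind of block you'd write
# as a learning exercise. Not a production implementation.
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = d_model ** 0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # attention weights: each position attends over all positions
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return attn @ v  # (batch, seq_len, d_model)

x = torch.randn(2, 5, 16)
out = SelfAttention(16)(x)
print(out.shape)  # torch.Size([2, 5, 16])
```

A full transformer block then just wraps this with residual connections, layer norm, and a feed-forward layer.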
Has anyone read TFA author Jay Alammar's book (published Oct 2024) and would they recommend it for a more up-to-date picture?
ozgung
4 hours ago
I think the biggest problem is that most tutorials use words to illustrate how the attention mechanism works. In reality, there is nothing word-like inside a Transformer: token IDs are converted to embedding vectors at the input, and everything after that operates purely on vectors. An LLM does not perform language processing inside the Transformer blocks, and a Vision Transformer does not perform image processing; words and pixels are only relevant at the input. I think this misunderstanding was a root cause of underestimating their capabilities.
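You can see this concretely in a few lines of NumPy: the only place words (or word pieces) appear is the embedding lookup at the boundary; the attention arithmetic itself never touches them. (The vocabulary and sizes below are made up for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary: words exist only at this input boundary.
vocab = {"the": 0, "cat": 1, "sat": 2}
token_ids = [vocab[w] for w in ["the", "cat", "sat"]]

# Embedding lookup: after this line, "words" are gone; only vectors remain.
emb = rng.standard_normal((len(vocab), 8))
x = emb[token_ids]  # (seq_len=3, d=8)

# Scaled dot-product attention operates purely on those vectors.
scores = x @ x.T / np.sqrt(x.shape[1])
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
out = weights @ x  # still just vectors, shape (3, 8)
print(out.shape)
```

Everything from here to the output logits is the same vector arithmetic, whether the inputs were word pieces or image patches.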
lugu
2 hours ago
Nice video on mechanistic interpretability from Welch Labs:
miki123211
4 hours ago
> Most of us confidently claimed even back in 2023 that, based on LLM architecture and training algorithms, LLMs would never be able to perform well on novel coding or mathematics tasks.
I feel like there are three groups of people:
1. Those who think that LLMs are stupid slop-generating machines which couldn't ever possibly be of any use to anybody, because there's some problem that is simple for humans but hard for LLMs, which makes them unintelligent by definition.
2. Those who think we have already achieved AGI and don't need human programmers any more.
3. Those who believe LLMs will destroy the world in the next 5 years.
I feel like the composition of these three groups has been pretty much constant since the release of ChatGPT, and as with most political fights, evidence doesn't convince people either way.
libraryofbabel
3 hours ago
Those three positions are all extreme viewpoints. There are certainly people who hold them, and they tend to be loud and confident and have an outsize presence in HN and other places online.
But a lot of us have a more nuanced take! It's perfectly possible to believe simultaneously that 1) LLMs are more than stochastic parrots 2) LLMs are useful for software development 3) LLMs have all sorts of limitations and risks (you can produce unmaintainable slop with them, and many people will, there are massive security issues, I can go on and on...) 4) We're not getting AGI or world-destroying super-intelligence anytime soon, if ever 5) We're in a bubble and it's going to pop and cause a big mess 6) This tech is still going to be transformative long term, on a similar level to the web and smartphones.
Don't let the noise from the extreme people who formed their opinions back when ChatGPT came out drown out serious discussion! A lot of us try and walk a middle course with this and have been and still are open to changing our minds.
brcmthrowaway
3 hours ago
How was reinforcement learning used as a gamechanger?
What happens to an LLM without reinforcement learning?
malaya_zemlya
9 minutes ago
You can download a base model (aka foundation, aka pretrain-only) from huggingface and test it out. These were produced without any RL.
However, most modern LLMs, even base models, are not trained just on raw internet text. Most of them were also fed a huge amount of synthetic data. You can often see the exact details in their model cards. As a result, if you sample from them, you will notice that they love to output text that looks like:
6. **You will win millions playing bingo.**
- **Sentiment Classification: Positive**
- **Reasoning:** This statement is positive as it suggests a highly favorable outcome for the person playing bingo.
This is not your typical internet page.
libraryofbabel
2 hours ago
The essence of it is that after the "read the whole internet and predict the next token" pre-training step (and the chat fine-tuning), SotA LLMs now have a training step where they solve huge numbers of tasks that have verifiable answers (especially programming and math). The model therefore gets the very broad general knowledge and natural language abilities from pre-training and gets good at solving actual problems (problems that can't be bullshitted or hallucinated through because they have some verifiable right answer) from the RL step. In ways that still aren't really understood, it develops internal models of mathematics and coding that allow it to generalize to solve things it hasn't seen before. That is why LLMs got so much better at coding in 2025; the success of tools like Claude Code (to pick just one example) is built upon it. Of course, the LLMs still have a lot of limitations (the internal models are not perfect and aren't like how humans think at all), but RL has taken us pretty far.
Unfortunately the really interesting details of this are mostly secret sauce stuff locked up inside the big AI labs. But there are still people who know far more than I do who do post about it, e.g. Andrej Karpathy discusses RL a bit in his 2025 LLMs Year in Review: https://karpathy.bearblog.dev/year-in-review-2025/
brcmthrowaway
an hour ago
Do you have the answer to the second question? Is an LLM trained on the internet just GPT-3?
libraryofbabel
31 minutes ago
I don't know - perhaps someone who's more of an expert or who's worked a lot with open source models that haven't been RL-ed can weigh in here!
But certainly without the RL step, the LLM would be much worse at coding and would hallucinate more.
nrhrjrjrjtntbt
4 hours ago
It is almost like understanding wood at a molecular level and being a carpenter. It may also help the carpentry, but you can be a great carpenter without it. And a bad one with the knowledge.