jychang
9 hours ago
Coolest part of Qwen3-Next, in my opinion (after the linear attention parts), is that they do MTP without adding another un-embedding matrix.
Deepseek R1 also has an MTP layer (layer 61) https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/mod...
But Deepseek R1 adds embed_tokens and shared_head.head tensors, which are [129280, 7168] or about 2GB in size at FP8.
Qwen3-Next doesn't have that: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct/blob...
So it saves a few GB in active parameters for MTP, which is a Big Deal. This is one of the changes that significantly speeds up inference.
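Rough back-of-the-envelope for that saving (my own arithmetic, not from either model card; FP8 taken as one byte per parameter):

    # Size of one [129280, 7168] tensor at FP8 (1 byte per parameter)
    vocab, hidden = 129_280, 7_168
    params_per_tensor = vocab * hidden        # ~926.7M parameters
    gb_per_tensor = params_per_tensor / 1e9   # ~0.93 GB
    print(round(gb_per_tensor, 2))            # 0.93
    print(round(2 * gb_per_tensor, 2))        # ~1.85 GB for embed_tokens + shared_head.head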
puilp0502
8 hours ago
What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?
jychang
8 hours ago
Speculative decoding! It makes inference a LOT faster.
Instead of generating tokens one at a time, you generate the second one as well, and then use speculative decoding on that second token (instead of having it be produced by a draft model like Qwen 0.6b). If that token checks out as correct, the 2nd token gets generated MUCH faster.
If it's wrong, you have to generate it again the normal way (a lot slower than just checking it). Usually, it's correct, so inference is a lot faster.
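A minimal sketch of that loop in Python, with toy stand-ins for the model calls (full_next, batched_full_next and mtp_draft are made-up placeholders, not real Qwen or vLLM APIs; greedy acceptance assumed):

    # Toy speculative decoding with an MTP head as the draft.
    # One batched forward pass of the "big model" both verifies the draft
    # and generates the token after it, so an accepted draft yields two
    # tokens for roughly the price of one pass.

    def full_next(ctx):                  # pretend big model: deterministic toy rule
        return ctx[-1] + 1

    def batched_full_next(contexts):     # pretend ONE batched forward pass
        return [full_next(c) for c in contexts]

    def mtp_draft(ctx):                  # pretend MTP head: cheap, usually right
        return ctx[-1] + 1 if ctx[-1] % 5 else 0

    def step(ctx):
        draft = mtp_draft(ctx)           # cheap guess for the next token
        t_real, t_after_draft = batched_full_next([ctx, ctx + [draft]])
        if t_real == draft:              # draft verified: keep it plus the bonus token
            return ctx + [draft, t_after_draft]
        return ctx + [t_real]            # draft wrong: fall back to the real token

    ctx = [1, 2, 3]
    for _ in range(5):
        ctx = step(ctx)
    print(ctx)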
stingraycharles
4 hours ago
Because then the second token only needs to be checked, not generated, as it’s already generated? And it’s much faster to generate multiple tokens at the same time than one at a time? Is that the idea?
I’m not an expert on LLMs, just a user.
bdcs
2 hours ago
It relies on an “unintuitive observation”[0] that you can run batches basically for free (up to a limit). So if you only run one inference, you batch it plus a lot of guesses and, if you guess right, can speed up the inference by the number of guesses. If you guess wrong, you're back to regular speed (and still fully correct).
tomp
2 hours ago
No, the parent is wrong.
Checking a token is the same as generating it.
The benefit however is in the next (third) token. After generating tokens 1 and 2 (in one turn), you start generating token 3 (and 4). You also get the “real” prediction for token 2. If the “real” prediction matches the MTP (Multi-Token Prediction) from the previous turn, you have just generated 3 correct tokens (and another speculative). If not, you’ve now corrected token 2, but token 3 is wrong (it follows the wrong token 2) so you need to generate it again.
namibj
3 hours ago
Basically you can generate the next two tokens at once in the same matmul, and roll back to one-at-a-time when verification says you guessed wrong (since that means the second token of the pair was generated from revoked context).
moffkalast
7 hours ago
Hmm but isn't the checking only required because the draft model is not the same model and can only speculate what the main one is thinking, hence the name? If the main model generates two tokens itself, then how can it be wrong about its own predictions?
jychang
6 hours ago
Because if you generate token n+1 with all 48 layers of Qwen3-Next and 80 billion params, and also generate token n+2 with the 1 MTP layer at 2bil params... that n+2 token can be much lower quality than the n+1 token but mostly correct.
Let's say you have a model that generates the string "The 44th president of the United States is ___ ___". Your model will generate "Barack" as the n+1 token, and the MTP layer probably does a good enough job to generate "Obama" as the n+2 token (even though that MTP layer is a mere <2bil parameters in size). Then you just check if "Obama" is correct via the same speculative decoding process, which is a lot faster than if you had to start over from layer 1-48 and generate "Obama" the regular way.
littlestymaar
5 hours ago
> Then you just check if "Obama" is correct via the same speculative decoding process, which is a lot faster than if you had to start over from layer 1-48 and generate "Obama" the regular way.
That doesn't match my understanding of what speculative decoding does: AFAIK with regular speculative decoding you ask a smaller LLM to infer the next few tokens (let's say 5 tokens) and then you can have the big model infer tokens 1, 2, 3, 4, 5 and 6 in parallel (each time starting from the sentence partially completed by the smaller model). Because LLMs are bandwidth bound, doing the same work six times in parallel isn't slower than doing it only once (what's costly is moving the massive model weights between VRAM and the GPU cores).
If token 1,2 and 3 match what the small models inferred, then you keep them. As soon as you have a mismatched token (say token 4) it means that you have to discard the next inferred tokens (here token 5 and 6) because they were calculated under a wrong assumption for token 4.
So if the MTP layer merely replaces the smaller LLM in the previous scheme, with everything else working the same way, you wouldn't save anything when inferring “Obama” (you'd still need to “generate it the regular way”, as there isn't really another way), but you could also start working on the word immediately after “Obama” by assuming “Obama” was already chosen. And if the model actually outputted “Hussein” instead of “Obama”, then the token calculated to come after “Obama” would have to be discarded.
Or maybe my understanding of speculative decoding is completely off…
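For the greedy-matching version described above, the acceptance rule is just “keep the longest prefix of the draft that agrees with the main model, replace the first mismatch with the main model's token, discard the rest.” A minimal sketch (made-up helper, not from any library):

    def accept_prefix(draft_tokens, verified_tokens):
        # draft_tokens: what the small model / MTP head guessed
        # verified_tokens: what the big model produced for the same positions
        accepted = []
        for d, v in zip(draft_tokens, verified_tokens):
            if d != v:
                accepted.append(v)   # corrected token; later drafts are now invalid
                break
            accepted.append(d)       # draft confirmed, kept essentially for free
        return accepted

    # draft ["Barack", "Hussein", "was"] vs verified ["Barack", "Obama", "served"]
    print(accept_prefix(["Barack", "Hussein", "was"],
                        ["Barack", "Obama", "served"]))   # -> ['Barack', 'Obama']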
vman512
2 hours ago
Sounds right. The policy for rejection can depend on what you want - you might accept the top K highest probability tokens or top P probability mass. Or you can do something like importance sampling and probabilistically reject based on the ratio of likelihoods
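The importance-sampling-style policy mentioned here is the standard lossless rule from the speculative sampling papers: accept drafted token x with probability min(1, p_target(x)/p_draft(x)), otherwise resample from the renormalized residual max(p − q, 0). A toy sketch with made-up distributions:

    import random

    def accept_or_resample(x, p_target, q_draft):
        # Accept the drafted token x with prob min(1, p/q); on rejection,
        # resample from the residual distribution max(p - q, 0), renormalized.
        p, q = p_target.get(x, 0.0), q_draft.get(x, 1e-9)
        if random.random() < min(1.0, p / q):
            return x
        residual = {t: max(p_target.get(t, 0.0) - q_draft.get(t, 0.0), 0.0)
                    for t in p_target}
        z = sum(residual.values()) or 1.0
        r, acc = random.random() * z, 0.0
        for t, w in residual.items():
            acc += w
            if r <= acc:
                return t
        return x  # fallback, not reached for proper distributions

    p = {"Obama": 0.9, "Hussein": 0.1}   # toy main-model distribution
    q = {"Obama": 0.7, "Hussein": 0.3}   # toy draft/MTP distribution
    print(accept_or_resample("Hussein", p, q))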
SonOfLilit
7 hours ago
If you ask me to guess an answer, I'll _usually_ produce the same answer as if I had time to think about it deeply, but not always...
eldenring
3 hours ago
the 2nd token is generated without knowing what token was chosen for the 1st token
EMM_386
5 hours ago
I believe it's something along these lines. The MTP head runs simultaneously and generates a probability list based on what it thinks the results will be, learned during training.
If n+1 = "Barack" then n+2 = "Obama" (confidence: 0.90)
If n+1 = "The" then n+2 = "quick" (confidence: 0.45)
If n+1 = "President" then n+2 = "Biden" (confidence: 0.75)
A threshold is set (say, at 90%) so that if the n+2 prediction is above that (as in the first example) it uses it without having to determine it with the main model. It's confident "enough".
namibj
3 hours ago
Well yeah; also inference benefits massively from batching, so you use the guesses to prefill the context needed to infer the next speculated tokens, and if the guesses were wrong, you just have to re-compute the speculated ones that depended on the guessed context.
You compute the next token and guess the one after; then you tentatively take the guess as real and, in the same pass, run inference for the guessed token while speculating on the one after it, which is only valid if the guess was correct.
cubefox
3 hours ago
> What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?
It is only useful for inference and doesn't help with pretraining. Which actually points to speculative decoding not being sufficiently general, as the same underlying property (some sequences of tokens are easy to predict) could be exploited for training as well. See here: https://goombalab.github.io/blog/2025/hnet-future/#d-footnot...
rfoo
8 hours ago
It could be a better draft model than separately trained EAGLE etc for speculative decoding.
humblyCrazy
3 hours ago
How is MTP different from Medusa heads? Also, does this mean this model comes "natively" with speculative decoding - meaning if I use this model in vllm, its throughput should be higher because it is already doing MTP, so it should be able to take advantage of speculative decoding?
Razengan
5 hours ago
Could someone kindly point to a convenient all-on-one ELI5 of all these words? :')
lcnPylGDnU4H9OF
4 hours ago
The best primer I've seen is Andrej Karpathy's first video in his "zero to hero" series. It's worth following along with your own practice.
vessenes
4 hours ago
Unfortunately, no. The industry is moving super quickly, and spinning up new ideas on the backs of old ones at a fast rate. If you want to understand what's going on, I think the best thing to do is some intro courses, train and design some smaller models directly, get a list of core papers and concepts from Claude/Chat/Gemini, and then as you read something like this, if you don't know the acronym (In this case: MTP = Multi Token Prediction), search it up, and see if you have the basis for understanding what it's about. If not, read up on the precursors.
Unlike many disciplines, AI is an arena that doesn't have a lot of intuitive simplified models that are accurate -- most of the simplified models available do not accurately describe what's going on enough to reason about and understand them. So, you just have to start reading!
littlestymaar
an hour ago
> Unfortunately, no. The industry is moving super quickly, and spinning up new ideas on the backs of old ones at a fast rate.
I don't think it moves that fast.
I mean, there are very few fundamental differences between GPT-2 and gpt-oss-120b; it's just incremental improvements that don't change much of the full picture (a variation on the attention architecture and masking, a different activation function, the positional encoding, and swapping the MLP layers for a sparse “mixture of experts”). At the end of the day, from Mistral to Deepseek by way of Llama and Qwen3, it's always the same stack of transformer layers with slight variations between any two architectures.
This Qwen3-Next is special though, as it's the first time a major player is releasing something that different (lesser players have made hybrid architecture LLMs for the past two years, but when it comes to language models, IBM really isn't comparable to Alibaba). This is what I expected Llama4 to be.
pmarreck
2 hours ago
The following was generated by chatG5:
Qwen3-Next — A family of large language models from Qwen (Alibaba).
DeepSeek R1 — Another large open-source language model from DeepSeek AI.
Linear attention — A type of transformer attention that scales linearly with sequence length, making long-context processing cheaper.
MTP (Multi-Token Prediction) — Training/inference trick where the model predicts multiple future tokens at once, speeding things up.
Embedding — Converts words/tokens into vectors (numbers) the model can work with.
Un-embedding — The reverse step: mapping the model’s internal vector back into tokens.
embed_tokens — The big lookup table of embeddings (token → vector).
shared_head.head tensors — Extra weight matrices used for prediction; they can be huge.
[129280, 7168] — The shape of such a tensor: ~129k rows (tokens in the vocab) × 7k columns (hidden dimension).
FP8 — Floating-point format using 8 bits (compact, faster, less precise).
Active parameters — The weights that actually need to be loaded in GPU memory to run the model.
Inference — Running the model to generate text (as opposed to training it).
GB savings — If you avoid duplicating giant matrices, you save GPU memory and speed things up.
porridgeraisin
4 hours ago
Background:
LLMs take your input, upscale it into a very high dimensional space, and then downscale it back to 1D at the end. This 1D list is interpreted as a list of probabilities -- one for each word in your vocabulary, i.e. f(x) = downscale(upscale(x)). Each of downscale() and upscale() is parameterized (billions of params). I see you have a gamedev background, so as an example: bezier curves are parameterized functions where bezier handles are the parameters. During training, these parameters are continuously adjusted so that the output of the overall function gets closer to the expected result. Neural networks are just really flexible functions for which you can choose parameters to get any expected result, provided you have enough of them (similar to bezier curves in this regard).
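A toy numeric version of that downscale(upscale(x)) picture (made-up sizes, with the whole transformer stack in the middle replaced by a simple average, just to show the shapes):

    import numpy as np

    vocab, hidden = 50, 16                        # made-up sizes
    rng = np.random.default_rng(0)
    W_up = rng.normal(size=(vocab, hidden))       # "upscale": token id -> vector
    W_down = rng.normal(size=(hidden, vocab))     # "downscale": vector -> vocab scores

    def next_token_probs(token_ids):
        h = W_up[token_ids].mean(axis=0)          # stand-in for the big middle part
        logits = h @ W_down                       # 1D list: one score per vocab word
        e = np.exp(logits - logits.max())
        return e / e.sum()                        # probabilities over the vocabulary

    probs = next_token_probs([3, 7, 12])          # "I use arch" as toy token ids
    print(probs.argmax(), round(float(probs.max()), 3))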
---
When training, you make an LLM learn that
I use arch = downscale(upscale(I use))
If you want to predict the next word after that, you do next in sequence the following:
I use arch btw = downscale(upscale(I use arch))
Now, multi-token prediction means having two downscale functions, one for each of the next two words, and learning it that way: basically, you have a second downscale2() that learns how to predict the next-to-next word.
i.e in parallel:
I use arch = downscale1(upscale(I use))
I use ____ btw = downscale2(upscale(I use))
However, this way you'll need twice the number of parameters downscale needs. And if you want to predict more tokens ahead you'll need even more parameters.
What Qwen has done, is instead of downscale1 and downscale2 being completely separately parameterized functions, they set downscale1(.) = lightweight1(downscale_common(.)) and downscale2(.) = lightweight2(downscale_common(.)). This is essentially betting that a lot of the logic is common and the difference between predicting the next and next-to-next token can be captured in one lightweight function each. Lightweight here, means less parameters. The bet paid off.
So overall, you save params.
Concretely,
Before: downscale1.params + downscale2.params
After: downscale_common.params + lightweight1.params + lightweight2.params
Edit: it's actually downscale_common(lightweight()) and not the other way around as I have written above. Doesn't change the crux of the answer, but just including this for clarity.
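A rough parameter-count sketch of that sharing trick (dimensions are made up for illustration, and “lightweight” is modeled here as a small hidden-to-hidden projection, which is an assumption rather than the actual Qwen3-Next MTP design):

    hidden, vocab = 2048, 151_936                 # made-up model dimensions

    def separate_heads(n):                        # one full downscale per predicted position
        return n * hidden * vocab

    def shared_plus_lightweight(n):               # one shared downscale + small per-position layers
        return hidden * vocab + n * hidden * hidden

    before = separate_heads(2)                    # downscale1 + downscale2
    after = shared_plus_lightweight(2)            # downscale_common + lightweight1 + lightweight2
    print(f"{before/1e6:.0f}M vs {after/1e6:.0f}M parameters")   # ~622M vs ~320M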
losvedir
43 minutes ago
Ooooh, neat! That was very well explained, thank you.
pmarreck
2 hours ago
so after your edit it would be (just to clarify):
I use ____ ___ = downscale_common(lightweight1(.)) + downscale_common(lightweight2(.)) ?
And does it generate 2 at a time and keep going that way, or is there some overlap?
JSR_FDED
3 hours ago
Really good
fortyseven
3 hours ago
Dude, this was like that woosh of cool air on your brain when an axe splits your head in half. That really brought a lot of stuff into focus.
wickedsight
4 hours ago
For me, ChatGPT or any of the other current thinking models are very useful for this type of stuff. I just ask to explain it on my level and then I can ask questions for clarification.