inerte
6 days ago
Not sure if I would trade off speed for accuracy.
Yes, it's incredibly boring to wait for the AI agents in IDEs to finish their job. I get distracted and open YouTube. Once I gave Cline a prompt so big and complex that it spent 2 straight hours writing code.
But after these 2 hours I spent 16 more tweaking and fixing all the stuff that wasn't working. I now realize I should have done things incrementally even when I have a pretty good idea of the final picture.
I've been increasingly using only the "thinking" models: o3 in ChatGPT, and Gemini / Claude in IDEs. They're slower, but usually get it right.
But at the same time I am open to the idea that speed can unlock new ways of using the tooling. It would still be awesome to basically just have a conversation with my IDE while I am manually testing the app. Or to combine really fast models like this one with a "thinking" background model that runs for seconds/minutes and tries to catch the bugs left behind.
I guess only giving it a try will tell.
XenophileJKO
6 days ago
So my personal belief is that diffusion models will enable higher degrees of accuracy. This is because unlike an auto-regressive model it can adjust a whole block of tokens when it encounters some kind of disjunction.
Think of the old example where an auto-regressive model would output: "There are 2 possibilities.." before it really enumerated them. Often the model has trouble overcoming the bias and will hallucinate a response to fit the preceding tokens.
Chain of thought and other approaches help overcome this and other issues by incentivizing validation, etc.
With diffusion, however, it is easier for the rest of the generated answer to change that set of tokens to match the actual number of possibilities enumerated.
This is why I think you'll see diffusion models be able to do some more advanced problem solving with a smaller number of "thinking" tokens.
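Roughly, the difference is in the decoding loop. Here is a purely illustrative toy (not any real model; toy_predict is just a random stand-in to show the control flow) contrasting committing tokens once, left to right, with filling and revising a whole block over several passes:

    import random

    random.seed(0)
    VOCAB = ["there", "are", "2", "3", "possibilities", ":", "a", "b", "c", "."]

    def toy_predict(position, context):
        # Stand-in for a real model: returns a (token, confidence) pair at random.
        # A real denoiser would condition on the context.
        return random.choice(VOCAB), random.random()

    def autoregressive(n_tokens):
        # Each token is committed once and never revisited.
        seq = []
        for i in range(n_tokens):
            tok, _ = toy_predict(i, seq)
            seq.append(tok)  # "There are 2 possibilities" can't be fixed later
        return seq

    def diffusion_style(n_tokens, passes=4, remask_frac=0.3):
        # Start fully masked, fill in over several passes, and re-mask the
        # least confident positions so earlier choices can still be revised.
        seq, conf = ["[MASK]"] * n_tokens, [0.0] * n_tokens
        for p in range(passes):
            for i, tok in enumerate(seq):
                if tok == "[MASK]":
                    seq[i], conf[i] = toy_predict(i, seq)
            if p < passes - 1:  # keep the final pass fully filled
                k = int(remask_frac * n_tokens)
                for i in sorted(range(n_tokens), key=lambda j: conf[j])[:k]:
                    seq[i], conf[i] = "[MASK]", 0.0
        return seq

    print(autoregressive(8))
    print(diffusion_style(8))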
pama
6 days ago
Unfortunately the intuition and the math proofs so far suggest that autoregressive training learns the joint distribution of probabilistic streams of tokens much better than diffusion models do or will ever do. My intuitive take is that the conditional probability distribution of decoder-only autoregressive models is at just the right level of complexity for probabilistic models to learn accurately enough.

Intuitively (and simplifying things at the risk of breaking rigor), diffusion (or masked) models have to occasionally issue tokens with less information and thus higher variance than a pure autoregressive model would have to, so the joint distribution, i.e. the probability of the whole sentence/answer, will be lower, and thus diffusion models will never get precise enough. Of course, during generation the sampling technique influences the above simplified idea dramatically, and the typical randomized sampling for next-token prediction is suboptimal and could in principle be beaten by a carefully designed block diffusion sampler in some contexts, though I haven't seen real examples of it yet.

But the key ideas of the above scribbles are still valid: autoregressive models will always be better (or at least equal) probabilistic models of sequential data than diffusion models. So diffusion models mostly offer a tradeoff of performance vs quality. Sometimes there is a lot of room for that tradeoff in practice.
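Schematically (glossing over the exact form of the bound, which depends on the noise schedule), the objectives being compared are:

    % Autoregressive: exact log-likelihood via the chain rule
    \log p_\theta(x_{1:N}) = \sum_{t=1}^{N} \log p_\theta(x_t \mid x_{<t})

    % Masked/diffusion LM: a variational lower bound (ELBO) on the same quantity,
    % so training optimizes a bound rather than the exact joint likelihood
    \log p_\theta(x_{1:N}) \ge \mathbb{E}_{q}\!\left[ \sum_{t \in \text{masked}} w_t \, \log p_\theta(x_t \mid x_{\text{unmasked}}) \right]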
niemandhier
6 days ago
This is tremendously interesting!
Could you point me to some literature? Especially regarding mathematical proofs of your intuition?
I’d like to recalibrate my priors to align better with current research results.
GistNoesis
6 days ago
From the mathematical point of view the literature is about the distinction between a "filtering" distribution and a "smoothing" distribution. The smoothing distribution is strictly more powerful.
In theory, intuitively, the smoothing distribution has access to all the information that the filtering distribution has, plus some additional information, and can therefore reach a loss minimum at least as low as the filtering distribution's.
In practice, because the smoothing input space is much bigger, keeping the same number of parameters we may not reach a better score: with diffusion we are tackling a much harder problem (the whole problem), whereas with autoregressive models we are taking a shortcut, one that humans are probably biased toward too (communication evolved so that it can be serialized and exchanged orally).
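In symbols (just sketching the distinction, with x_{1:N} the token sequence):

    % Filtering: predict a token from the past only (what autoregressive training targets)
    p(x_t \mid x_{1:t-1})

    % Smoothing: predict a token from both past and future context (what masked/diffusion training targets)
    p(x_t \mid x_{1:t-1}, x_{t+1:N})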
pama
6 days ago
Although what you say about smoothing vs filtering is true in principle, for conditional generation of the eventual joint distribution starting from the same condition and using an autoregressive vs a diffusive LLM, it is the smoothing distribution that has less power. In other words, during inference starting from J given tokens, writing token number K is of course better with diffusion if you also have some given tokens after token K and up to the maximal token N. However, if your input is fixed (tokens up to J) and you have to predict all the additional tokens (from J+1 to N), you are solving a harder problem and end up with a lower joint probability for the full generated sequence from J+1 up to N.
pama
6 days ago
I am still jetlagged and not sure what the most helpful reference would be. Maybe start from the block diffusion paper I recommended in a parallel thread and trace your way up/down from there. The logic leading to Eq 6 is a special case of such a math proof.
kmacdough
6 days ago
What are the barriers to mixed-architecture models? Models that could seamlessly pass from autoregressive to diffusion, etc.
Humans can integrate multiple sensory processing centers and multiple modes of thought all at once. It's baked into our training process (life).
pama
6 days ago
The human processing is still autoregressive, but using multiple parallel synchronized streams. There is no problem with such an approach and my best guess is that in the next year we will see many teams training models using such tricks for generating reasoning traces in parallel.
The main concern is taking a single probabilistic stream (eg a book) and comparing autoregressive modelling of it with a diffusive modelling of it.
Regarding mixing diffusion and autoregressive—I was at ICLR last week and this work is probably relevant: https://openreview.net/forum?id=tyEyYT267x
cchance
6 days ago
Maybe diffusion for "thoughts" and autoregressive for output :S
efavdb
6 days ago
Suggests an opportunity for hybrids, where the diffusion model might be responsible for the large-scale structure of the response and the next-token model for filling in details. Sort of like a multi-scale model in dynamics simulations.
AlexCoventry
6 days ago
> it can adjust a whole block of tokens when it encounters some kind of disjunction.
This is true in principle for general diffusion models, but I don't think it's true for the noise model they use in Mercury (at least, going by a couple of academic papers authored by the Inception co-founders.) Their model generates noise by masking a token, and once it's masked, it stays masked. So the reverse-diffusion gets to decide on the contents of a masked token once, and after that it's fixed.
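A minimal sketch of that absorbing-state (masking) kernel, with the reverse pass simplified to a single fill (real samplers unmask over many steps, but the point stands: once a token is revealed it is never revised):

    import random

    random.seed(0)
    MASK = "[MASK]"

    def forward_mask(tokens, n_steps=4, mask_prob=0.3):
        # Forward (noising) process: each step, every still-visible token is
        # masked with some probability, and once masked it stays masked.
        seq = list(tokens)
        for _ in range(n_steps):
            seq = [MASK if tok != MASK and random.random() < mask_prob else tok
                   for tok in seq]
        return seq

    def reverse_fill(seq, predict):
        # Reverse (denoising) process: each masked position is predicted once,
        # conditioned on the visible tokens, and then fixed.
        out = list(seq)
        for i, tok in enumerate(out):
            if tok == MASK:
                out[i] = predict(i, out)
        return out

    sentence = "the cat sat on the mat".split()
    noised = forward_mask(sentence)
    print(noised)
    print(reverse_fill(noised, lambda i, ctx: "<filled>"))  # stand-in predictor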
freeqaz
6 days ago
Here are two papers linked from Inception's site:
1. Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution - https://arxiv.org/abs/2310.16834
2. Simple and Effective Masked Diffusion Language Models - https://arxiv.org/abs/2406.07524
AlexCoventry
6 days ago
Thanks, yes, I was thinking specifically of "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution". They actually consider two noise distributions: one with uniform sampling for each noised token position, and one with a terminal masking (the Q^{uniform} and Q^{absorb}). However, the terminal-masking system is clearly superior in their benchmarks.
macleginn
6 days ago
The exact types of path dependencies in inference on text-diffusion models look like an interesting research project.
AlexCoventry
6 days ago
Yes, the problem is coming up with a noise model where reverse diffusion is tractable.
XenophileJKO
6 days ago
Thank you, I'll have to read the papers. I don't think I have read theirs.
fizx
6 days ago
Once that auto-regressive model goes deep enough (or uses "reasoning"), it actually has modeled what possibilities exist by the time it's said "There are 2 possibilities.."
We're long past that point of model complexity.
klipt
6 days ago
But as everyone knows, computer science has two hard problems: naming things, cache invalidation, and off by one errors.
tyre
6 days ago
Check out RooCode if you haven’t. There’s an orchestrator mode that can start with a big model to come up with a plan and break down, then spin out small tasks to smaller models for scoped implementation.
danenania
6 days ago
If you’re open to a terminal-based approach, this is exactly what my project Plandex[1] focuses on—breaking up and completing large tasks step by step.
amelius
6 days ago
Wouldn't it be possible to trade speed back for accuracy, e.g. by asking the model to look at a problem from different angles, letting it criticize its own output, etc.?
nowittyusername
6 days ago
Just have it sample for longer or create a simple workflow that uses a Monte Carlo tree search approach. I don't see why this won't improve accuracy. I would love to see someone run tests to see how accurate the model is compared to similar-parameter models in a per-time-block benchmark. Like if it can get the same accuracy as a similar-parameter autoregressive model but at half the speed, you already have a winner, besides the other advantages of a diffusion-based model.
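The simplest version of spending speed on accuracy is just drawing more samples. A rough sketch (generate and score are hypothetical callables, not any real API):

    from collections import Counter

    def best_of_n(prompt, generate, score, n=16):
        # Draw n candidates from the fast model and keep the one a scorer
        # (unit tests, a verifier model, etc.) likes best.
        candidates = [generate(prompt) for _ in range(n)]
        return max(candidates, key=score)

    def self_consistency(prompt, generate, n=16):
        # Alternative: majority vote over n sampled answers.
        answers = Counter(generate(prompt) for _ in range(n))
        return answers.most_common(1)[0][0]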
kadushka
6 days ago
The AI field desperately needs smarter models - not faster ones.
geysersam
6 days ago
Definitely needs faster and cheaper models. Fast and cheap models could replace software in tons of situations. Imagine a vending machine or a mobile game or a word processor where basically all the logic is implemented as a prompt to an LLM. It would serve as the ultimate high-level programming language.
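Something like this, hypothetically (call_llm is a placeholder, not a real client library; the point is just that the "business logic" lives in the prompt):

    # Hypothetical sketch of "all the logic is a prompt".
    SYSTEM_PROMPT = (
        "You operate a vending machine. Inventory and prices (in cents): {inventory}. "
        "Only dispense an item if the inserted amount covers its price; otherwise ask "
        "for more money. Reply with exactly one line: DISPENSE <item>, RETURN <amount>, "
        "or ASK <message>."
    )

    def vend(inventory, user_message, inserted_cents, call_llm):
        prompt = SYSTEM_PROMPT.format(inventory=inventory)
        reply = call_llm(system=prompt,
                         user=f"Inserted: {inserted_cents} cents. Customer says: {user_message}")
        return reply  # e.g. "DISPENSE sprite" -- the machine hardware parses and acts on this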
janalsncm
6 days ago
I think natural language to code is the right abstraction. Easy enough barrier to entry but still debuggable. Debugging why an LLM randomly gives you Mountain Dew instead of Sprite if you have a southern accent sounds like a nightmare.
geysersam
6 days ago
I'm not sure it would be that hard to debug. Make sure you can reproduce the LLM state (by storing the random seed for the session, or something like that) and then ask it "why did you just now give that customer mountain dew when they ordered sprite?"
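A rough sketch of what "reproduce the LLM state" could look like in practice (call_llm is a hypothetical placeholder; the idea is just to pin the sampling parameters and log enough to replay the session):

    import json, random, time

    def logged_call(messages, call_llm, log_path="llm_audit.jsonl"):
        # Pin a seed and temperature for each call and log everything needed
        # to replay the exact exchange later when debugging.
        seed = random.randrange(2**32)
        reply = call_llm(messages=messages, seed=seed, temperature=0)
        with open(log_path, "a") as f:
            f.write(json.dumps({"ts": time.time(), "seed": seed,
                                "messages": messages, "reply": reply}) + "\n")
        return reply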
otabdeveloper4
6 days ago
> and then ask it "why did you just now give that customer mountain dew when they ordered sprite?"
Worse than useless for debugging.
An LLM can't think and doesn't have capabilities for self-reflection.
It will just generate a plausible stream of tokens in reply that may or may not correspond to the real reason why.
geysersam
6 days ago
Of course an LLM can't think. But that doesn't mean it can't answer simple questions about the output that was produced. Just try it out with ChatGPT when you have time. Even if it's not perfectly accurate it's still useful for debugging.
Just think of it as a human employee. Can they always say why they did what they did? Often, but not always. Sometimes you will have to work to figure out the misunderstanding.
otabdeveloper4
6 days ago
> it's still useful for debugging
How so? What the LLM says is whatever is more likely given the context. It has no relation to the underlying reality whatsoever.
geysersam
6 days ago
Not sure what you mean by "relation to the underlying reality". The explanation is likely to be correlated with the underlying reason for the answer.
For example, here is a simple query:
> I put my bike in the bike stand, I removed my bike from the bike stand and biked to the beach, then I biked home, where is my bike. Answer only with the answer and no explanation
> Chatgpt: At home
> Why did you answer that?
> I answered that your bike is at home because the last action you described was biking home, which implies you took the bike with you and ended your journey there. Therefore, the bike would logically be at home now.
Do you doubt that the answer would change if I changed the query to make the final destination be "the park" instead of "home"? If you don't doubt that, what do you mean that the answer doesn't correspond to the underlying reality? The reality is the answer depends on the final destination mentioned, and that's also the explanation given by the LLM, clearly the reality and the answers are related.
janalsncm
6 days ago
You need to find an example of the LLM making a mistake. In your example, ChatGPT answered correctly. There are many examples online of LLMs answering basic questions incorrectly, and then the person asking the LLM why it did so. The LLM response is usually nonsense.
Then there is the question of what you would do with its response. It's not like code where you can go in and update the logic. There are billions of floating point numbers. If you actually want to update the weights, you'll quickly find yourself fine-tuning the monstrosity. Orders of magnitude more work than updating an "if" statement.
geysersam
5 days ago
I don't think LLMs can always give correct explanations for their answers. That's a misunderstanding.
> Then there is the question of what you would do with its response.
Sure but that's a separate question. I'd say the first course of action would be to edit the prompt. If you have to resort to fine tuning I'd say the approach has failed and the tool was insufficient for the task.
janalsncm
5 days ago
It’s not really a separate question imo. We want to know whether computer code or prompts are better for programming things like vending machines.
For LLMs, interpretability is one problem. The ability to effectively apply fixes is another. If we are talking about business logic, have the LLM write code for it and don’t tie yourself in knots begging the LLM to do things correctly.
There is a grey area though, which is where code sucks and statistical models shine. If your task was to differentiate between a cat and a dog visually, good luck writing code for that. But neural nets do that for breakfast. It’s all about using the right tool for the job.
otabdeveloper4
5 days ago
> The explanation is likely to be correlated with the underlying reason for the answer.
No it isn't. You misunderstand how LLMs work. They're giant Mad Libs machines: given these surrounding words, fill in this blank with whatever statistically is most likely. LLMs don't model reality in any way.
geysersam
5 days ago
Did you read the example above? Do you disagree that the LLM provided a correct explanation for the reason it answered as it did?
> They're giant Mad Libs machines: given these surrounding words, fill in this blank with whatever statistically is most likely. LLMs don't model reality in any way.
Not sure why you think this is incompatible with the statement you disagreed with.
otabdeveloper4
5 days ago
> Do you disagree that the LLM provided a correct explanation for the reason it answered as it did?
Yes, I do. An LLM replies with the most likely string of tokens. Which may or may not correspond with the correct or reasonable string of tokens, depending on how the stars align. In this case the statistically most likely explanation the LLM replied with just happened to correspond with the correct one.
geysersam
5 days ago
> In this case the statistically most likely explanation the LLM replied with just happened to correspond with the correct one.
I claim that case is not as uncommon as people in this thread seem to think.
janalsncm
6 days ago
Why not just store the state in the code and debug as usual, perhaps with LLM assistance? At least that’s tractable.
suddenlybananas
6 days ago
Why on earth would you implement a vending machine using an LLM?
K0balt
6 days ago
The same reason we make the butter dish suffer from existential angst.
jacob019
6 days ago
Because it's easy and cheap. Like how many products use a Raspberry Pi or ESP32 when an ATtiny would do.
kadushka
6 days ago
How in the world is this easy and cheap? Are you planning to run this LLM inside the vending machine? Or are you planning to send those prompts to a remote LLM somewhere?
geysersam
6 days ago
The premise here is that the model runs fast and cheap. With the current state of the technology running a vending machine using an LLM is of course absurd. The point is that accuracy is not the only dimension that brings qualitative change to the kind of applications that LLMs are useful for.
kadushka
6 days ago
Running a vending machine using an LLM is absurd not because we can't run LLMs fast or cheap enough - it's because LLMs are not reliable, and we don't know yet how to make them more reliable. Our best LLM - o3 - doubled the previous model's (o1) hallucination rate. OpenAI says it hallucinated a wrong answer 33% of the time in benchmarks. Do you want a vending machine that screws up 33% of the time?
Today, the accuracy of LLMs is by far a bigger concern (and a harder problem to solve) than its speed. If someone releases a model which is 10x slower than o3, but is 20% better in terms of accuracy, reliability, or some other metric of its output quality, I'd switch to it in a heartbeat (and I'd be ready to pay more for it). I can't wait until o3-pro is released.
geysersam
5 days ago
Do you seriously think a typical contemporary LLM would screw up 33% of vending machine orders?
I don't know what benchmark you're looking at but I'm sure the questions in it were more complicated than the logic inside a vending machine.
Why don't you just try it out? It's easy to simulate: just tell the bot about the task and explain to it what actions to perform in different situations, then provide some user input and see if it works or not.
K0balt
6 days ago
You could run a 3B model on 200 dollars worth of hardware and it would do just fine, 100 percent of the time, most of the time. I could definitely see someone talking it out of a free coke now and then though.
With vending machines costing 2-5k, it’s not out of the question, but it’s hard to imagine the business case for it. Maybe the tantalizing possibility of getting a free soda would attract traffic and result in additional sales from frustrated grifters? Idk.
Wazako
6 days ago
Yet DeepSeek has shown that more dialogue increases quality. Increasing speed is therefore important if you need thinking models.
guiriduro
6 days ago
If you have much more speed in the available time, for an activity like coding, you could use that for iteration, writing more tests and satisfying them, especially if you can pair that with a concurrent test runner to provide feedback. I'm not sure the end result would be lower-scoring or less smart than what an LLM could achieve in the same duration.
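A sketch of that loop (generate_patch and apply_patch are hypothetical; only the pytest invocation is real):

    import subprocess

    def iterate_until_green(task, generate_patch, apply_patch, max_iters=10):
        # Spend the speed budget on iteration: ask the fast model for a patch,
        # run the test suite, and feed the failures back until it passes.
        feedback = ""
        for _ in range(max_iters):
            patch = generate_patch(task, feedback)   # hypothetical model call
            apply_patch(patch)                       # hypothetical: writes the files
            result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
            if result.returncode == 0:
                return True
            feedback = result.stdout[-4000:]         # last chunk of failure output
        return False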
kadushka
6 days ago
> I'm not sure the end result would be lower-scoring or less smart than what an LLM could achieve in the same duration.
It probably wouldn’t with current models. That’s exactly why I said we need smarter models - not more speed. Unless you want to “use that for iteration, writing more tests and satisfying them, especially if you can pair that with a concurrent test runner to provide feedback.” - I personally don’t.
otabdeveloper4
6 days ago
LLMs can't think, so "smarter" is not possible.
IshKebab
6 days ago
They can by the normal English definitions of "think" and "smart". You're just redefining those words to exclude AI because you feel threatened by it. It's tedious.
otabdeveloper4
6 days ago
Incorrect. LLMs have no self-reflection capability. That's a key prerequisite for "thinking". ("I think, therefore I am.")
They are simple calculators that answer with whatever tokens are most likely given the context. If you want reasonable or correct answers (rather than the most likely) then you're out of luck.
IshKebab
6 days ago
It is not a key prerequisite for "thinking". It's "I think therefore I am" not "I am self-aware therefore I think".
In the 90s if your cursor turned into an hourglass and someone said "it's thinking" would you have pedantically said "NO! It is merely calculating!"
Maybe you would... but normal people with normal English would not.
otabdeveloper4
5 days ago
Self-reflection is not the same thing as self-awareness.
Computers have self-reflection to a degree - e.g., they react to malfunctions and can evaluate their own behavior. LLMs can't do this; in this respect they are even less of a thinking machine than plain old dumb software.
baq
6 days ago
Technically correct and completely beside the point.
K0balt
6 days ago
People can't fly.
jillesvangurp
6 days ago
I think speed and convenience are essential. I use ChatGPT desktop for coding. Not because it's the best but because it's fast and easy and doesn't interrupt my flow too much. I mostly stick to the 4o model. I only use the o3 model when I really have to, because at that point getting an answer is slooooow. 4o is more than good enough most of the time.
And more importantly it's a simple option+shift+1 away. I simply type something like "fix that" and it has all the context it needs to do its thing. Because it connects to my IDE and sees my open editor and the highlighted line of code that is bothering me. If I don't like the answer, I might escalate to o3 sometimes. Other models might be better but they don't have the same UX. Claude desktop is pretty terrible, for example. I'm sure the model is great. But if I have to spoon feed it everything it's going to annoy me.
What I'd love is for smaller faster models to be used by default and for them to escalate to slower more capable models on a need to have basis only. Using something like o3 by default makes no sense. I don't want to have to think about which model is optimal for what question. The problem of figuring out what model is best to use is a much simpler one than answering my questions. And automating that decision opens the doors to having a multitude of specialized models.
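A sketch of the escalation that last paragraph describes (all three callables are placeholders, not a real API):

    def answer(question, fast_model, strong_model, needs_escalation):
        # Try the small fast model first; only fall back to the slow strong
        # model when a cheap check says the draft isn't good enough.
        draft = fast_model(question)
        if needs_escalation(question, draft):
            return strong_model(question)
        return draft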
matznerd
6 days ago
You're missing that Claude desktop has MCP servers, which can extend it to do a lot more, including much better real life "out of the box" uses. You can do things like use Obsidian as a filesystem or connect to local databases to really extend the abilities. You can also read and write to github directly and bring in all sorts of other tools.
kazinator
6 days ago
> Not sure if I would trade off speed for accuracy.
Are you, though?
There are obvious examples of obtaining speed without losing accuracy, like using a faster processor with bigger caches, or more processors.
Or optimizing something without changing semantics, or the safety profile.
Slow can be unreliable; a 10-gigabit Ethernet link can be more reliable than a 110-baud acoustically coupled modem in mean time between accidental bit flips.
Here, the technique is different, so it is apples to oranges.
Could you tune the LLM paradigm so that it gets the same speed, and how accurate would it be?
otabdeveloper4
6 days ago
> I now realize I should have done things incrementally even when I have a pretty good idea of the final picture.
Or just save yourself the time and money and code it yourself like it's 2020.
(Unless it's your employer paying for this waste, in which case go for it, I guess.)
cedws
6 days ago
You left an LLM to code for two hours and then were surprised when you had to spend a significant amount of time more cleaning up after it?
Is this really what people are doing these days?
thomastjeffery
6 days ago
Accuracy is a myth.
These models do not reason. They do not calculate. They perform no objectivity whatsoever.
Instead, these models show us what is most statistically familiar. The result is usually objectively sound, or at least close enough that we can rewrite it as something that is.
kittikitti
6 days ago
I don't use the best available models for prototyping because they can be expensive or more time-consuming. This innovation makes prototyping faster, and practicing prompts on slightly lower-accuracy models can provide more realistic expectations.
g-mork
6 days ago
The excitement for me is the implications for lower-energy models. Tech like this could thoroughly break the Nvidia stranglehold, at least for some segments.