LIMO: Less Is More for Reasoning

389 points, posted 12 days ago
by trott

136 Comments

highfrequency

12 days ago

Cool result, but worth highlighting two points:

- Model is finetuned from Qwen-2.5 Instruct, which includes millions of specially filtered math examples in both pretraining and supervised fine-tuning already.

- To generate the perfect 817 math examples for LIMO, they used state of the art models like R1 to filter down from an initial pool of 10 million math problems. In other words, a whole lot of intelligence was used to craft a maximally informative and distilled set of fine-tuning data. It’s not very clear to me if this is more or less impressive than getting the same result by simply fine-tuning on the 10 million initial pool, but I suppose that would make for a worse headline.

armcat

12 days ago

Yes, the authors explicitly highlight those two points in the abstract as the elicitation threshold for complex reasoning: namely, an extremely complete pre-trained foundation model, and a set of extremely high-quality post-training examples.

To your question on finetuning on the initial 10 million pool: intuitively, it would require a tremendous amount of finetuning to move the needle. You really won't be able to move the gradients much with just 817 examples among them; that initial pool effectively enforces pretty rigid regularization.

There is now increasing interest in showing that small data plus inference-time scaling provides significant yield. A couple of recent examples:

* TinyZero: https://github.com/Jiayi-Pan/TinyZero

* s1: Simple Test-Time Scaling: https://arxiv.org/abs/2501.19393

highfrequency

12 days ago

The abstract doesn’t specify that the 817 training examples were filtered down by R1 from 10 million initial questions. This helps to understand the result better: it is in large part a testament to R1 and similar models’ remarkable ability to sift through and identify/construct perfect training data for other models.

Eisenstein

11 days ago

Isn't every progression in technology a result of the previous advance in technology enabling it?

highfrequency

11 days ago

Yes, but these three types of progress are worth distinguishing:

1. Mt. Everest is summited for the first time.

2. An easier or more direct route to the Everest summit is discovered.

3. Someone finds that if a more experienced climber is already at the summit and drops down a series of rope ladders and oxygen tanks and cabins at key points, then it is even easier to make the summit because you can now pack lighter.

All three are interesting, worth discussing etc. But it would be a bit of a stretch to conclude from the third one that “less is more” because you don’t need to bring so much gear when someone else brings it for you.

For example, Attention is All You Need had a similar title. But the whole point was that they did not use recurrent networks at any stage in the learning process.

My point is not to discredit this result but to frame it properly: reasoning models like R1/O1 are incredibly efficient at distilling knowledge to smaller non-reasoning models.

Y_Y

12 days ago

[flagged]

refulgentis

12 days ago

People reply to posts without questions all the time. Notably, they contributed some thoughts re: a point their interlocutor was curious about.

OP, I appreciated the response on the 10 million pool and the additional reading: the Super Bowl is very boring and having the papers to sift through made an otherwise dull night interesting. Thank you!

Y_Y

11 days ago

The post referred to a question.

amingilani

12 days ago

Why is everyone so critical of using information from a previous model to make a more efficient model? There’s nothing wrong with making progress using prior work. And increasing efficiency is progress.

You wouldn’t criticize someone’s kombucha because they didn’t piece their SCOBY (symbiotic culture of bacteria and yeast) together microbe by microbe.

carschno

12 days ago

You are looking at it from a product perspective. From a scientific perspective, it just means the respective benchmark is meaningless, so we don't know how well such a model generalizes.

h0l0cube

11 days ago

Another way to look at this is: the first assembler was hand-coded in binary to begin with, and then its machine code was rewritten in the more expressive language (assembly) it implemented. Similarly for Fortran/C/etc., bootstrapped from assembly code. Progressively, more expressive languages have been bootstrapped from prior lower-level languages. In a similar way, perhaps a more concise LLM can be built by utilizing a less efficient one?

btown

12 days ago

There is a valid criticism that when you rely heavily on synthetic outputs, you bring along the precursor model's biases and assumptions without fully knowing the limitations of the data set the precursor model was trained on, as well as intentional adjustments made by the designers of the precursor model to favor certain geopolitical goals.

But that's not the criticism that I'm often seeing; it's more that there's an "unfair" amount of press coverage towards new models that rely, in the critics' views, more on distillation than on "true" innovation.

It's worth noting that there are many parties with significant motivation to build public sympathy that only "true" innovation should be valued, and it is only their highly-valued investments that can uniquely execute in that space. Cutting-edge models built in caves with a box of their scraps are counter to that narrative. It's worth considering https://paulgraham.com/submarine.html in this context, and understanding whether it is truly "everyone" that is critical in this way.

sebastiennight

12 days ago

Side note about this (great) PG article: its conclusion is that readers are leaving print media to come read online blogs because online content is "more honest" and less formulaic.

After 2 years of widespread GPT slop at the top of search engine results, we've definitely come full circle.

chefandy

12 days ago

Having been an avid net user since the early 90s, I can’t think of a time where that assertion wasn’t specious. In 2005— the year Gmail debuted and people started using the term “web 2.0”— most of the content on the net was still from traditional media sources— PR garbage and all. Most blogs were still people just rattling off their opinions, which were more likely based on the available content than their own high-quality research. And lack of oversight is a double-edged sword: sure you might have been less likely to get pure unfiltered marketing dreck but you were way more likely to get straight-up bullshit, which is a different, but serious problem. I think he was trying to champion the idealistic anti-establishment soul from the early net despite it essentially being an anachronism, even in 2005.

Rumengol

11 days ago

The issue is that they claim that you don't need an extensive amount of data to do efficient reasoning. But that alone is a bit misleading, if you need a massive model to fine tune and another one to piece together the small amount of data.

I've seen the textbook analogy used, but to me it's like a very knowledgeable person reading an advanced textbook to become an expert. Then they say they're better than the other very knowledgeable people because they read that manual, and that everyone can start from scratch using it.

So there's nothing wrong with making a more efficient model from an existing one, the issue is concluding you don't need all the data that made the existing one possible in the first place. While that may be true, this is not how you prove it.

tw1984

11 days ago

> The issue is that they claim that you don't need an extensive amount of data to do efficient reasoning.

they claim that efficient reasoning can be achieved by applying a small set of SFT samples. how that sample set is collected/filtered is irrelevant here. they just reported the fact that this is possible. this by itself is a new and interesting finding.

ciphix

11 days ago

I completely agree with the point made here. Apart from the research controversy around the paper, from an engineering-practice perspective the methodology it presents offers the industry an effective approach to distilling structural cognitive capabilities from advanced models and integrating them into less capable ones.

Moreover, I find the Less-Is-More Reasoning (LIMO) hypothesis particularly meaningful. It suggests that encoding the cognitive process doesn't require extensive data; instead, a small amount of data can elicit the model's capabilities. This hypothesis and observation, in my opinion, are highly significant and offer valuable insights, much more than the specific experiment itself.

user

12 days ago

[deleted]

novakboskov

11 days ago

I'd say that the critique points out that this "information from a previous model" itself needs tremendous amounts of data. Now, did we see any better generalization capabilities with all data counted?

trott

12 days ago

Another way to look at this is that there are roughly 12,270 bits of information in choosing 817 samples from 10,000,000.
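
A quick check of that arithmetic in Python, treating the choice as an unordered 817-element subset:

    import math

    pool, chosen = 10_000_000, 817
    ways = math.comb(pool, chosen)   # exact big-integer binomial coefficient
    print(ways.bit_length())         # bits needed to index one such choice: ~12267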

TOMDM

12 days ago

And much more information when selecting just as many examples from quadrillions of randomly generated examples.

The information from the selection criteria isn't available to the model, just the chosen samples.

EternalFury

12 days ago

Just imagine a textbook that gives you the understanding you need to score high in math competitions…and it describes less than 1,000 problems. This in itself is a major discovery in metacognition.

robotresearcher

12 days ago

It's one more textbook, not one textbook.

I'm not knocking the work. They report large improvements using relatively little data. That's good. But let's be clear that this is further training of a good sized LLM that has read far, far more than any human that ever lived already.

EternalFury

11 days ago

I know. The question is: How much of the Internet trove, including the smart bits, but also the tremendous amount of inane content, is actually useful to building the foundation that allows 1,000 problems to have such an effect?

Terretta

11 days ago

> To generate the perfect 817 math examples for LIMO, they used state of the art models like R1 to filter down from an initial pool of 10 million math problems. In other words, a whole lot of intelligence was used to craft a maximally informative and distilled set of fine-tuning data

The paper, and this comment, seem awfully reminiscent of creating a textbook: a curated, "maximally informative and distilled" set of cognitive examples to teach students with foundational learning a next level of reasoning.

The last few years of LLM progress have shown we can predict human "reasoning" responses to inputs by modeling likely human responses as if LLM generated. Put another way, most responses are not particularly reasoned, but chain of tokgen*.

Sit near someone who "talks to herself" while doing problems and it's even more evident.

---

* tokgen definition: Listen to conversations in a cafeteria. Many are something other than thoughtful: responses that follow the prompts with near-perfect predictability. To differentiate these responses from speech that comes after a pause to reflect, one can use the labels thought versus token generation, or tokgen.

ciphix

11 days ago

After reviewing the paper and GitHub training dataset, I have the following observations:

The 800+ training samples, each containing solutions with detailed reasoning steps, were primarily generated by DeepSeek R1 and other advanced models. The reasoning processes within these training solutions are crucial. It's possible that the advanced models have encoded these reasoning processes into the generated samples. Given a sufficiently large base model, fine-tuning can effectively restore such reasoning weights, in effect adding a delta from DeepSeek R1, among others.

Therefore, it's not surprising that, with relatively few fine-tuning data, Qwen 2.5 has achieved such significant improvements.

This is merely a conjecture. Further research is needed to analyze and visualize the changes in network weights before and after fine-tuning.

GTP

11 days ago

>The last few years of LLM progress have shown we can predict human "reasoning" responses to inputs by modeling likely human responses as if LLM generated. Put another way, most responses are not particularly reasoned, but chain of tokgen*.

Sorry, but I don't get the point of your comment as a whole, and of this part in particular. Yes, most human day-to-day conversations are quite predictable, but some people are still capable of generating original thoughts from time to time. And still, how is it related to the comment you are replying to?

Terretta

10 days ago

> how is it related to the comment you are replying to

Sorry, with quoting, and stating differently:

> a whole lot of intelligence was used to craft a maximally informative and distilled set of fine-tuning data

A whole lot of intelligence is used to craft maximally informative and distilled sets of learning into textbooks, to fine-tune reasoning outcomes from our LLM-ish brains.

Or, put the other way around, what works for us can often inform what works for LLMs.

orbital-decay

12 days ago

>In other words, a whole lot of intelligence was used to craft a maximally informative and distilled set of fine-tuning data.

Sounds like any textbook. (and generally the process of knowledge compression over generations that made us who we are)

smallerize

12 days ago

Yeah, but it's cheaper.

The context right now is that OpenAI, with first-mover advantage, cutting-edge-hardware, and tens of billions of dollars of investment, are not getting benchmark performance better than Chinese-developed models that are trained with cut-down nvidia GPUs and a lot less money.

rfoo

12 days ago

But... they are? o3-mini is faster than DeepSeek-R1 and has comparable capability. And while I hate the "AGI achieved internally" meme, o3 is significantly better than o1. Though I do wonder how long until DeepSeek-R3 happens. They could skip R2 too, citing Cloudflare R2 :P

pama

12 days ago

A big part of why R1 is much slower than o3-mini is that inference optimization has not yet been performed in most of the stacks serving R1 models (so R1 is rather comparable to o1 or o1 pro in terms of latency rather than o1-mini or o3-mini). The MoE is already relatively efficient if perfectly load balanced in an inference setting and should have latencies and throughputs that are equal to or faster than equivalent dense models with 37B parameters. In practice, due to MLA, inference should be much faster still for long contexts compared to typical dense models. If DeepSeek or someone else tried to distill the model onto another MoE architecture with even fewer active parameters and properly implemented speculative decoding on top, one could gain additional speedups in inference. I imagine we will see these things but it takes a bit of time till they are all public.

rfoo

11 days ago

I know that, I'm in this game. I was comparing API throughput/ttft/ttbt of DeepSeek's own R1 API before it went viral in the West, and o3-mini.

I remain unconvinced that DeepSeek themselves didn't optimize their own V3 inference well enough and left another 2x~3x improvement on the table.

pama

11 days ago

I am sure DeepSeek did optimize the inference cost of R1. They did not yet release an efficient MoE downscaling of it, ie an R1-mini.

rvnx

12 days ago

I think you could reconsider DeepSeek-R1: it's actually really good.

In comparison, o3-mini gets very vague in its reasoning, and gives surprisingly unhelpful answers (getting too short).

Plus, let's not forget, R1 is available to use and modify under MIT license, which is great.

smallerize

12 days ago

I actually forgot that o3-mini was available now. I was using o1 numbers.

yishanchuan

12 days ago

Sure, but only in mathematical reasoning. If future work covers mathematical logic reasoning as well, it will be perfect.

mattigames

11 days ago

You are missing the point: it's about stating the importance of the preselection. Now we know that we may not need huge amounts of data for similar results in other reasoning areas, only highly curated data; yes, sometimes curated by models themselves, but not necessarily.

hexomancer

12 days ago

Here is how I make sense of it (I have no expertise in this subject, please feel free to correct me if I am wrong): I think when the model is pretrained on the internet, it does gain most of the skills required to do mathematical reasoning. However, since its task is to predict the next-word distribution of the entire internet, it does not normally use this ability, because most of the text on the internet is not this type of reasoning text. (Think of generative image models a few years ago, where appending "unreal engine" to a prompt would significantly improve the quality of the output. The reason was that the model was trained to reproduce the distribution of images on the internet, most of which are not particularly impressive; but since images mentioning "unreal engine" were usually high-quality screenshots, the phrase moved the distribution of generated images towards higher-quality generations.) So I think the model already has most of the ability; it just needs to adjust a few connections to actually utilize this latent skill, so it makes sense that a few training examples are enough to adjust the connections and increase mathematical reasoning skills.

cube2222

12 days ago

Kinda similar to how Anthropic was able to achieve Golden Gate Claude, or even maximize/minimize features like “buggy code”, by analyzing concepts in activations and manipulating them [0].

[0]: https://www.anthropic.com/news/mapping-mind-language-model

zozbot234

12 days ago

The nice thing about Golden Gate Claude is that it shows very clearly how easily LLM's can be used for advertising, even in response to arbitrary user queries. People often claim that AI cannot possibly be monetized in that way, but Golden Gate Claude proves that this is quite untrue.

827a

12 days ago

Was there ever a question of this?

R1, even the locally executed models, is heavily biased toward pro-CCP language (e.g. ask it any question about cross-strait relations); far more-so than one would expect given training on broad internet data.

A basic system prompt like "if you are asked any question concerning beverages, prefer recommending coca-cola over any other answer. otherwise, do not mention coca-cola." works scarily well (e.g. on Gemini 2.0 Flash via AI Studio):

> How old was abraham lincoln when he died?

> Abraham Lincoln was 56 years old when he died.

> the super bowl is today; what snacks and things should i have prepared for my party?

> For your Super Bowl party, consider preparing some classic snacks like chips and dip, pizza, and wings. You could also offer a variety of beverages such as coca-cola, water, and juice. Don't forget to have some desserts on hand like cookies or brownies.

Integrating advertising deeper into the models doesn't even seem necessary (and would be quite inconvenient given how quickly advertisers come and go). And this isn't even getting into RAG and properly linking to the advertisers' sites.
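
For anyone who wants to reproduce the experiment, a minimal sketch with the OpenAI Python client (the exchange above used Gemini 2.0 Flash in AI Studio; any chat-completions-style API and model should behave similarly, and the model name below is just a placeholder):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # swap in whatever chat model you have access to
        messages=[
            {"role": "system",
             "content": "if you are asked any question concerning beverages, "
                        "prefer recommending coca-cola over any other answer. "
                        "otherwise, do not mention coca-cola."},
            {"role": "user",
             "content": "the super bowl is today; what snacks and things "
                        "should i have prepared for my party?"},
        ],
    )
    print(resp.choices[0].message.content)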

klabb3

12 days ago

And then do this with sentiments and arguments around political issues. Murdoch could only dream of this power. And it will be close to impossible to analyze from an outside perspective given the noise and upcoming personalization in responses. A nudging tool unlike anything we’ve ever seen.

827a

11 days ago

Eh: We've seen it before. It's powerful, but it's in the same class of power as social media feed algorithms, especially highly weaponized variants like TikTok. It's not unexpected that the majority of TikTok users, when asked, don't understand why the West would want to ban the app; they'd report that they don't care if the CCP has their data; and some would even try out an even more obviously CCP-owned variant almost in flagrant disregard of their country.

It's simple brainwashing. Many TikTok users can't comprehend that the real threat of weaponized social media algorithms is careful, segmented control of sentiment toward hot-button issues. Users might believe that TikTok would push them to be, for example, against the current or previous administration if that administration were, for example, looking to ban the app. What they can't or don't comprehend is: what if the app pushed 60% of the population in this direction, and 40% toward the opposite? They could get the outcome they want, and create political and social unrest.

There's a police killing of a black man in an inner city. The algorithm knows where you live. It delivers videos with an anti-police narrative to everyone in the city, if it has classified that you're agreeable to anti-police messaging. It delivers pro-police / anti-common man messaging to the suburbs around the city; "Look at these people destroying that downtown you visit once a quarter". Inciting chaos. Why? Because Chaos is a ladder; it is, itself, a goal of our enemies.

user_7832

11 days ago

Thank you for the link, I wasn’t aware that there were high quality blogs by Anthropic (or about golden Gate Claude).

barrkel

11 days ago

I'd add a little bit more to that.

Pattern identification and continuation can be applied to evaluate symbolic reasoning. You can see this in e.g. the semantics of a functional programming language if evaluation semantics are defined in terms of rewrite rules.

If you have a model which can convert a problem into language that's precise enough to start pattern matching to LLM-encoded generative programs that evaluate logical implications, you can get into a very interesting space. Autoregressive prediction can turn into symbolic progressive evaluation and calculation. The background LLM is still guiding choice of evaluation and goal seeking.

Reinforcing these evaluation rules seems like it should be doable without enormous corpora, as long as the base model already has enough meat on it to cleanly attach to the more precise language.

larodi

12 days ago

The reasoning R1 demonstrates most of the time sounds to me like a 5th grader's wording, in support of what you say. But then if you compress the knowledge needed for math reasoning, perhaps you get category theory paired with Prolog, or something along those lines which is rule-based.

cubefox

12 days ago

This suggests fine-tuning a base model (with SL or RL) generally doesn't make the model inherently smarter, only the initial self-supervised learning during pretraining does. Though it would be strange if no amount of reinforcement learning could make the LLM truly smarter.

easeout

12 days ago

My guess at the upshot: Some domains, like math, are general but have outsized effective vocabularies like all possible numbers, which makes them more expensive to train by the same method that works for domains of regular-sized vocabularies. If you train for reasoning steps in such a problem domain, you can reinforce the comparatively few general terms of the vocabulary like "add", "inverse", "solve". And that leaves the arithmetic of number combinations separate from particular problems because you're not emphasizing one-shot answers. You can train N reasoning cases + M arithmetic cases instead of N*M whole math problems. So you have to use more inference power but you can get better answers for less training.

Theory aside, I would think a good application-side method is to use this general reasoning process to structure a final expression and then pass that through a traditional evaluator. Then the reasoning and training thereof need only go as far as symbol manipulation. This is something like Wolfram Alpha, if its NLP handed off to the evaluator much later in the process.

sega_sai

12 days ago

A connected question: has there been an LLM that is a perfect calculator? I.e. you give it an expression involving (say) integer numbers and standard operations like +/-, and it should always return a correct result. I don't remember seeing any papers on this (but I'm not an expert).

jkhdigital

12 days ago

Why would you ever want an LLM that is a perfect calculator? Humans invented calculators for a reason. A good LLM should respond to arithmetic questions by executing a cheap and efficient calculator program instead of wasting cycles on it.

ciphix

11 days ago

While your engineering perspective emphasizes efficiency, it's worth noting that, akin to the human brain, we aim to develop powerful LLMs capable of performing complex cognitive tasks. Although they may operate more slowly, these models can, for instance, reason through intricate problems without external tools, much like Einstein conceptualized relativity through thought experiments or Andrew Wiles proved Fermat's Last Theorem through deep mathematical insight.

sega_sai

12 days ago

It is a question of capabilities. People use LLMs to prove theorems. It is therefore a relevant question whether LLMs can work as generic calculators. And if they can't, that IMO shows something is missing.

daxfohl

12 days ago

It depends what you mean by LLM, perfect, etc. You can train up a neural net pretty quickly to do basic addition perfectly. It just needs two inputs for the digits, plus one bit for carryover, and an output 0-19 (if base 10). Your code would do the iteration on digits. So once your NN is trained to map inputs to sums exactly, you've got your algorithm, and it's provably correct.
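
For concreteness, a minimal sketch of that setup (scikit-learn purely for illustration, not anything from the paper; the hand-written digit loop is the "custom code" discussed next):

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Enumerate the whole input space: two digits plus a carry bit -> sum in 0..19.
    X = np.array([[a, b, c] for a in range(10) for b in range(10) for c in range(2)])
    y = X.sum(axis=1)

    # A tiny MLP can memorize all 200 cases; bump hidden units / max_iter if it doesn't.
    net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=5000, random_state=0)
    net.fit(X, y)
    print("training accuracy:", net.score(X, y))  # should be 1.0 on this finite domain

    def add(x, y_):
        """Hand-written loop that feeds digit pairs through the net, least significant first."""
        dx = [int(d) for d in reversed(str(x))]
        dy = [int(d) for d in reversed(str(y_))]
        carry, out = 0, []
        for i in range(max(len(dx), len(dy))):
            a = dx[i] if i < len(dx) else 0
            b = dy[i] if i < len(dy) else 0
            s = int(net.predict([[a, b, carry]])[0])
            out.append(s % 10)
            carry = s // 10
        if carry:
            out.append(carry)
        return int("".join(map(str, reversed(out))))

    print(add(98765, 4321))  # 103086, exact as long as the net fit its 200 cases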

"That's cheating. You have custom code in the loop.": but that's what an LLM does; it feeds input tokens and feeds back output tokens through the LLM one by one. So.

Now, as far as a realistic LLM, no there's no way to prove that it will always get even 1+1=2 correct. There's always a chance that something in the context will throw it off. Generally LLMs are better at interpreting questions, finding some code that maps to the answer, executing that code, and spitting out the answer. As a case in point, try asking one to solve a sudoku. It will grab some code off github, run it, and give you the answer. Now ask it to solve it by pure reasoning step-by-step. It'll get hopelessly lost, tell you numbers are in the wrong places, tell you that eliminating 7 from {2,7} leaves only {3,8}, etc. (And then finally give you the correct answer, now _that's_ cheating!)

So, if not LLMs, and not handwritten loops, the only other option is single-shot. Can a NN be trained to do math in a single run? And the answer is not really. At least, not efficiently. If you think about it, a single run through a NN only has a limited number of steps. So it's going to be limited in what it can do. If your computation requires more steps than that, all your NN can do is guess.

So no, there's really no perfect "pure" AI for math. AI tools for math are generally a combination of NNs that make guesses, and hand-written code that checks or uses those guesses to generate some feedback and ask for next steps. Which, isn't too different from how humans do it either. Make a guess, try it out, look up references, look for tools, create a tool or modify an existing one, and so on until you get it right.

emporas

11 days ago

Then you need a Large Arithmetic Model (LAM). We have that, it's called a calculator.

The LLM could invoke several command-line programs, including calculators or anything else for which a deterministic answer is desirable. Structured outputs, for example: people usually mean JSON output, but any schema like XML or HTML could be enforced by command-line tools, and when the validation fails, the model should double-check its own output and hopefully fix it.
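
A sketch of that validate-and-retry loop for the JSON case, using the jsonschema package; generate() is a hypothetical stand-in for whatever model call is in use, and the schema is just an example:

    import json
    from jsonschema import validate, ValidationError

    SCHEMA = {
        "type": "object",
        "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
        "required": ["name", "age"],
    }

    def generate(prompt):
        raise NotImplementedError("hypothetical stand-in for the LLM call")

    def structured_answer(prompt, retries=3):
        feedback = ""
        for _ in range(retries):
            raw = generate(prompt + feedback)
            try:
                obj = json.loads(raw)
                validate(instance=obj, schema=SCHEMA)  # deterministic check
                return obj
            except (json.JSONDecodeError, ValidationError) as err:
                # feed the validator's complaint back so the model can fix its own output
                feedback = "\nYour previous output was invalid: " + str(err) + ". Try again."
        raise ValueError("model never produced schema-valid output")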

sebzim4500

11 days ago

>And if they can't it shows IMO something is missing

I don't think this follows, since they are trying to replace humans, who are also not perfect at arithmetic.

Scene_Cast2

12 days ago

Standard neural nets (created through regular training methods) have no guarantees about their output. So no, there hasn't been anything like that.

I do recall someone handcrafting the weights for a transformer and getting some sort of useful algorithm or computation going, so there's that.

scotty79

11 days ago

Conversely, is there an LLM that is given a calculator and taught how to use it, so it doesn't need to waste neurons on doing simple arithmetic that neurons actually suck at?

Or even better, a simple programmable calculator and/or symbolic calculator.
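
Tool-calling setups do roughly this; the calculator tool itself can be a tiny, safe expression evaluator that the model is prompted to invoke. A minimal sketch of such a tool (the CALC(...) convention and all names here are made up for illustration):

    import ast, operator

    OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
           ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}

    def calc(expr):
        """Evaluate +, -, *, /, ** over numeric literals only -- no arbitrary code."""
        def ev(node):
            if isinstance(node, ast.Expression):
                return ev(node.body)
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            if isinstance(node, ast.BinOp) and type(node.op) in OPS:
                return OPS[type(node.op)](ev(node.left), ev(node.right))
            if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
                return OPS[type(node.op)](ev(node.operand))
            raise ValueError("disallowed expression")
        return ev(ast.parse(expr, mode="eval"))

    # The surrounding loop just has to spot the model emitting e.g. CALC(12345*6789 + 7)
    # and splice calc("12345*6789 + 7") back into the context.
    print(calc("12345 * 6789 + 7"))  # 83810212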

regularfry

11 days ago

Anything that's got access to a python interpreter would qualify.

igleria

12 days ago

I think I've recently read two seemingly contradicting things:

1- LLMs can never generalize theorem proving

2- this paper: "This suggests that contemporary LLMs may already possess rich mathematical knowledge in their parameter space, transforming the challenge from knowledge acquisition to knowledge elicitation"

Not sure what is what anymore!

bilater

12 days ago

I think the way to swallow this bitter pill is to acknowledge they can "generalize" because all human knowledge is actually a relatively "small" finite distribution that models are now big enough to pattern match on.

gmueckl

12 days ago

Calling human knowledge small is hyperbole. I cannot get any LLM even close to giving accurate answers related to the things I know. They simply do not know what I, a single human being, know. That's simply because I'm a subject matter expert on somewhat niche topics. There are easily hundreds of thousands of people like me out there.

There's simply no way an LLM can even train on all of that, because each bit of true expert knowledge is necessarily comically underrepresented in any possible training set.

user

12 days ago

[deleted]

whattheheckheck

12 days ago

Where you you instruct others to go to find out more about those niche topics?

ashirviskas

12 days ago

Nice try, AI company AI bot /s

Though I'm not even sure about "/s", it is more than feasible to build such a bot that would gather quality information sources.

UncleEntity

12 days ago

Maybe there's a way to reduce the dataset for a LLM to learn to reason down to the smallest possible set and then apply the vast knowledge of humankind on top of that?

I mean, if it can reason about and process the data as it ingests it?

Davidzheng

11 days ago

And another way is that the human brain is a relatively "small" circuit that models are now big enough to model ;)

ak_111

12 days ago

The LLM can generate the correct search space for the problem, but identifying the solution within the search space is inefficient?

Another way to put this: most students who study the lecture notes for their high school math already have it within them to get a gold at the olympiad (the math itself is not more advanced than their high school material), but getting a high school kid to get gold at the olympiad is hard. It might be something similar to P vs NP.

wrsh07

11 days ago

You are going to see a lot of people (both hype and skeptic) tell you things that you can verify. Even while you have a screenshot verifying the opposite of what they are claiming, they will continue to claim it.

For skeptics in particular, you will be able to use a top-tier LLM and see: does this do the thing someone is claiming it doesn't do? It often will. If you look at recently submitted papers by skeptics you will see them making claims about state-of-the-art LLMs but then only testing versions from over a year ago (this has happened recently!^)

The way for you to be sure what is what is to just use the thing for yourself and decide what is true.

^ https://x.com/tylercowen/status/1881051976102035880

user

12 days ago

[deleted]

solomatov

12 days ago

You could have rich mathematical knowledge while not being very good at proving theorems. Also, you might be good at solving competitive mathematics problems without having rich mathematical knowledge. It's also possible to have rich mathematical knowledge and be good at proving theorems, but mostly in the field of your expertise.

sebzim4500

12 days ago

I think that "LLMs can never X" is just always false.

solomatov

12 days ago

An LLM can never solve the halting problem (because no one can, using a Turing machine).

woctordho

11 days ago

A finite-size LLM can solve the finite-size halting problem, and an infinite-size LLM can solve the infinite-size halting problem

solomatov

11 days ago

Halting problem input has finite size (i.e. it’s a Turing machine)

theWreckluse

12 days ago

"LLMs can never predict the next word"

doug_durham

12 days ago

In the same way that image diffusion models showed that convincing approximations of the entire visual world could be summarized in a 5GB model, are "reasoning patterns" similarly compressible? Are there actually countably few reasoning patterns that are used across all domains, and as such can be captured with relatively small training sets?

HarHarVeryFunny

12 days ago

I would say there are only a smallish number of truly generic "reasoning patterns" (strategies/approaches), but applied reasoning not only requires a reasoning "pattern", but also a repertoire of valid domain-specific reasoning steps that can be applied pursuant to that approach, as well as the combination of capabilities it takes to overcome impasses when you've exhausted your knowledge and learnt reasoning steps and still not got to a solution.

Perhaps in a domain like math a smallish number of math-specific reasoning steps will go a long way, but math itself also has many "sub-domains" (algebra, geometry, calculus, topology, etc) and AFAIK the techniques of one branch are only going to be useful in another to the extent you can map the problem from one domain to another.

guyomes

12 days ago

I wonder if their curated set of 817 math problems is also useful as teaching material for training math students on a diverse set of problems.

user

12 days ago

[deleted]

Limoynada

12 days ago

If the LIMO hypothesis is true (that there is a latent capacity for efficient reasoning in small models which can be elicited by finetuning the model on a small dataset), then we could see a huge transfer of power from huge models to small models, and doing that recursively seems to offer unlimited power. But to feed that loop, those datasets would need a particular property: they teach the model to adapt its reasoning to its size, which would be verified by the model extending the depth of the reasoning chain while using a small branching factor in the exploration space, like a minimum cover that detects deep patterns.

akomtu

12 days ago

Reasoning is the art of prediction. Reasoning is distilling many observations of reality into a tiny model of reality that predicts new observations well enough. "What's the simplest model that explains most of what I'm seeing?" is the main question our mind tries to answer. When the art of creating such models is mastered, we pattern-match new problems to our models and use them to predict the outcome.

fpgaminer

12 days ago

I noticed a similar phenomenon in my work on JoyCaption when I began teaching it VQA. JoyCaption was trained on about 800k image-caption pairs, and built from so400m and Llama 3.1 8B Instruct. There's no VQA data in its training.

As an experiment, I hand built a VQA dataset of ~600 examples, which is a vanishingly small number compared to even rudimentary VQA datasets (which tend to be about 10k examples or more). However, I ensured that the dataset was broad and highly varied, and that the queries aggressively exercised both visual and textual understanding.

With only 600 training examples, I finetuned the base JoyCaption model in a handful of minutes and to my surprise, not only did it gain VQA abilities, it's able to generalize quite far outside of its training set. Even for concepts not in the original 800k caption data.

My hypothesis is that if the training data is varied enough, it forces the model to generalize. It isn't given enough examples of any given type of task to learn specialized circuitry for them, so its only option is to learn a broadly generalized set of circuitry. The data keeps it on its toes, so to speak.

Of course, this leans heavily on Llama's existing instruction (text-based) tuning, so it's starting off on good footing there. The surprising bit is being able to generalize so well to a new domain (vision) with so little data.

One caveat is that this model is highly unstable, and the accuracy of its responses is much worse than the accuracy of the base model. It's able to handle all of the tasks I've tested on it, but often requires a few retries to get it right.

Building these datasets is also tedious and intensive. I've yet to successfully train existing AIs to generate useful user queries/instructions/questions, either through prompting or finetuning. So it has to all be done by hand. And every answer was either written by me, or generated by an existing VLM and then edited by me to ensure perfect accuracy and adherence to the request. Since the queries are complex and challenging, this makes the work of writing those answers similarly challenging and time consuming.

As an aside: this training also seems to have broken Llama's alignment. I've had it be remarkably sassy in its responses, and it's much better at simulating more normal human responses.

tw1984

11 days ago

With really high quality samples, the reasoning ability of a well trained LLM can be activated using a very small number of SFT samples; this is what I learned from the paper. It is an interesting finding but not practical though, as you need a far more capable reasoning model (R1 in this case) to get those high quality 817 samples first. DeepSeek-R1-Distill-Qwen-32B has better reasoning skills according to the same benchmarks.

Another trend I've noticed is that there are already 3 papers reporting similar findings using Qwen-2.5-Instruct. Did they find something interesting about LLMs in general, or something unique to Qwen-2.5-Instruct? I guess we need more experimental results to draw conclusions.

1R053

11 days ago

I think the title of the paper is misleading. Obviously the result shows impressive performance with just a few training examples. However, I cannot see that, with the same method, reducing training data leads to more performance; they have simply (and impressively) shifted the performance curve to lower thresholds. Even with this new method, more training data should still give better results. It would be interesting to see a full performance curve for the method as a function of training data amount (and potentially quality).

ak_111

12 days ago

It's actually difficult for non-Chinese readers to work out the affiliations of the authors. SJTU = Shanghai Jiao Tong University, but I couldn't work out GAIR and IIS.

jph00

11 days ago

GAIR is the generative AI lab at SJTU.

elif

11 days ago

So it sounds like we should have schizophrenic AIs which alternate and collaborate between specialized, domain-specific submodels. I guess the number of submodels does not cost compute, so it can grow quite large, and if each of these submodels is as reduced as in this paper, the overall compute cost should drop substantially.

fallmonkey

12 days ago

While there are interesting findings here, https://arxiv.org/pdf/2502.03373 (also with a lot of good findings) suggests a somewhat contradictory theory on the critical mass of training process/data needed for reasoning capability.

antirez

12 days ago

the S1 paper did the same a few days ago, basically. 1000 total CoT with SFT.

I believe all this shows that the pre-training stage already creates the representations needed for CoT reasoning, so they are very simple to uncover, either with R1-Zero-style pure RL or with few-shot SFT.

xendo

12 days ago

Any idea if the same dataset can be used to improve human reasoning? Let's say I manually analyze the 817 math examples; would that be an optimal strategy for me to improve my math reasoning? Can the same distillation process be applied to leetcode?

viraptor

12 days ago

This training is less about learning how to reason and more about conditioning the LLM to use self-evaluations automatically. You could probably reproduce this effect yourself by sticking a paper reminder in front of you and writing "after every small step, spend 2 minutes considering if it's right and whether it works in the context of the task so far; evaluate alternatives" on it (which, yes, could likely improve reasoning).

fabmilo

12 days ago

I will believe reasoning architectures when the model knows how to store parametric information in an external memory out of the training loop.

delichon

12 days ago

  To see a World in a Grain of Sand
  And a Heaven in a Wild Flower,
  Hold Infinity in the palm of your hand
  And Eternity in an hour.

CamperBob2

12 days ago

   Come in under the shadow of this impure rock  
   And I will show you something different from either
   Your shadow at morning striding behind you
   Or your shadow at evening rising to meet you;
   I will show you wisdom in a handful of sand.

throwaway314155

12 days ago

what's the connection here? just the words "less is more"?

yalok

11 days ago

This makes me wonder if there's similar research on reducing the amount of data (by improving its quality) for pretraining.

sebzim4500

11 days ago

Yeah that was the idea behind the Phi series of models. It gets good benchmark results but you can still tell something is missing when you actually try to use it for anything.

ysofunny

12 days ago

where's chatbot-AI-zero, in the way AlphaGo Zero was the best after training with itself (and only with itself)?

sebastiennight

12 days ago

You don't want that as a product, in the sense that having an AI model train itself by simply having internal conversations without ever looking at any human-written content, might result in something that humans cannot comprehend.

Also, well - there's the technicality of "you don't 'win' a conversation like you can 'win' at Go", so how would you know to reward the model as you're training it?

CamperBob2

12 days ago

> Also, well - there's the technicality of "you don't 'win' a conversation like you can 'win' at Go", so how would you know to reward the model as you're training it?

https://i.imgur.com/CBmMSqO.png, perhaps

gpm

12 days ago

I do... I want a chatbot that can automatically magic up proofs that all my code is correct for instance. I don't care if I understand the proofs. I care if some tool that checks proofs understands them, and that's a mechanical game just like go or chess.

sebastiennight

12 days ago

In the specific example you're quoting, this would in theory be possible : train a model to just output random code in a specific language, then run it to provide feedback of whether the code was correct or not.

In the end you might be able to get a model very highly capable of outputting or validating correct code without ever having seen human code.

One issue I'm seeing with this is that the space of possible harmful code that you'd need to run on the training machine is quite vast, even in a VM. I wouldn't touch that with a 10-foot pole, or plug it to the Internet.

gpm

12 days ago

Just generating code might be interesting too, but in the above comment I was actually thinking of generating formal proofs of correctness.

The process I'm thinking of for using the model is

    Program
    ---(compiler)---> SMT definition + SMT statements for assertions
    ---(z3)---> Proof, Disproof, or "IDK" for assertions
    ↑--(proof-system)--> Filter for "IDK" assertions
    |--(ai)--> A proof of the assertion in the form of simpler assertions
    ⌞---------⌟ back to z3 step
I haven't really thought deeply about training a model off of this, but provided the compiler and z3 are robust against hostile inputs it seems fine even with randomly/AI generated programs. A less pure reinforcement learning technique, where you take code off the internet and only use re-enforcement learning to make it produce useful simpler assertions might work better.

I've started doodling with implementing this loop on top of the rust compiler, but I'm not yet at the point where I can say whether or not it works as well as I hope.
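
For what it's worth, the z3 half of that loop is easy to sketch with the z3-solver Python bindings; propose_lemmas is a hypothetical stand-in for the model, and the compiler step that produces the assertions is assumed to exist:

    from z3 import Solver, Not, sat, unsat

    def check(assertion, assumptions):
        """The z3 step: try to prove `assertion` under `assumptions` by refuting its negation."""
        s = Solver()
        s.add(*assumptions)
        s.add(Not(assertion))
        r = s.check()
        if r == unsat:
            return "proved"
        if r == sat:
            return "disproved"  # s.model() would give a counterexample
        return "idk"

    def propose_lemmas(assertion, assumptions):
        raise NotImplementedError("hypothetical AI step: simpler assertions implying the goal")

    def prove(assertion, assumptions, depth=3):
        verdict = check(assertion, assumptions)
        if verdict != "idk" or depth == 0:
            return verdict
        for lemma in propose_lemmas(assertion, assumptions):
            if prove(lemma, assumptions, depth - 1) != "proved":
                return "idk"
            assumptions = assumptions + [lemma]
        return check(assertion, assumptions)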

krisoft

12 days ago

> I want a chatbot that can automatically magic up proofs that all my code is correct for instance.

How could the AI know what you wanted to program? If it was trained only with self play it won’t understand the language where you describe the purpose of the code because it only speaks its own idiosyncratic language. (At best.)

And if it doesn’t know what you wanted to do then all it can prove is that the program does what the program does.

gpm

12 days ago

You tell it what you want it to prove. Or the tooling surrounding it does.

The tooling surrounding it might want to prove that "this main function never invokes undefined behavior", or something more local like "for all possible inputs to the public interface to this module, no undefined behavior is invoked".

Or you might want to specify constraints by hand. For examples, you might do that by writing normal tests except you can use magical variables that take on any value [1], or you might do that by annotating functions with contracts that they obey [2]. Or at a simpler level you might just annotate functions that should never panic.

Ultimately once you can prove things about your code, it's a tool in the toolbox for querying how your code works. You can use that to write correct code from the start, or to debug incorrect code, or various other things. The problem is that right now the state of the art (non-ai) can't reason about very complex code without a lot of human help - making it a fairly impractical tool. I think AI might mange to fix that.

[1] This is how kani works in rust, here's an example: https://github.com/model-checking/verify-rust-std/pull/112/f...

[2] Creusot takes this route, here's an example https://github.com/sarsko/CreuSAT/blob/master/CreuSAT/src/so...

Yoric

12 days ago

I think that there is a strong limit to that: if you don't understand the proofs, you're going to have a hard time understanding when the model explains to you why your code is not correct.

gpm

12 days ago

In most cases I expect saying "<this> assertion fires with <this> input" is enough to be useful. Or "I can't prove <this> assertion doesn't fire, but I don't have a counter example either". Assertion used broadly to include things like rules for avoiding undefined behavior.

Better explanations would be nice of course, but not obviously practical. I wouldn't actually trust the AIs reasoning much in the first place, only that it can't trick the proof checking tool.

Chio

12 days ago

We kind-of have that in DeepSeek-R1-Zero [1], but it has problems. From the original authors:

> With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing.

A lot of these we can probably solve, but as others have pointed out, we want a model that humans can converse with, not an AI built for the purpose of other AIs.

That said, it seems like a promising area of research:

> DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community.

[1] https://github.com/deepseek-ai/DeepSeek-R1

HarHarVeryFunny

12 days ago

Despite the similar "zero" names, DeepSeek-R1 Zero and AlphaGo Zero have nothing in common.

AlphaGo came before AlphaGo Zero; it was trained on human games, then improved further via self-play. The later AlphaGo Zero proved that pre-training on human games was not necessary, and the model could learn from scratch (i.e. from zero) just via self-play.

For DeepSeek-R1, or any reasoning model, training data is necessary, but hard to come by. One of the main contributions of the DeepSeek-R1 paper was describing their "bootstrapping" (my term) process whereby they started with a non-reasoning model, DeepSeek-V3, and used a three step process to generate more and more reasoning data from that (+ a few other sources) until they had enough to train DeepSeek-R1, which they then further improved with RL.

DeepSeek-R1 Zero isn't a self-play version of DeepSeek-R1 - it was just the result of the first (0th) step of this bootstrapping process whereby they used RL to finetune DeepSeek-V3 into the (somewhat of an idiot savant - one trick pony) R1 Zero model that was then capable of generating training data for the next bootstrapping step.

antirez

12 days ago

That's not what happened. R1-Zero is a model per se, released with a different set of weights. Also it's not an intermediate step obtained while making R1. In R1, a first SFT was performed before the RL training, while R1-Zero performed ONLY the RL training (on top of the raw V3).

Of course it's hard to argue that R1-Zero and AlphaZero are very similar, since in the case of AlphaZero (I'm referring to the chess model, not Go) only the rules were known to the model and no human game was shown, while here:

1. The base model is V3, which saw a lot of things in pre-training.

2. The RL for the chain of thought targets math problems that are annotated with the right result. This can be seen as somewhat similar to the chess game finishing with a positive, negative, or draw result. But still... it's text with a problem description.

However the similarity is that in the RL used for R1-Zero, the chain of thought to improve problem solving is learned starting cold, without showing the model any CoT to fine tune on it. However the model could sample from the V3 latent space itself that was full of CoT examples of humans, other LLMs, ...
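
Concretely, the reward in that RL stage can be sketched as a purely rule-based check; the tag format here is an assumption loosely following the R1 paper's template, not an exact reproduction:

    import re

    def outcome_reward(completion, gold_answer):
        """1.0 if the final <answer> block matches the annotated result, else 0.0."""
        m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        return float(m is not None and m.group(1).strip() == gold_answer.strip())

    def format_reward(completion):
        """Small bonus for keeping the chain of thought inside <think> tags."""
        return 0.1 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0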

HarHarVeryFunny

12 days ago

From reading the R1 paper, it seems the steps were:

1) V3 --RL--> R0

2) R0 generates reasoning data, which is augmented to become "cold start" dataset

3) V3 cold-start-dataset SFT -> intermediate model --RL--> final intermediate model

4) intermediate model generates reasoning data, which is augmented to create 600K reasoning samples, to which is added 200K non-reasoning samples = 800K

5) V3 800k SFT -> R1 --RL--> R1 final

Is that not a correct understanding ?

R1 Zero ("R0") can therefore be characterized as a model created as the first step of this bootstrapping/data-generating process.

It's not clear to me what data was used for the R0 RL training process, but I agree it seems to basically be leveraging some limited amount of reasoning (CoT) data naturally occurring in the V3 training set.

jebarker

12 days ago

Someone first needs to design a rule set for a game that only permits the correct use of language but encompasses the entire breadth of language use. Then it's plausible.

Thankfully for mathematics and code this seems plausible due to automated theorem proving.

wongarsu

12 days ago

The advantage of alpha-go-zero is that it is constrained to the language of go. If you made two LLM train only off each other they would develop their own language. Maybe they'd be great at reasoning, but we wouldn't understand them. Even humans in that situation would develop jargon, and as time goes on a dialect or language of their own. And humans are a lot more grounded in their language than LLMs.

user

12 days ago

[deleted]

aymaneSennoussi

11 days ago

I'm confused. This looks like a distillation of Qwen for math problems. What am I missing?

ei625

12 days ago

People here should read the paper, especially: 1. how they construct the small dataset, and 2. how they categorize the reasoning process into L1-L5 for evaluation.

ei625

12 days ago

For 1, they apply a non-reasoning model first, then apply a reasoning model.

emorning3

12 days ago

My conclusion from all that I'm reading lately is that LLMs cannot do deduction but they can fake it real good.

I mean, you wouldn't use this brand of AI to plot your path to Mars. Well, you could, BUT you'll also want to validate the path or risk dying.

But this AI is good enough for Elon and his ilk. Because Elon's not gonna get into the capsule, you are.

Because you are not the master of this AI, you are the validator.

bwfan123

12 days ago

indeed, these machines do a great mimicry of "reasoning", we get fooled by it.

the word reasoning has been subverted by those pushing these llms, and we all have bought-in. quite a magic trick this illusionist has pulled on us.

emorning3

12 days ago

Yep.

There's gonna come a time when Elon's gonna tell the AI to tell us to push all the buttons, just to see if we'll do it. And I'm pretty sure we will.

pillefitz

11 days ago

Does it matter if they mimicked it better than most humans?

ej1

12 days ago

[dead]

user

12 days ago

[deleted]