Mathematical exploration and discovery at scale

172 pointsposted 8 hours ago
by nabla9

61 Comments

kzz102

3 hours ago

It's really tiring that LLM fans will claim every progress as breakthrough and go into fantasy mode on what they can do afterwards.

This is a really good example of how to use the current capabilities of LLM to help research. The gist is that they turned math problems into problems for coding agents. This uses the current capabilities of LLM very well and should find more uses in other fields. I suspect the Alpha evolve system probably also has improvements over existing agents as well. AI is making steady and impressive process every year. But it's not helpful for either the proponents or the skeptics to exaggerate their capabilities.

smokel

an hour ago

One could say the same about these kinds of comments. If you don't like the content, simply don't read it?

And to add something constructive: the timeframes for enjoying a hype cycle differ from person to person. If you are on top of things, it might be tiring, but there are still many people out there, who haven't made the connection between, in this case, LLMs and mathematics. Inspiring some people to work on this may be beneficial in the long run.

jagged-chisel

an hour ago

GP didn’t say they didn’t like it. They criticized it. These things are not the same.

Discussions critical of anything are important to true advancement of a field. Otherwise, we get a Theranos that hangs around longer and does even more damage.

nl

5 hours ago

Hopefully this will finally stop the continuing claims[1] that LLMs can only solve problems they have seen before!

If you listen carefully to the people who build LLMs it is clear that post-training RL forces them to develop a world-model that goes well beyond a "fancy Markov chain" that some seem to believe. Next step is building similar capabilities on top of models like Genie 3[2]

[1] eg https://news.ycombinator.com/item?id=45769971#45771146

[2] https://deepmind.google/discover/blog/genie-3-a-new-frontier...

woadwarrior01

3 hours ago

Please read section 2 of the paper[1] cited in the blog post. LLMs are used as a mutation function in an evolutionary loop. LLMs are certainly an enabler, but IMO, evolutionary optimization is what deserves credit in this case.

[1]: https://arxiv.org/abs/2511.02864

ants_everywhere

3 hours ago

all mathematicians and scientists work with a feedback loop. that's what the scientific method is.

omnicognate

2 hours ago

Not one that amounts to a literal, pre-supplied objective function that's run on a computer to evaluate their outputs.

ants_everywhere

2 hours ago

That's exactly how a great deal of research level math is done.

In fact all open conjectures can be cast this way: the objective function is just the function that checks whether a written proof is a valid proof of the statement.

Is there a solution to this PDE? Is there a solution to this algebraic equation? Is there an optimal solution (i.e. we add an optimality condition to the objective function). Does there exist a nontrivial zero that is not equal to 1/2, etc.

I can't tell you how many talks I've seen from mathematicians, including Fields Medal winners, that are heavily driven by computations done in Mathematica notebooks which are then cleaned up and formalized. That means that -- even for problems where we don't know the statement in advance -- the actual leg work is done via the evaluation of computable functions against a (explicit or implicit) objective function.

omnicognate

an hour ago

Existence problems are not optimisation problems and can't, AIUI, be tackled by AlphaEvolve. It needs an optimisation function that can be incrementally improved in order to work towards an optimal result, not a binary yes/no.

More importantly, a research mathematician is not trapped in a loop, mutating candidates for an evolutionary optimiser loop like the LLM is in AlphaEvolve. They have the agency to decide what questions to explore and can tackle a much broader range of tasks than well-defined optimisation problems, most of which (as the article says) can be approached using traditional optimisation techniques with similar results.

ants_everywhere

an hour ago

> Existence problems are not optimisation problems

Several of the problems were existence problems, such as finding geometric constructions.

> It needs an optimisation function that can be incrementally improved in order to work towards an optimal result, not a binary yes/no.

This is not correct. The evaluation function is arbitrary. To quote the AlphaEvolve paper:

> or example, when wishing to find largest possible graphs satisfying a given property, ℎ invokes the evolved code to generate a graph, checks whether the property holds, and then simply returns the size of the graph as the score. In more complicated cases, the function ℎ might involve performing an evolved search algorithm, or training and evaluating a machine learning model

The evaluation function is a black box that outputs metrics. The feedback that you've constructed a graph of size K with some property does not tell you what you need to do to construct a graph of size K + M with the same property.

> a research mathematician is not trapped in a loop, mutating candidates for an evolutionary optimiser loop like the LLM is in AlphaEvolve.

Yes they are in a loop called the scientific method or the research loop. They try things out and check them. This is a basic condition of anything that does research.

> They have the agency to decide what questions to explore

This is unrelated to the question of whether LLMs can solve novel problems

> most of which (as the article says) can be approached using traditional optimisation techniques with similar results.

This is a mischaracterization. The article says that an expert human working with an optimizer might achieve similar results. In practice that's how research is done by humans as I mentioned above: it is human plus computer program. The novelty here is that the LLM replaces the human expert.

omnicognate

33 minutes ago

> finding geometric constructions

Finding optimal geometric constructions. Every problem is an optimisation because AlphaEvolve is an optimiser.

> This is not correct. The evaluation function is arbitrary.

You say this and then show details of how the score is calculated. AlphaEvolve needs a number to optimise, because it is optimiser. It can't optimise true/false.

> The feedback that you've constructed a graph of size K with some property does not tell you what you need to do to construct a graph of size K + M with the same property.

The feedback that you've constructed a graph of size K tell you that you've constructed a bigger graph than a competing solution that only constructed a graph of size K-1 and are therefore a more promising starting point for the next round of mutation

If you're trying to solve a "does there exist an X" problem, the information that none of your candidates found an X doesn't give you any information about which of them you should retain for mutation in the next step. You need a problem of the form "find the best X" (or, rather "find a good X") and for that you need a score of how well you've done. If you can find a score that actually improves steadily until you find the thing you're trying to prove the existence of then great, but generally these problems are "find the best X" where it's easy to come up with a load of competing Xs.

> The novelty here is that the LLM replaces the human expert.

That's not the claim at all. Tao said the benefits are scaling, robustness and interpretability, not that it can be operated by someone who doesn't know what they're doing.

swannodette

4 hours ago

I don't see how anything about what's presented here that refutes such claims. This mostly confirms that LLM based approaches need some serious baby-sitting from experts and those experts can derive some value from them but generally with non-trivial levels of effort and non-LLM supported thinking.

dpflan

4 hours ago

Yes, applied research has yielded the modern expert system, which is really useful to experts who know what they are doing.

wizzwizz4

2 hours ago

It's not the "modern expert system", unless you're throwing away the existing definition of "expert system" entirely, and re-using the term-of-art to mean "system that has something to do with experts".

HarHarVeryFunny

an hour ago

AlphaEvolve isn't an LLM - it's an evolutionary coding agent that uses an LLM for code generation.

https://deepmind.google/blog/alphaevolve-a-gemini-powered-co...

This is part of Google/DeepMind's "Alpha" branding (AlphaGo, AlphaZero, AlphaFold) of bespoke machine learning solutions to tough problems.

It sounds like AlphaEvolve might do well on Chollet's ARC-AGI test, where this sort of program synthesis seems to be the most successful approach.

I find Tao's use of "extremize" vs "maximize" a bit jarring - maybe this is a more normal term in mathematics?

lupire

35 minutes ago

Sometimes you want to minimize

ghm2180

2 hours ago

>.. that LLMs can only solve problems they have seen before!

This is a reductive argument. The set of problems they are solving are proposals that can be _verified_ quickly and bad solutions can be easily pruned. Software development by a human — and even more so teams — are not those kind of problems because the context cannot efficiently hold (1) Design bias of individuals (2) Slower evolution of "correct" solution and visibility over time. (3) Difficulty in "testing" proposals: You can't build 5 different types of infrastructure proposals by an LLM — which themselves are dozens of small sub proposals — _quickly_

mariusor

5 hours ago

For the less mathematically inclined of us, what is in that discussion that qualifies as a problem that has not been seen before? (I don't mean this combatively, I'd like to have a more mundane explanation)

looobay

5 hours ago

It means something that is too out-of-data. For example if you try to make an LLM write a program in a strange or very new language it will struggle in non-trivial tasks.

mariusor

5 hours ago

I understand what "a new problem for an LLM is", my question is about what in the math discussion qualifies as a one.

I see references to "improvements", "optimizing" and what I would describe as "iterating over existing solutions" work, not something that's "new". But as I'm not well versed into maths I was hoping that someone that considers the thread as definite proof for that, like parent seems to be, is capable of offering a dumbed down explanation for the five year olds among us. :)

wizzwizz4

3 hours ago

That's not what "world-model" means: see https://en.wiktionary.org/wiki/world_model. Your [2] is equivocating in an attempt to misrepresent the state-of-the-art. Genie 3 is technically impressive, don't get me wrong, but it's strictly inferior to procedural generation techniques from the 20th century, physics simulation techniques from the 20th century, and PlayStation 2-era graphics engines. (Have you seen the character models in the 2001 PS2 port of Half-Life? That's good enough.)

topaz0

5 hours ago

I think it's disingenuous to characterize these solutions as "LLMs solving problems", given the dependence on a hefty secondary apparatus to choose optimal solutions from the LLM proposals. And an important point here is that this tool does not produce any optimality proofs, so even if they do find the optimal result, you may not be any closer to showing that that's the case.

ineedasername

4 hours ago

Well, there's the goal posts moved and a Scotsman denied. It's got an infrastructure in which it operates and "didn't show its work" so it takes an F in maths.

DroneBetter

3 hours ago

well, it produced not just the solutions to the problems but also programs that generate them which can be reverse-engineered

wizzwizz4

3 hours ago

A random walk can do mathematics, with this kind of infrastructure.

Isabelle/HOL has a tool called Sledgehammer, which is the hackiest hack that ever hacked[0], basically amounting to "run a load of provers in parallel, with as much munging as it takes". (Plumbing them together is a serious research contribution, which I'm not at all belittling.) I've yet to see ChatGPT achieve anything like what it's capable of.

[0]: https://lawrencecpaulson.github.io/2022/04/13/Sledgehammer.h...

DroneBetter

2 hours ago

yeah but random walks can't improve upon the state of the art on many-dimensional numerical optimisation problems of the nature discussed here, on account of they're easy enough to to implement to have been tried already and had their usefulness exhausted; this does present a meaningful improvement over them in its domain.

ineedasername

2 hours ago

A random walk could not do the mathematics in this article-- which was essentially the entire starting point for the article.

ants_everywhere

3 hours ago

> Hopefully this will finally stop the continuing claims[1] that LLMs can only solve problems they have seen before!

The AlphaEvolve paper has been out since May. I don't think the people making these claims are necessarily primarily motivated by the accuracy of what they're saying.

piker

7 hours ago

That was dense but seemed nuanced. Anyone care to summarize for those of us who lack the mathematics nomenclature and context?

qsort

6 hours ago

I'm not claiming to be an expert, but more or less what the article says is this:

- Context: Terence Tao is one of the best mathematician alive.

- Context: AlphaEvolve is an optimization tool from Google. It differs from traditional tools because the search is guided by an LLM, whose job is to mutate a program written in a normal programming language (they used Python). Hallucinations are not a problem because the LLM is only a part of the optimization loop. If the LLM fucks up, that branch is cut.

- They tested this over a set of 67 problems, including both solved and unsolved ones.

- They find that in many cases AlphaEvolve achieves similar results to what an expert human could do with a traditional optimization software package.

- The main advantages they find are: ability to work at scale, "robustness", i.e. no need to tune the algorithm to work on different problems, better interpretability of results.

- Unsurprisingly, well-known problems likely to be in the training set quickly converged to the best known solution.

- Similarly unsurprisingly, the system was good at "exploiting bugs" in the problem specification. Imagine an underspecified unit test that the system would maliciously comply to. They note that it takes significant human effort to construct an objective function that can't be exploited in this way.

- They find the system doesn't perform as well on some areas of mathematics like analytic number theory. They conjecture that this is because those problems are less amenable to an evolutionary approach.

- In one case they could use the tool to very slightly beat an existing bound.

- In another case they took inspiration from an inferior solution produced by the tool to construct a better (entirely human-generated) one.

It's not doing the job of a mathematician by any stretch of the imagination, but to my (amateur) eye it's very impressive. Google is cooking.

omnicognate

2 hours ago

Important clarification

> search is guided by an LLM

The LLM generates candidates. The selection of candidates for the next generation is done using a supplied objective function.

This matters because the system is constrained to finding solutions that optimise the supplied objective function, i.e. to solving a specific, well-defined optimisation problem. It's not a "go forth and do maths!" instruction to the LLM.

nsoonhui

6 hours ago

>> If the LLM fucks up, that branch is cut.

Can you explain more on this? How on earth are we supposed to know LLM is hallucinating?

tux3

6 hours ago

In this case AlphaEvolve doesn't write proofs, it uses the LLM to write Python code (or any language, really) that produces some numerical inputs to a problem.

They just try out the inputs on the problem they care about. If the code gives better results, they keep it around. They actually keep a few of the previous versions that worked well as inspiration for the LLM.

If the LLM is hallucinating nonsense, it will just produce broken code that gives horrible results, and that idea will be thrown away.

empath75

10 minutes ago

The LLM basically just produces some code that either runs and produces good results or it doesn't. If it produces garbage, that is the end of the line for that branch.

qsort

6 hours ago

We don't, but the point is that it's only one part of the entire system. If you have a (human-supplied) scoring function, then even completely random mutations can serve as a mechanism to optimize: you generate a bunch, keep the better ones according to the scoring function and repeat. That would be a very basic genetic algorithm.

The LLM serves to guide the search more "intelligently" so that mutations aren't actually random but can instead draw from what the LLM "knows".

SkiFire13

5 hours ago

The final evaluation is performed with a deterministic tool that's specialized for the current domain. It doesn't care that it's getting its input from a LLM that may be allucinating.

The catch however is that this approach can only be applied to areas where you can have such an automated verification tool.

energy123

6 hours ago

Google's system is like any other optimizer, where you have a scoring function, and you keep altering the function's inputs to make the scoring function return a big number.

The difference here is the function's inputs are code instead of numbers, which makes LLMs useful because LLMs are good at altering code. So the LLM will try different candidate solutions, then Google's system will keep working on the good ones and throw away the bad ones (colloquially, "branch is cut").

ggap

3 hours ago

Exactly, he even mentioned that it's a variant of traditional optimization tool so it's not surprising to see cutting-plane methods and when the structure allows; benders decomposition

khafra

6 hours ago

Math is a verifiable domain. Translate a proof into Lean and you can check it in a non-hallucination-vulnerable way.

griffzhowl

5 hours ago

But that's not what they're doing here. They're comparing Alphaevolve's outputs numerically against a scoring function

perching_aix

3 hours ago

They did also take some of the informal proofs and formalized them using AlphaProof, emitting Lean.

griffzhowl

3 hours ago

Ah ok, I didn't notice that part, thx

ants_everywhere

2 hours ago

They put an LLM in a loop that mimics how people do real math, and it did research-level math.

Like humans, it wasn't equally capable across all mathematical domains.

The experiment was set up to mimic mathematicians who are excellent at proving inequalities, bounds, finding optimal solutions, etc. So more like Ramanujan and Erdős in their focus on a computationally-driven and problem-focused approach.

j2kun

6 minutes ago

> that mimics how people do real math

Real people do not do math like AlphaEvolve...

vatsachak

2 hours ago

As Daniel Litt pointed out on Twitter, this was the first time a lot of those problems were hit with a lot of compute. Some of AlphaEvolve's inequalities were beaten rather easily by humans and Moore's law

https://arxiv.org/abs/2506.16750

tornikeo

6 hours ago

I love this. I think of mathematics as writing programs but for brains. Not all programs are useful and to use AI for writing less useful programs would generally save humans our limited time. Maybe someday AI will help make even more impactful discoveries?

Exciting times!

analog8374

2 hours ago

It's like the "Truth Mines" from Greg Egan's "Diaspora".

muldvarp

6 hours ago

There seems to be zero reason for anyone to invest any time into learning anything besides trades anymore.

AI will be better than almost all mathematicians in a few years.

andrepd

6 hours ago

I'm very sorry for anyone with such a worldview.

throwaway0123_5

2 hours ago

Are you saying this because you think that people should still try to learn things for personal interest in a world where AI makes learning things to make money pointless (I agree completely, though what I spend time learning would change), or do disagree with their assessment of where AI capabilities are heading?

muldvarp

an hour ago

Ok. Can you explain why?

never_inline

4 hours ago

Such an AI will invent plumber robot and welder robot as well.

muldvarp

an hour ago

Robots scale much worse than knowledge work.

But yes, I'm not bullish on trades either. Trades will suck as well when everyone tries to get into them because it's the only way to still earn a living.

stOneskull

2 hours ago

  But don't you see, I came here to find a new job, a new life, a new meaning to my existence. Can't you help me?

  Well, do you have any idea of what you want to do?

  Yes, yes I have.

  What?

  (boldly) Lion taming.

muldvarp

an hour ago

The underlying assumption here is that you won't have to earn a living anymore. Unless you already own enough to keep living off of it, you'll still have to work. That work will just suck more and pay less.