Language agents achieve superhuman synthesis of scientific knowledge

53 points, posted 2 days ago
by rntn

22 Comments

tantalor

2 days ago

I have an idea for how to test whether AI can be a good scientist:

Train on all published scientific knowledge and observations up to a certain point, before a breakthrough occurred. Then see if your AI can generate the breakthrough on its own.

For example, prior to 1900 quantum theory did not exist. Given what we knew then, could AI reproduce the ideas of Planck, Einstein, Bohr etc?

If not, then AI will never be useful for generating scientific theory.

vzaliva

2 days ago

I don’t think this is the main point of the paper. They’re not claiming that AI is capable of scientific breakthroughs. Rather, they argue that AI excels at summarising vast amounts of existing scientific knowledge.

tantalor

2 days ago

That's literally what "knowledge synthesis" is. Not just summarizing, but "the combination of ideas to form a theory or system."

Breakthroughs are just a special case of synthesis.

wslh

2 days ago

Formally speaking, breakthroughs are not simply a subset of synthesis, as they can exist outside the realm of prior knowledge.

polishdude20

2 days ago

Or just have the AI generate new specific experimental setups and parameters that we can try and be like "oh yeah, we just made a room temperature superconductor".

Honestly given what we know about physics, the AI should be able to simulate physics within itself or deduce certain things we've missed.

AnimalMuppet

2 days ago

> Honestly given what we know about physics, the AI should be able to simulate physics within itself or deduce certain things we've missed.

If by "AI" you mean language models, then no, it will not "be able to simulate physics within itself". No way.

potatoman22

2 days ago

It can simulate basic problems well enough when viewed as a black box. Give it one of Galileo's experiments.

polishdude20

2 days ago

Oh no, I mean if we claim we have an AGI and it's true, it should be able to do that. LLMs are not that.

AnimalMuppet

2 days ago

Fair enough.

And in fact, I think that's an interesting line to consider for determining if something is in fact an AGI.

tiahura

2 days ago

Discover quantum mechanics or you’re a failure!

I hope your approach with your kids is a bit more nuanced.

tasty_freeze

2 days ago

Is your second sentence sincere? Attacking someone's parenting to win rhetorical points on an unrelated topic is pretty low.

sincerecook

2 days ago

How dare he have high expectations of the AI product!

tiahura

2 days ago

High expectations are one thing, and I’m an AGI skeptic, but when did being the smartest person ever become a requirement of AGI?

tantalor

2 days ago

Since always. That's what AGI means.

tiahura

2 days ago

AGI doesn’t have a universally accepted definition.

therobot24

2 days ago

I only gave the paper a quick read, but couldn't find how many humans they used to generate their expected human performance. This seems to be the main content:

> To ensure that we did not overfit PaperQA2 to achieve high performance on LitQA2, we generated a new set of 101 LitQA2 questions after making most of the engineering changes to PaperQA2. The accuracy of PaperQA2 on the original set of 147 questions did not differ significantly from its accuracy on the latter set of 101 questions, indicating that our optimizations in the first stage generalized well to new and unseen LitQA2 questions (Table 2).

> To compare PaperQA2 performance to human performance on the same task, human annotators who either possessed a PhD in biology or a related science, or who were enrolled in a PhD program (see Section 8.2.1), were each provided a subset of LitQA2 questions and a performance-related financial incentive of $3-12 per question to answer as many questions correctly as possible within approximately one week, using any online tools and paper access provided by their institutions. Under these conditions, human annotators achieved 73.8% ± 9.6% (mean ± SD, n = 9) precision on LitQA2 and 67.7% ± 11.9% (mean ± SD, n = 9) accuracy (Figure 2A, green line). PaperQA2 thus achieved superhuman precision on this task (t(8.6) = 3.49, p = 0.0036) and did not differ significantly from humans in accuracy (t(8.5) = −0.42, p = 0.66).

dcreater

2 days ago

Academic writing is notoriously hard to read and often poorly written. If this lives up to its billing, it will be a game changer: no need to rely on the sporadic, manual, intrinsically limited nature of surveys, whether from academics and analysts or from gym bros and Reddit posters.

PaulKeeble

2 days ago

One of my big uses of LLMs has been searching through medical research. The issue has been occasionally running into confidence where there shouldn't be any, but I've found it hallucinates a lot less on science topics than it does on more common topics.

andai

2 days ago

How are you searching with LLMs?

The strategies I've tried are:

1. Shove all potentially relevant data (e.g. an entire book or library) into an LLM (quite expensive, and the needle-in-a-haystack problem exists even in recent models, iirc -- though splitting into many smaller prompts seems to solve it without substantially increasing the price).

2. Vector database (in my experience overcomplicated and with spotty performance, often not much better than an expanded keyword search, sometimes worse?)

3. Web search (generate queries, run them on DuckDuckGo, read the top N results) -- decent in theory, but most top search results are crap; I need to adapt this method to use only high-quality sources instead of general-purpose search engines.

mdp2021

2 days ago

> hallucinates a lot less in science topics

Extremely dangerous for that one detail you won't expect, precisely because hallucination has become rarer; extremely dangerous in the hands of practitioners who are not paranoid about hallucination.

(Incidentally: apparently, somebody recently lost a house after having some chatbot write the contract. This is indicative of the possible level of carelessness among users.

Edit: I am trying to find that piece of news, but it seems non-trivial. Maybe the original reference itself, which reported the news, was a victim of hallucination? Meanwhile, I have found this noteworthy piece: a firm lets a chatbot present its contractual terms to customers, the chatbot presents hallucinated terms, and the firm loses the resulting legal action - https://mashable.com/article/air-canada-forced-to-refund-aft... )