Strilanc
6 days ago
When I went to the APS March Meeting earlier this year, I talked with the editor of a scientific journal and asked them if they were worried about LLM-generated papers. They said their main worry actually wasn't LLM-generated papers; it was LLM-generated reviews.
LLMs are much better at plausibly summarizing content than they are at doing long sequences of reasoning, so they're much better at generating believable reviews than believable papers. Plus reviews are pretty tedious to do, giving an incentive to half-ass it with an LLM. Plus reviews are usually not shared publicly, taking away some of the potential embarrassment.
empiko
6 days ago
We already got an LLM-generated meta-review that was very clearly just a summarization of the reviews. There were some pretty egregious cases of borderline-hallucinated remarks. This was ACL Rolling Review, so basically the most prestigious NLP venue, and the editors told us to suck it up. Very disappointing, and I genuinely worry about the state of science and how this will affect people who rely on scientometric criteria.
Al-Khwarizmi
6 days ago
This is a problem in general, but the unmitigated disaster that is ARR (ACL Rolling Review) doesn't help.
On the one hand, if you submit to a conference, you are forced to "volunteer" for that cycle. Which is a good idea from a "justice" point of view, but it's also a sure way of generating unmotivated reviewers. Not only because a person might be unmotivated in general, but because the (rather short) reviewing period may coincide with your vacation (this happened to many people with EMNLP, whose reviewing period was in the summer) and you're not given any alternative but to "volunteer" and deal with it.
On the other hand, even regular reviewers aren't treated too well. Lately they implemented a minimum max load of 4 (which can push people towards choosing uncomfortable loads; in fact, that seems to be the purpose) and loads aren't even respected (IIRC there have been emails to the tune of "some people set a max load but we got a lot of submissions so you may get more submissions than your load, lololol").
While I don't condone using LLMs for reviewing and I would never do such a thing, I am not too surprised that these things happen given that ARR makes the already often thankless job of reviewing even more annoying.
To be honest, lately I have gotten better-quality reviews from the supposedly second-tier conferences that haven't joined ARR (e.g. this year's LREC-COLING) than from ARR. Although the sample size is very small, of course.
jll29
6 days ago
Most conferences have been flooded with submissions, and ACL is no exception.
A consequence of that is that there are not sufficient numbers of reviewers available who are qualified to review these manuscripts.
Conference organizers might be keen to accept many or most of those who offer to volunteer, but clearly there is now a large pool of people who have never done this before and were never taught how to do it. Add some time pressure, and people will try out some tool, just because it exists.
GPT-generated docs have a particular tone that you can detect if you've played a bit with ChatGPT and if you have a feel for language. Such reviews should be kicked out. I would be interested to view this review (anonymized if you like - by taking out bits that reveal too narrowly what it's about).
The "rolling" model of ARR is a pain, though, because instead of slaving for a month you feel like slaving (conducting scientific peer review free of charge = slave labor) all year round. Last month, I got contacted by a book editor to review a scientific book for $100. I told her I'm not going to read 350 pages, to write two pages worth of book review; to do this properly one would need two days, and I quoted my consulting day rate. On top of that, this email came in the vacation month of August. Of course, said person was never heard of again.
joshvm
6 days ago
We had what we strongly suspect is an LLM-written review for NeurIPS. It was kind of subtle if you weren't looking carefully, and I can see that an AC might miss it. The suggestions for improvement weren't _wrong_, but the GPT response picked up on some extremely specific things in the paper that were mostly irrelevant (by contrast, the other reviewers pointed out the odd typo and suggested small corrections or improvements to statements we'd made).
Pretty hard to combat. We just rebutted as if it were a real review - maybe it was - and hoped that the chairs would see it. Speaking to other folks, opinions are split over whether this sort of review should be flagged. I know some people who tried to query a review and it didn't help.
There were other small cues. One was that the English was perfect, while other reviewers made small slips indicative of non-native speakers. Another was simply the discrepancy between the tone of the review (generally very positive) and the middle-of-the-road rating and confidence. The structure of the review was very "The authors do X, Y, Z. This is important because A, B, C." and the reviewer didn't bother to fill out any of the other review sections (they just wrote single-word answers to all of them).
The kicker was actually putting our paper into 4o and asking it to write a review and seeing the same keywords pop up.
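A minimal sketch of that kind of keyword-overlap check; the file names, stop-word list, and keyword cutoff here are illustrative assumptions, not what we actually did:

    # Rough check of how much a suspect review overlaps with a fresh
    # LLM-generated review of the same paper. Illustrative only: file
    # names, stop words, and the length cutoff are arbitrary choices.
    import re
    from collections import Counter

    STOP = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "this"}

    def keywords(text):
        # Lowercase word counts, minus trivial words.
        words = re.findall(r"[a-z]+", text.lower())
        return Counter(w for w in words if w not in STOP and len(w) > 3)

    suspect = keywords(open("review_suspect.txt").read())
    fresh = keywords(open("review_gpt4o.txt").read())

    shared = set(suspect) & set(fresh)
    jaccard = len(shared) / max(len(set(suspect) | set(fresh)), 1)
    print(f"Jaccard similarity: {jaccard:.2f}")
    print("Shared keywords:",
          ", ".join(sorted(shared, key=suspect.get, reverse=True)[:15]))
    # High overlap on unusual, paper-specific keywords is the signal;
    # common ML boilerplate will overlap regardless.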
userbinator
6 days ago
> so basically the most prestigious NLP venue
I see "dogfooding" has now been taken to its natural conclusion.
ahartmetz
6 days ago
> people who rely on scientometric criteria
Not defending LLM papers at all, but these people can go to hell. If "scientometrics" was ever a good idea, it certainly isn't anymore now that the measure has become the target. A long, carefully written, comprehensive paper is rated worse than many short, incremental, hastily written papers.
reliabilityguy
6 days ago
Well, given that the only thing that matters for tenure reviews is the "service", i.e., roughly a list of conferences the applicant reviewed for or performed some sort of service at, this is hardly a surprise.
Right now there is no incentive to do a high-quality review unless the reviewer is intrinsically motivated.
Der_Einzige
6 days ago
With NeurIPS 2024 reviews going on right now, I'm sure a whole lot of these kinds of reviews are being generated daily.
bravura
6 days ago
With the ICLR paper deadline coming up, I guess it's worth wargaming how GPT-4 would review my submission.
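A minimal sketch of what that wargame could look like with the OpenAI Python client; the model name, prompt wording, and file name are placeholder assumptions, not anything from this thread:

    # Sketch of "wargaming" a review before submission. Assumes the
    # OpenAI Python client, a placeholder model name, and a draft
    # small enough to fit in the context window.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    paper = open("draft.tex").read()

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whichever model a lazy reviewer might reach for
        messages=[
            {"role": "system", "content": (
                "You are a peer reviewer for a machine learning conference. "
                "Write a review with summary, strengths, weaknesses, "
                "questions, and a 1-10 rating with a confidence level."
            )},
            {"role": "user", "content": paper},
        ],
    )
    print(response.choices[0].message.content)

As joshvm notes below, treat the output as a weakly backed free review, not as ground truth.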
joshvm
6 days ago
See my other post - we had exactly this for NeurIPS. It is definitely worth seeing what GPT says about your paper, if only because it's a free review. The criticisms it gave us weren't wrong per se; they were just weakly backed up, and it would still be up to a reviewer to judge how relevant they are. Every paper has downsides, but you need domain knowledge to judge whether something is a small issue or a killer. Amusingly, our LLM reviewer gave a much lower score than when we asked GPT to provide a rating (and also significantly lower than the other reviewers).
One example was that GPT took an explicit geographic location from a figure caption and used it as a reference point when suggesting improvements (along the lines of "location X is under-represented on this map"), I assume because it places a high degree of relevance on figures and the abstract when summarising papers. I think you might be able to combat this by writing defensively: in our case we might have avoided that by saying "more information about geographic diversity may be found in X and the supplementary information".
zaptrem
6 days ago
Better yet, generate some adversarial perturbations to the text (or an invisible prompt) to cause it to give you a perfect review!
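Though the invisible-prompt variant is crude enough that a venue could scan for it. A rough detection sketch using PyMuPDF; the file name and thresholds are illustrative guesses, and real adversarial perturbations can be far subtler than this:

    # Scan a submission PDF for hidden spans that could carry an
    # injected prompt (white-on-white or microscopic text).
    import fitz  # PyMuPDF

    WHITE = 0xFFFFFF  # span colors are packed sRGB integers

    doc = fitz.open("submission.pdf")
    for page_number, page in enumerate(doc, start=1):
        for block in page.get_text("dict")["blocks"]:
            for line in block.get("lines", []):  # image blocks have no lines
                for span in line["spans"]:
                    if span["color"] == WHITE or span["size"] < 2:
                        print(f"page {page_number}: suspicious span "
                              f"(color={span['color']:#x}, "
                              f"size={span['size']:.1f}): "
                              f"{span['text'][:80]!r}")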
nextaccountic
6 days ago
Could you share it publicly or would you face adverse consequences?
If you can, please publish it and maybe post it here on HN or on Reddit.
jampekka
6 days ago
LLMs reviewing LLM generated articles via LLM editors is more or less guaranteed to become a massive thing given the incentive structures/survival pressures of everyone involved.
Researchers get massive CVs, reviewers and editors get off easy, admins get to show great output numbers from their institutions, and of course the publishers continue making money hand over fist.
It's a rather broken system.
basch
6 days ago
It might follow that current LLMs aren't trained to generate papers, BUT they also don't really need to reason.
They just need to mimic the appearance of reasoning and follow the same pattern of progression. Ingesting enough of what amounts to executed templates will teach them to generate their own results as if output from the same template.
Eisenstein
6 days ago
What is the difference between 'reasoning' and 'appearing to be reasoning' if the results are the same with the same input?
ben_w
6 days ago
The outputs aren't really the same; they simply seem plausible at first glance.
For example, I recently experimented with using ChatGPT to translate a Wikipedia article, on the grounds that it might maintain all the formatting and that Transformer models are also used by Google Translate.
As it was an experiment, I did actually check the results before submitting the translated article.
The first roughly three quarters were fine. The final quarter was completely invented but plausible, including the references.
LLMs are very useful tools, and I'll gladly use them to help with various tasks; they can (with low reliability, but it has happened) even manage a whole project. But right now they should be treated with caution and not left unsupervised: the Peter principle, being promoted beyond one's competence, still applies even though they're not human employees.
Sakos
6 days ago
Because the results aren't the same? I use AI every day for software development and a number of other topics. It's very easy to recognize the points where the illusion breaks, and how it breaks clearly indicates to me that there's no actual reasoning in the response I've gotten. It often feels like I'm doing the reasoning for the AI and not the other way around.
nis0s
6 days ago
From what I’ve seen, the results are not the same. In the latter scenario, there’s a risk of encountering a non sequitur all of a sudden, and the citations may be nonexistent. There’s also no guarantee that what you’re stating is factually correct when your logic is unbounded by reality.
kovezd
6 days ago
I can see how LLMs could help raise the standard in that field, for example by surveying related research. Also, maybe in the not-too-distant future, by reproducing (some of) the results.
jll29
6 days ago
Writing consists of iterated re-writing (to me, anyway), i.e. finding better and better ways to express content (1) correctly, (2) clearly, and (3) space-economically.
By writing it down (yourself) you understand what claims each piece of related work discussed has made (and can realistically make, as there sometimes are inflationary lists of claims in papers), and this helps you formulate your own claim as it relates to them (new task, novel method for a known task, like an older method but works better, nearly as good as a past method but runs faster, etc.).
If you outsource it to a machine you no longer see it through, and the result will be poor unless you are a very bad writer.
I can, however, see a role for LLMs in an electronic "learn how to write better" tutoring system.
Eisenstein
6 days ago
Does every researcher write summaries of related research themselves?
jpeloquin
5 days ago
Pretty much, yes. Critical analysis is a necessary skill that needs practice. It's also necessary to be aware of the intricacies of work in one's own topic area, defined narrowly, to clearly communicate how one's own methods are similar to or different from others' methods.