CSMastermind
4 days ago
The various deep research products don't work well for me. For example, I asked these tools yesterday: "How many unique NFL players were on the roster for at least one regular season game during the 2024 season? I'd like the specific number, not a general estimate."
I, as a human, know how to find this information. The game-day rosters for many NFL teams are available on many sites. It would be tedious but possible for me to find this number; it might take an hour of my time.
But despite this being a relatively easy research task all of the deep research tools I tried (OpenAI, Google, and Perplexity) completely failed and just gave me a general estimate.
Based on this article I tried that search just using o3 without deep research and it still failed miserably.
simonw
4 days ago
That is an excellent prompt to tuck away in your back pocket and try again on future iterations of this technology. It's going to be an interesting milestone when, or if, any of these systems get good enough at comprehensive research to provide a correct answer.
minraws
4 days ago
If you keep the prompt the same, at some point the data will appear in the training set and we might have an answer.
So even though it might be a good check today, it might not remain such a good benchmark.
I think we need a way to keep updating prompts without increasing their complexity, so we can properly verify model improvements. ARC Deep Research, anyone?
red_trumpet
4 days ago
Well, to test research capabilities, one could just adapt the year (2024 -> 2025) in the prompt.
minraws
3 days ago
I am not sure what happens if some site keeps tracking these metrics and they manage to find their way into the training data.
There are some NBA fan sites that do keep track of some of these tournament-level final metrics.
ljsprague
4 days ago
Wouldn't somebody need to answer the question below? Or do you mean the discussion of its weakness might somehow make it stronger the next time it's trained?
minraws
3 days ago
I think it can be both: what happens if discussing the weakness provides more relevant links for the question and helps a model trained on scraped web data learn somehow?
I am not sure if the model needs the exact answer, or if backlinks to sites where it can find the answer are enough. Maybe just documenting how to do it could do the job as well...
wontonaroo
4 days ago
I used Google AI Studio instead of Google Gemini App because it provides references to the search results.
Google AI Studio gave me 2227 as a possible exact answer and linked to these comments, because there is a comment further down which claims that is the exact answer. The comment was 2 hours old when I did the prompt.
It also provided a code example of how to find it using the Python NFL data library mentioned in one of the comments here.
patapong
4 days ago
So the time for test-data leakage, from posting a question and answer on the internet to LLMs having access to the answer, is less than 2 hours... Does not bode well for the benchmarks of the future!
gilbetron
4 days ago
To avoid "result corruption" I asked a similar question, but for NBA players, using o4-mini, and got a specific answer:
"For the 2023‑24 NBA regular season (which ran from October 24, 2023 to April 14, 2024), a total of 561 distinct players logged at least one game appearance, as indexed by their “Rk” on the Basketball‑Reference “Player Stats: Totals” page (the final rank shown is 561)"
Doing a quick search on my own, this number seems like it could be correct.
neom
4 days ago
Is it accurate that there are 544 rosters? If so, even at 2 minutes a roster, isn't that days of work, even if you coded something? How would you go about completing this task in 1 hour as a human? (Also, ChatGPT 4.1 gave me 2,503 and said it used the NFL 2024 fact book.)
CSMastermind
4 days ago
544 rosters but half as many games (because the teams play each other).
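The arithmetic behind those two figures can be sanity-checked in a few lines, using the 2024 season's 32 teams and 17-game schedule:

```python
teams = 32            # NFL teams in the 2024 season
games_per_team = 17   # regular-season games per team in 2024

# Each team fields one game-day roster per game it plays.
team_game_rosters = teams * games_per_team
# Each game involves two teams, so there are half as many games as rosters.
games = team_game_rosters // 2

print(team_game_rosters, games)  # 544 272
```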
Technically I can probably do it in about 10 minutes because I've worked with these kinds of stats before and know about packages that will get you this basically instantly (https://pypi.org/project/nfl-data-py/).
It's exactly 4 lines of code to find the correct answer, which is 2,227.
Assuming I didn't know about that package, though, I'd open a site like Pro Football Reference, middle-click on each game to open the page in a new tab, click through the tabs, copy-paste the rosters into Sublime Text, do some regex to get the names one per line, drop the one-per-line list into sortmylist or a similar utility, dedupe it, and then paste it back into Sublime Text to get the line count.
That would probably take me about an hour.
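Whichever route you take, package or manual, the core step is the same: deduplicating players across all game-day rosters. A minimal sketch with toy data (in practice the rows would come from something like nfl-data-py's roster imports; the game IDs and names below are made up for illustration):

```python
# Toy per-game rosters keyed by a made-up game identifier.
game_rosters = {
    "2024_W1_KC_BAL": ["P.Mahomes", "T.Kelce", "L.Jackson"],
    "2024_W2_KC_CIN": ["P.Mahomes", "T.Kelce", "J.Burrow"],
}

# A set collapses repeat appearances: a player who shows up on
# ten rosters still counts once.
unique_players = set()
for roster in game_rosters.values():
    unique_players.update(roster)

print(len(unique_players))  # 4
```

With real data you would key on a stable player ID rather than a display name, since names can collide or vary in formatting between games.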
dghlsakjg
4 days ago
If the rosters are in some sort of easily parsed or scrapable format from the NFL, as sports stats typically are, this is just a matter of finding every unique name. I imagine that would take less than an hour or two for a very beginner coder, and maybe a second or two for the code to actually run.
krainboltgreene
4 days ago
FYI for readers: all the major leagues have a stats API; most are public, and some are public but "undocumented", with tons of documentation by the community. It's quite a feat!
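Consuming one of those APIs usually boils down to walking a JSON boxscore. A sketch over a hypothetical payload; the key names here are illustrative assumptions, not any league's documented schema:

```python
import json

# Hypothetical boxscore response; real stats APIs differ in shape.
payload = json.loads("""
{
  "game_id": "2024-01",
  "teams": [
    {"name": "Home", "players": [{"id": 11, "name": "A"}, {"id": 12, "name": "B"}]},
    {"name": "Away", "players": [{"id": 21, "name": "C"}]}
  ]
}
""")

# Collect unique player IDs across both teams' rosters for this game;
# union these sets across all games to get the season-wide count.
player_ids = {p["id"] for team in payload["teams"] for p in team["players"]}
print(sorted(player_ids))  # [11, 12, 21]
```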
raybb
4 days ago
Similarly, I asked it a rather simple question: give me a list of AC repair places near me with their phone numbers. Weirdly, Gemini repeated a bunch of them 3 or 4 times, gave some completely wrong phone numbers, and found many places hours away but labeled them as being in the neighboring city.
paulsutter
4 days ago
I bet these models could create a python program that does this
Retric
4 days ago
Maybe eventually, but I bet it's not going to work with less than 30 minutes of effort on your part.
If "it might take an hour of my time" to get the correct answer, then there's a lower bound on how much effort is worth spending on a shortcut that might not work.
danielmarkbruce
4 days ago
This is just a bad match to the capabilities. What you are actually looking for is analysis, similar in nature to what a data scientist may do.
The deep research capabilities are much better suited to more qualitative research / aggregation.
pton_xd
4 days ago
> The deep research capabilities are much better suited to more qualitative research / aggregation.
Unfortunately sentiment analysis like "Tell me how you feel about how many players the NFL has" is just way less useful than: "Tell me how many players the NFL has."
southernplaces7
4 days ago
Your logic is.... strange...
Because it failed miserably at a very simple task of looking through some scattered charts, the human asking should blame themselves for this basic failure and trust it to do better with much harder and more specialized tasks?
MyPasswordSucks
4 days ago
I think you might as well be saying "robotics fail miserably at the very simple task of jogging around the block, so why should we trust the field to be able to accurately place millions of transistors within a 25cm square of silicon?"
His point is that the two tasks are very different at their core, and deep research is better at teasing out an accurate "fuzzy" answer from a swamp of interrelated data, and a data scientist is better at getting an accurate answer for a precise, sharply-defined question from a sea of comma-separated numbers.
A human readily understands that "hold the onions, hots on the side" means to not serve any onions and to place any spicy components of the sandwich in a separate container rather than on the sandwich itself. A machine needs to do a lot of educated guessing to decide whether it's being asked to keep the onions in its "hand" for a moment or keep them off the sandwich entirely, and whether black pepper used in the barbeque sauce needs to be separated and placed in a pile along with the habanero peppers.
southernplaces7
3 days ago
You seem to misunderstand my previous comment and also the thing being criticized by the post I replied to.
I understand that there are fuzzy tasks that AIs/algorithms are terrible at, which seem really simple for a human mind, and this hasn't gone away with the latest generations of LLMs. That's fine and I wouldn't criticize an AI for failing at something like the instructions you describe, for example.
However, in this case the human was asking for very specific, cut-and-dried information from easily available NFL rosters. Again, if an AI fails at that, especially because you didn't phrase the question "just so", then sorry, but no, it's not much more trustworthy for deep research and data-scientist inquiries.
What, in any case, makes you think data scientists will use superior phrasing to tease better results from an LLM under more complexity?
simonw
4 days ago
"Don’t fall into the trap of anthropomorphizing LLMs and assuming that failures which would discredit a human should discredit the machine in the same way." - https://simonwillison.net/2025/Mar/11/using-llms-for-code/#s...
daveguy
4 days ago
What I got out of that essay is that you should discredit most responses of LLMs unless you want to do just as much or more work yourself confirming the accuracy of an unreliable and deeply flawed partner. Whereas if a human "hallucinated a non-existent library or method you would instantly lose trust in them." But, for reasons, we should either give the machine the benefit of the doubt or manually confirm everything.
simonw
4 days ago
From that same essay:
> If your reaction to this is “surely typing out the code is faster than typing out an English instruction of it”, all I can tell you is that it really isn’t for me any more. Code needs to be correct. English has enormous room for shortcuts, and vagaries, and typos, and saying things like “use that popular HTTP library” if you can’t remember the name off the top of your head.
Using LLMs as part of my coding work speeds me up by a significant amount.
danielmarkbruce
4 days ago
No logic needed. Just use them, build them, play with them. You'll figure out what they are good at and what they aren't good at.
lucyjojo
4 days ago
First person that makes a good exact aggregation AI will make so much money...
Precise aggregation is what so many juniors do in so many fields of work it's not even funny...
johnnyanmac
4 days ago
If AI can't look up and read a chart, why would I trust it with any real aggregation?
netghost
4 days ago
Because AI is weird and does some things really well, and some things poorly. The terrible/exciting/weird part is figuring out which is which.
oytis
4 days ago
So it's not doing well at things we can verify and measure, but sure, it's doing much better at things we can't measure; except we can't measure them, so we have no idea how well it's actually doing. The most impressive feature of LLMs remains their ability to impress.
danielmarkbruce
2 days ago
Yup. Like humans.
oytis
2 days ago
At least in a liberal society, humans matter. Their opinions, judgements, and tastes matter. Why should the "opinion" (which is not even a real opinion) of a machine matter?
Not to mention that we validate whether to trust the opinion of a human expert by their ability to deliver measurably correct judgements, the same thing LLMs seem to not be good at.
danielmarkbruce
12 hours ago
You can't measure the results of your legal advice in most cases. There are all manner of things we can't measure well. We don't throw up our hands and say "then forget it". We do our best with the anecdotes and move forward.
kenjackson
4 days ago
o3 deep research gave me an answer after I requested an exact answer again (it gave me an estimate first): 2147.