CSMastermind
3 months ago
The various deep research products don't work well for me. For example I asked these tools yesterday, "How many unique NFL players were on the roster for at least one regular season game during the 2024 season? I'd like the specific number not a general estimate."
As a human, I know how to find this information. The game day rosters for NFL teams are available on many sites. It would be tedious but possible for me to find this number. It might take an hour of my time.
But despite this being a relatively easy research task, all of the deep research tools I tried (OpenAI, Google, and Perplexity) completely failed and just gave me a general estimate.
Based on this article I tried that search just using o3 without deep research and it still failed miserably.
simonw
3 months ago
That is an excellent prompt to tuck away in your back pocket and try again on future iterations of this technology. It's going to be an interesting milestone when or if any of these systems get good enough at comprehensive research to provide a correct answer.
minraws
3 months ago
If you keep the prompt the same, at some point the data will appear in the training set and we might get an answer.
So even though it might be a good check today, it might not remain such a good benchmark.
I think we need a way to keep updating prompts, without increasing complexity in some way, to properly verify model improvements. ARC Deep Research anyone?
wontonaroo
3 months ago
I used Google AI Studio instead of Google Gemini App because it provides references to the search results.
Google AI Studio gave me 2227 as a possible exact answer and linked to these comments, because a comment further down claims that is the exact answer. The comment was 2 hours old when I ran the prompt.
It also provided a code example of how to find it using the python nfl data library mentioned in one of the comments here.
patapong
3 months ago
So the time from posting a question and answer on the internet to LLMs having access to the answer is less than 2 hours... Does not bode well for the benchmarks of the future!
gilbetron
3 months ago
To avoid "result corruption" I asked a similar question, but for NBA players, and used o4-mini, and got a specific answer:
"For the 2023‑24 NBA regular season (which ran from October 24, 2023 to April 14, 2024), a total of 561 distinct players logged at least one game appearance, as indexed by their “Rk” on the Basketball‑Reference “Player Stats: Totals” page (the final rank shown is 561)"
Doing a quick search on my own, this number seems like it could be correct.
neom
3 months ago
Is it accurate that there are 544 rosters? If so, even at 2 minutes a roster, isn't that days of work, even if you coded something? How would you go about completing this task in 1 hour as a human? (Also, ChatGPT 4.1 gave me 2,503 and said it used the NFL 2024 fact book.)
CSMastermind
3 months ago
544 rosters but half as many games (because the teams play each other).
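As a quick check of that arithmetic, here is a minimal sketch. It assumes the 2024 format of 32 teams each playing 17 regular-season games; those two figures are not stated in this thread.

```python
# Assumed 2024 regular-season format (not stated in the thread).
TEAMS = 32
GAMES_PER_TEAM = 17


def roster_and_game_counts(teams: int, games_per_team: int) -> tuple[int, int]:
    """Return (game-day rosters, games) for a season.

    Each team fields one roster per game it plays, and every game
    involves two teams, so there are half as many games as rosters.
    """
    rosters = teams * games_per_team
    games = rosters // 2
    return rosters, games


rosters, games = roster_and_game_counts(TEAMS, GAMES_PER_TEAM)
print(rosters, games)
```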
Technically I can probably do it in about 10 minutes because I've worked with these kinds of stats before and know about packages that will get you this basically instantly (https://pypi.org/project/nfl-data-py/).
It's exactly 4 lines of code to find the correct answer, which is 2,227.
Assuming I didn't know about that package, though, I'd open a site like Pro Football Reference, middle click on each game to open the page in a new tab, click through the tabs, copy and paste the rosters into Sublime Text, do some regex to get the names one per line, drop the new one-per-line list into sortmylist or a similar utility, dedupe it, and then paste it back into Sublime Text to get the line count.
That would probably take me about an hour.
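The regex, dedupe, and line-count steps of that manual workflow could be sketched roughly as follows. The roster line layout (jersey number, name, position code) is a hypothetical assumption for illustration, not the format of any particular site, and the sample names are placeholders.

```python
import re


def unique_names(pasted_text: str) -> list[str]:
    """Return the sorted, de-duplicated player names found in pasted text.

    Assumes (hypothetically) each roster line looks like
    "12 Alex Example QB": a jersey number, the name, then a position code.
    Lines that do not fit that shape are skipped.
    """
    line_pattern = re.compile(r"^\s*\d+\s+(?P<name>.+?)\s+[A-Z]{1,3}\s*$")
    names = set()
    for line in pasted_text.splitlines():
        match = line_pattern.match(line)
        if match:
            # Collapse internal whitespace so copies of a name compare equal.
            names.add(" ".join(match.group("name").split()))
    return sorted(names)


# Placeholder data: the same player pasted from two different games.
sample = """\
12 Alex Example QB
80 Sam Placeholder WR
12 Alex Example QB
"""
print(len(unique_names(sample)))
```

Deduplicating on the name string alone treats two different players who share a name as one person, so a count produced this way would need a manual check.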
dghlsakjg
3 months ago
If the rosters are in some sort of easily parsed or scrapable format from the NFL, as sports stats typically are, this is just a matter of finding every unique name. This is something that I imagine would take less than an hour or two for a beginner coder, and maybe a second or two for the code to actually run.
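A minimal sketch of that counting step, assuming the rosters have already been parsed into records. The `RosterEntry` layout and the sample IDs are hypothetical. Counting by a player ID instead of by name is a swapped-in choice here, since two players can share a name and one player can appear for more than one team.

```python
from typing import Iterable, NamedTuple


class RosterEntry(NamedTuple):
    """One player on one team's game-day roster (hypothetical layout)."""
    team: str
    week: int
    player_id: str
    name: str


def count_unique_players(entries: Iterable[RosterEntry]) -> int:
    """Count distinct players across all rosters, keyed by player ID."""
    return len({entry.player_id for entry in entries})


# Placeholder records: one player in two weeks, one player on two teams,
# and two different players who happen to share a name.
entries = [
    RosterEntry("AAA", 1, "p001", "Alex Example"),
    RosterEntry("AAA", 2, "p001", "Alex Example"),
    RosterEntry("AAA", 1, "p002", "Sam Placeholder"),
    RosterEntry("BBB", 5, "p002", "Sam Placeholder"),
    RosterEntry("BBB", 1, "p003", "Alex Example"),
]
print(count_unique_players(entries))
```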
raybb
3 months ago
Similarly, I asked it a rather simple question: give me a list of AC repair places near me with their phone numbers. Weirdly, Gemini repeated a bunch of them 3 or 4 times, gave some completely wrong phone numbers, and found many places hours away but labeled them as being in the neighboring city.
paulsutter
3 months ago
I bet these models could create a Python program that does this.
Retric
3 months ago
Maybe eventually, but I bet it’s not going to work with less than 30 minutes of effort on your part.
If “It might take an hour of my time.” to get the correct answer, then there’s a lower bound for trying a shortcut that might not work.
danielmarkbruce
3 months ago
This is just a bad match to the capabilities. What you are actually looking for is analysis, similar in nature to what a data scientist may do.
The deep research capabilities are much better suited to more qualitative research / aggregation.
pton_xd
3 months ago
> The deep research capabilities are much better suited to more qualitative research / aggregation.
Unfortunately sentiment analysis like "Tell me how you feel about how many players the NFL has" is just way less useful than: "Tell me how many players the NFL has."
southernplaces7
3 months ago
Your logic is.... strange...
Because it failed miserably at a very simple task of looking through some scattered charts, the human asking should blame themselves for this basic failure and trust it to do better with much harder and more specialized tasks?
lucyjojo
3 months ago
First person that makes a good exact aggregation AI will make so much money...
Precise aggregation is what so many juniors do in so many fields of work it's not even funny...
johnnyanmac
3 months ago
If AI can't look up and read a chart, why would I trust it with any real aggregation?
oytis
3 months ago
So it's not doing well at things that we can verify/measure, but sure, it's doing much better at things we can't measure, except we can't measure them, so we have no idea how well it is actually doing. The most impressive feature of LLMs remains their ability to impress.
kenjackson
3 months ago
o3 deep research gave me an answer after I requested an exact answer again (it gave me an estimate first): 2147.