throwup238
9 days ago
The leapfrogging at this point is getting insane (in a good way, I guess?). The amount of time each state-of-the-art feature gets before it's supplanted is down to a few weeks.
LLMs were always a fun novelty for me until OpenAI Deep Research, which started to actually come up with useful results on more complex programming questions (where I needed to write all the code by hand but had to pull together lots of different libraries and APIs), but it was limited to 10/month on the cheaper plan. Then Google Deep Research upgraded to 2.5 Pro, with paid usage limits of 20/day, which allowed me to just throw everything at it, to the point where I'm still working through reports that are a week or more old. Oh, and it searched up to 400 sources at a time, significantly more than OpenAI, which made it quite useful in historical research like identifying first-edition copies of books.
Now Claude is releasing the same research feature with integrations (excited to check out the Cloudflare MCP auth solution and hoping Val.town gets something similar), and a run time of up to 45 minutes. The pace of change was overwhelming half a year ago, now it's just getting ridiculous.
user_7832
9 days ago
I agree with your overall message - rapid growth appears to encourage competition and forces companies to put their best foot forward.
However, unfortunately, I cannot shower much praise on Claude 3.7. And if you (or anyone) asks why (surely 3.7 is much better than 3.5?), then I’m moderately sure that you use Claude much more for coding than for any kind of conversation. In my opinion, even 3.5 Haiku (which is available for free during high loads) is better than 3.7 Sonnet.
Here’s a simple test: try asking 3.7 to intuitively explain anything technical, say, mass-dominated vs. spring-dominated oscillations. I’m a mechanical engineer who studied this stuff, and I could not understand 3.7’s analogies.
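(For reference, the rough textbook picture I'd expect back, using the undamped driven oscillator for simplicity:

    m\ddot{x} + kx = F_0\sin(\omega t)
    \quad\Rightarrow\quad
    |X| = \frac{F_0}{\lvert k - m\omega^2 \rvert}

Well below the natural frequency \omega_n = \sqrt{k/m}, the kx term dominates and |X| \approx F_0/k: spring-dominated, stiffness sets the response. Well above it, the m\omega^2 term dominates and |X| \approx F_0/(m\omega^2): mass-dominated, inertia sets the response.)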
I understand that coders are the largest single group of Claude’s users, but Claude went from being my most used app to being used only after both chatgpt and Gemini, something that I absolutely regret.
garrickvanburen
8 days ago
My current hypothesis: the more familiar you are with a topic the worse the results from any LLM.
danw1979
8 days ago
Amen to this. As soon as you ask an LLM to explain something in detail that you’re a domain expert in, that’s when you notice the flaws.
startupsfail
8 days ago
Yes, it’s particularly bad when the information found on the web is flawed.
For example, I’m not a domain expert, but I was looking for an RC motor for a toy project, and OpenAI happily tried to source a few with Deep Research. Except the best candidate it picked contained an obvious typo in the motor spec (68 grams instead of 680 grams), which is just impossible for a motor of the specified dimensions.
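A back-of-the-envelope check makes the typo obvious. The can size and the 5 g/cm^3 bulk density below are hypothetical, just to illustrate the argument:

    # Could a motor of these dimensions weigh 68 g? The 50 mm x 30 mm can
    # and the ~5 g/cm^3 bulk density (copper, steel, magnets, aluminium)
    # are made-up but typical-ish numbers for an RC outrunner.
    import math

    diameter_cm, length_cm = 5.0, 3.0
    volume = math.pi * (diameter_cm / 2) ** 2 * length_cm  # ~58.9 cm^3
    implied = 68 / volume    # ~1.15 g/cm^3: lighter than water, absurd
    expected = volume * 5.0  # ~295 g: physically plausible; 68 g is not
    print(f"implied density {implied:.2f} g/cm^3, expected ~{expected:.0f} g")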
parineum
8 days ago
> Yes, it’s particularly bad when the information found on the web is flawed.
It's funny you say that, because I was going to echo your parent's sentiment and point out that it's exactly the same with any news article you read.
The majority of content these LLMs are consuming is not from domain experts.
91bananas
8 days ago
I had it generate a baseball lineup the other day. It printed out a list of the 13 kids' names, then said (12 players). It just straight up miscounted what it was doing, throwing a wrench into everything else it was doing beyond that point.
mac-mc
8 days ago
He was saying that 3.5 is better than 3.7 on the same topic he knows well tho.
jeswin
8 days ago
> My current hypothesis: the more familiar you are with a topic the worse the results from any LLM.
That's not really true, since your prompts are also getting better. Better input leads to better output remains true, even with LLMs (when you see it as a tool).
franga2000
8 days ago
Being more familiar with the topic definitely doesn't always make your prompts better. For a lot of things the prompt doesn't really change (explain X, compare X and Y...), and that is what is being discussed here. For giving "building" instructions (like writing code) it helps a bit, but even if you know exactly what you want it to write, getting it to do that is pretty much trial and error (too much detail makes it follow word-for-word and produce bad code, too little and it misses important parts or makes dumb mistakes).
jm547ster
8 days ago
The opposite may be true: the more effective the model, the lazier the prompting, since it can seemingly handle not being micromanaged the way earlier versions had to be.
subpixel
6 days ago
The more familiar you are with the state of “Jira hygiene” in the megacorp environment, the less hope you have that LLMs will be able to make sense of things.
That said, the “AI all the things” mandates could be the lever that ultimately accomplishes what 100+ PjMs couldn’t - making people write issues as if they really mattered. Because garbage in, garbage out.
bsenftner
8 days ago
It is like this with expert humans too. Which is why, no matter what, we will continue to require expert humans not just "in the loop" but as the critical cogs that are the loop itself, just as it has always been. However, this time around those people will have AI augmentation, and will be intellectual athletes of a nature our civilization has never seen.
simsla
8 days ago
I always tell people to trust the LLM to the same extent as an intern. Avoid giving it tasks you cannot verify the correctness of.
user_7832
8 days ago
That is certainly the case in niche topics where published information is lacking, or needs common sense to synthesize proper outputs [1].
However, in this specific example, I don't remember if it was ChatGPT or Gemini or 3.5 Haiku, but the other(s) explained it well enough. I think I re-asked 3.5 Haiku at a later point in time, and to my complete non-surprise, it gave an answer that was quite decent.
1 - For example, the field of DIY audio, which was funnily enough the source of my question. I'm no speaker designer, but combining creativity with engineering basics/rules of thumb seems to be something LLMs struggle with terribly. Ask them to design a speaker and they come up with the most vanilla, tired, textbook design, despite several existing products on the market that are already far more innovative.
I'm confident that if you asked an LLM an identical question for which there is more discourse - e.g. make an interesting/innovative phone - you'd get relatively much better results.
terminalcommand
8 days ago
I built open baffle speakers based on measurements and discussion I had with Claude. I think it is really good.
I am a novice, maybe that's why I liked it.
eru
8 days ago
Not really. I'm getting pretty good Computer Science theory out of Gemini and even ChatGPT.
tiberriver256
8 days ago
3.7 did score higher in coding benchmarks but in practice 3.5 is much better at coding. 3.7 ignores instructions and does things you didn't ask it to do.
sannee
8 days ago
I suspect that is precisely why it got better at coding benchmarks.
spaceman_2020
8 days ago
3.7 is too overactive
I prefer Gemini 2.5 pro for all code now
hombre_fatal
8 days ago
Gemini 2.5 Pro has solved problems that Claude 3.7 cannot, so I use it for the hard stuff.
But Gemini is at least as overactive as Claude, sometimes even more overactive when it comes to something like comment spam.
Of course, this can be fixed with prompting. And sometimes it feels sheepish complaining about the machine god doing most of my chore work that didn't even exist a couple years ago.
conception
8 days ago
2.5 is my “okay, Claude can’t get it” fallback, but first I check my “bank account” to see if I can afford it.
ralusek
8 days ago
Isn’t 2.5 pro significantly cheaper?
yunwal
8 days ago
They're the same price, and Gemini has a large free tier.
conception
7 days ago
Not when you’re doing 500k tokens per query.
UncleEntity
8 days ago
I think it just does that to eat up your token quota and get you to upgrade.
Like, ask it a simple question and it comes up with a full repo, complete with a README and a Makefile, when all you wanted to know was how efficient a particular algorithm would be in the included code.
Can't wait until they add research to the Pro plan because, you know, I have questions...
vineyardmike
8 days ago
> I think it just does that to eat up your token quota and get you to upgrade.
If you pay for a subscription then they don’t have an incentive to use more tokens for the same answer.
It’s definitely because feedback from people has “taught” it that more boilerplate is better. It’s the same reason ChatGPT is annoyingly complimentary.
suyash
8 days ago
That has been the most annoying thing about it, so glad not paying for it anymore.
danw1979
8 days ago
Can’t you still use Sonnet 3.5 anyway? Or is that a paying-subscriber feature only?
csomar
8 days ago
Plateauing overall, but apparently you can gain in certain directions while losing in others. I wrote an article a while back arguing that current models are not that far from GPT-3.5: https://omarabid.com/gpt3-now
3.7 is definitively better at coding, but you feel it lost a bit of maneuverability in other domains. For someone who wants code generated it doesn't matter, but I've found myself using DeepSeek first and then getting the code output from 3.7.
fastball
8 days ago
Seems clear to me that Claude 3.7 suffers from overfitting, probably due to Anthropic seeing that 3.5 was a smash hit in the LLM coding space and deciding their North star for 3.7 should be coding benchmarks (which, like all benchmarks, do not properly capture the process of real-world coding).
If it was actually good they would've named it 4.0; the fact that they went from 3.5 to 3.7 (a weird jump) speaks volumes, imo.
airstrike
9 days ago
I too like 3.5 better than 3.7 and I use it pretty often. It's like 3.7 is better in 2 metrics but worse in 10 different ones
joshstrange
9 days ago
I use Claude mostly for coding/technical things and something about 3.7 does not feel like an upgrade. I haven't gone back to 3.5 (mostly started using Gemini Pro 2.5 instead).
I haven't been able to use Claude research yet (it's not rolled out to the Pro tier) but o1 -> o3 deep research was a massive jump IMHO. It still isn't perfect but o1 would often give me trash results but o3 deep research actually starts to be useful.
3.5->3.7 (even with extended thinking) felt like a nothingburger.
mattlutze
9 days ago
The expectation that one model be top marks for all things is, imo, asking too much.
greymalik
8 days ago
Out of curiosity - can you give any examples of the programming questions you are using deep research on? I’m having a hard time thinking of how it would be helpful and could use the inspiration.
dimitri-vs
8 days ago
Easy: any research task that would take you more than five minutes is worth firing off a Deep Research request for while you work on something else in parallel.
I use it a lot when documentation is vague or outdated. When Gemini/o3 can't figure something out after 2 tries. When I am working with a service/API/framework/whatever that I am very unfamiliar with and I don't even know what to Google search.
jerpint
8 days ago
Have you tried using llms.txt when available? Very useful resource
emorning3
8 days ago
I often use Chrome to validate what I think I know.
I recently asked Chrome to show me how to apply the Knuth-Bendix completion procedure to propositional logic, and I had already formed my own thoughts about how to proceed (I'm building a rewrite system that does automated reasoning).
The response convinced me that I'm not a total idiot.
I'm not an academic and I'm often wrong about theory so the validation is really useful to me.
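(For the curious, one well-known concrete answer here, sketched from the textbook rather than from the model's response, is Hsiang's canonical system over the Boolean ring. Translate the connectives into exclusive-or and conjunction:

    \neg x = x \oplus 1, \qquad
    x \vee y = (x \wedge y) \oplus x \oplus y, \qquad
    x \Rightarrow y = (x \wedge y) \oplus x \oplus 1

then rewrite, modulo associativity and commutativity of \oplus and \wedge, with:

    x \oplus 0 \to x, \quad x \oplus x \to 0, \quad
    x \wedge 1 \to x, \quad x \wedge 0 \to 0, \quad x \wedge x \to x, \quad
    x \wedge (y \oplus z) \to (x \wedge y) \oplus (x \wedge z)

Every tautology then normalizes to 1 and every contradiction to 0.)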
miki_oomiri
8 days ago
"Chrome" ? What do you mean? Gemini?
scargrillo
8 days ago
That’s a perfect example of LLMs providing epistemic scaffolding — not just giving you answers, but helping you check your footing as you explore unfamiliar territory. Especially valuable when you’re reasoning through something structurally complex like rewrite systems or proof strategies. Sometimes just seeing your internal model reflected back (or gently corrected) is enough to keep you moving.
itissid
8 days ago
I've been using it for pre-scoping things I have no idea about, and rapidly iterating by re-feeding it a version with guard rails and conditions from previous chats.
Like, I wanted to scope out how to build a homemade TrueNAS Scale unit. It helped me avoid pitfalls, like knowing that I needed two GPUs minimum to run the OS and local LLMs, and it sped up the config for a CLI backup of my Dropbox locally (it told me to use the right filesystem format over ZFS to make the Dropbox client work).
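Context for anyone else trying this: the Dropbox Linux client has historically been picky about the filesystem backing its sync folder (for a long time it required ext4), hence layering the right format on top of ZFS. A quick sanity check, where the path is just an example:

    # Print the filesystem type backing a directory (Linux), by finding
    # the longest matching mount point in /proc/mounts. Dropbox's client
    # has historically refused to sync from filesystems it doesn't support.
    import os

    def fs_type(path):
        path = os.path.realpath(path)
        mount, fstype = "", "unknown"
        with open("/proc/mounts") as mounts:
            for line in mounts:
                _dev, mnt, fs, *_rest = line.split()
                if path.startswith(mnt) and len(mnt) > len(mount):
                    mount, fstype = mnt, fs
        return fstype

    print(fs_type(os.path.expanduser("~/Dropbox")))  # hoping for: ext4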
It has done everything from researching how I can structure my web app for building a payment system (something I knew nothing about) to writing small tools that talk to my document collection and index it into collections in Anki, all in one day.
iLoveOncall
8 days ago
Calling some APIs is leap-frogging? You could do this with GPT-3, nothing has changed except it's branded under a new name and tries to establish a (flawed) standard.
If there was truly any innovation still happening in OpenAI, Anthropic, etc., they would be working on models only, not on side features that someone could already develop over a weekend.
never_inline
8 days ago
Why would you love on-call though?
iLoveOncall
8 days ago
In my previous team most of our oncall requests came from bug reports by customers on various tools that we owned, so to be able to work on random tools that my team owned was a nice change of pace / scenery compared to working on the same thing for 3 months uninterrupted.
Now I'm in a new team where 99% of our oncall tickets come from automated alarms, and 80% of those are a subset of a few issues whose root cause isn't easy to address, but where there is either nothing to actually do once investigated, or the fix is a one-time process that is annoying to run. So the username isn't accurate anymore :)
I still like the change of pace though, 0 worries about sprint tasks or anything else for a week every few months.
risyachka
8 days ago
What are you talking about?
It has literally stagnated for a year now.
All that's changed is they connect more APIs.
And add a thinking loop with the same model powering it.
This is the reason it seems fast - nothing really happens except the easy things.
tymscar
8 days ago
I totally agree with you, especially if you actually try using these models, not just looking at random hype posters on twitter or skewed benchmarks.
That being said, isn’t it strange how the community has polar opposite views about this? Did anything like this ever happen before?
apwell23
8 days ago
> DeepResearch which started to actually come up with useful results on more complex programming questions
Is there a YouTube video of people using this on complex open source projects like the Linux kernel, or maybe something like PyTorch?
How come none of the OSS projects (at least not the ones I follow) are progressing fast(er) thanks to AI like 'deep research'?
wilg
8 days ago
o3, since it can web search while reasoning, is a really useful lighter-weight deep research.
ilrwbwrkhv
9 days ago
None of those reports are any good though. Maybe for shallow research, but I haven't found them deep. Can you share what kind of research you have been trying there where it has done a great job of actual deep research?
Balgair
9 days ago
I'm echoing this sentiment.
Deep Research hasn't really been that good for me. Maybe I'm just using it wrong?
Example: I want the precipitation in mm and monthly high and low temperature in C for the top 250 most populous cities in North America.
To me, this prompt seems like a pretty anodyne and obvious task for Deep Research. It's long and tedious, but mostly coming from well-structured data sources (Wikipedia) across two languages at most.
But when I put this into any of the various models, I mostly get back ways to go and find that data myself. Like, I know how to look at Wikipedia; it's that I don't want to comb through 250 pages manually or try to write a script to handle all the HTML boxes. I want the LLM/model to do this days-long tedious task for me.
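To be clear, the script route is doable; it's the tedium I want to offload. A rough sketch of what I keep not wanting to write (city list truncated, and the "Average high" row label is an assumption; real pages vary):

    # Pull the rendered climate ("weather box") table from each city's
    # Wikipedia page. Needs pandas plus lxml for the HTML parsing.
    import pandas as pd

    CITIES = ["Mexico City", "New York City", "Los Angeles"]  # ...top 250

    def climate_table(city):
        url = "https://en.wikipedia.org/wiki/" + city.replace(" ", "_")
        for table in pd.read_html(url):
            first_col = table.iloc[:, 0].astype(str)
            if first_col.str.contains("Average high").any():
                return table
        return None

    for city in CITIES:
        t = climate_table(city)
        print(city, "found" if t is not None else "no climate table")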
sxg
8 days ago
That's actually not what deep research is for, although you can obviously use it however you like. Your query is just raw data collection—not research. Deep research is about exploring a topic primarily with academic and other high-quality sources. It's a starting point for your own research. Deep research creates a summary report in ~10 min from more sources than you could probably read in a month, and then you can steer the conversation from there. Alternatively, you can just use deep research's sources as a reading list for yourself so you can do your own analysis.
Balgair
8 days ago
I think we have very different definitions of the word 'research' then.
I'd say that what you're saying is 'synthesis'. The 'Intro/Discussion' sections of a journal article.
For me, 'research' means the work of going through and getting all the data in the first place. Like, going out and collecting dino bones in the hot sun, measuring all the soil samples, etc. - that is research. As for asking these models to go collate some webpages, I mean, you spend the first weeks of a summer undergrad's time on this kind of thing to get them used to the file systems and spruce up their organization skills, to see where they're at. Writing the paper up, that's part of research, sure, but not the hard part that really matters.
sxg
8 days ago
Agreed—we're working with different definitions of "research". The deep research products from OpenAI, Google Gemini, and Perplexity seem to be more aligned with my definition of research if that helps you gain more utility from them.
tomrod
8 days ago
It's excellent at producing short literature reviews on open access papers and data. It has no sense of judgment, trusting most sources unless instructed otherwise.
fakedang
8 days ago
Gemini's Deep Research is very good at discriminating between sources though, in my experience (haven't tried Claude or Perplexity). It finds really obscure but very relevant documents that don't even show up in Google Search for the same queries. It also discounts results that are otherwise irrelevant or very low-value from the final report. But again, it is just a starting point as the generated report is too short, and I make sure to check all the references it gives once again. But that's where I find its value.
85392_school
9 days ago
The funny thing is that if your request only needed the top 100's temperature or the top 33's precipitation, it could just read "List of cities by average temperature" or "List of cities by average precipitation" and that would be it, but the top 250 requires reading 184x more pages.
My perspective on this is that if Deep Research can't do something, you should do it yourself and put the results on the internet. It'll help other humans and AIs trying to do the same task.
Balgair
8 days ago
Yeah, that was intentional, well, somewhat.
The project requires the full list of every known city in the western hemisphere, and also Japan, Korea, and Taiwan. But that dataset is just maddeningly large, if it is possible at all. Like, I expect it to take me years, as I have to do a lot of translations. So, I figured that I'd be nice and just ask the various models for the top 250.
There's a lot more data that we're trying to get, too, and I'm hoping that I can get approval to post it, as it's a work thing.
therein
8 days ago
Sounds like you're having it conduct research and then solve the knapsack problem for you on the collected data. We should do the same for the traveling salesman one.
How do you validate its results in that scenario? Just take its word for it?
Balgair
8 days ago
Ahh, no. We'll be doing more research on the data once we have it. Things like rankings, averages, and distributions of the data will come later, but first we need the data to begin with.
wyre
8 days ago
If you have the data, but need to parse all of it, couldn’t you upload it to your LLM of choice (with a large enough context window) and have it finish your project?
Balgair
8 days ago
I'm sorry I was unclear. No, I do not have the data yet and I need to get it.
XenophileJKO
8 days ago
Well, remember that listing/ranking things is structurally hard for these models, because they have to keep track of what they have listed and what they haven't, etc.
spaceman_2020
8 days ago
My wife, who is writing her PhD right now and teaches undergraduate students, says these models are at the level of a really bright final-year undergrad.
Maybe in a year, they’ll hit the graduate level. But we’re not near PhD level yet
xrdegen
8 days ago
It is because you are just such a genius who already knows everything, unlike us stupid people who find these tools amazingly useful and informative.
cwillu
8 days ago
The failure mode is that people unfamiliar with a subject aren't able to distinguish careful analysis from bullshit. However, the second failure mode, where someone pointing that out is assumed to be calling people stupid, is a longstanding wetware bug.
spaceman_2020
8 days ago
Gemini 2.5 pro was the moment for me where I really thought “this is where true adoption happens”
All those talks about AI replacing people seemed a little far fetched in 2024. But in 2025, I really think models are getting good enough
antupis
8 days ago
You still need "human in the loop" because with simple tasks or some tasks that have lots of training material, models can one-shot answer and are like super good. But if the domain grows too complex, there are some not-so-obvious dependencies, or stuff that is in bleeding edge. Models fail pretty badly. So you need someone to split those complex tasks to more simpler familiar steps.