tasty_freeze
20 hours ago
One thing I find frustrating is that management where I work has heard of 10x productivity gains. Some of those claims even come from early adopters at my work.
But that sets expectations way too high. Partly it is due to Amdahl's law: I spend only a portion of my time coding, and far more time thinking and communicating with others who are customers of my code. Even if it does make the coding 10x faster (and it doesn't most of the time), overall my productivity is 10-15% better. That is nothing to sneeze at, but it isn't 10x.
TeMPOraL
18 hours ago
Maybe it's due to a more R&D-ish nature of my current work, but for me, LLMs are delivering just as much gains in the "thinking" part as in "coding" part (I handle the "communicating" thing myself just fine for now). Using LLMs for "thinking" tasks feels similar to how mastering web search 2+ decades ago felt. Search engines enabled access to information provided you know what you're looking for; now LLMs boost that by helping you figure out what you're looking for in the first place (and then conveniently searching it for you, too). This makes trivial some tasks I previously classified as hard due to effort and uncertainty involved.
At this point I'd say about 1/3 of my web searches are done through ChatGPT o3, and I can't imagine giving it up now.
(There's also a whole psychological angle in how having an LLM help sort and rubber-duck your half-baked thoughts makes many tasks seem much less daunting, and that alone makes a big difference.)
jorl17
17 hours ago
This, and if you add in a voice mode (e.g. ChatGPT's Advanced Mode), it is perfect for brainstorming.
Once I decide I want to "think a problem through with an LLM", I often start with just the voice mode. This forces me to say things out loud — which is remarkably effective (hear hear, rubber duck debugging) — and it also gives me a fundamentally different way of consuming the information the LLM provides me. Instead of being delivered a massive amount of text, where some information could be wrong, I get a sequential system where I can stop, pause, or redirect the LLM as soon as something makes me curious or I find problems with what it said.
You would think this way of interacting would be limiting, since having a fast LLM output large chunks of information would let you skim through it and commit it to memory faster. Yet, for me, the combination of hearing things and, most of all, not having to consume so much potentially wrong info (what good is skimming pointless stuff?) makes ChatGPT's Advanced Voice mode a great way to initially approach a problem.
After the first round with the voice mode is done, I often move to written-form brainstorming.
adi_kurian
13 hours ago
This, 100%. Though I think there is a personality component to it. I, at least, think when I speak.
seba_dos1
9 hours ago
From time to time I use an LLM to pretend to research a topic that I had researched recently, to check how much time it would have saved me.
So far, most of the time, my impression was "I would have been so badly misled and wouldn't even know it until too late". It would have saved me some negative time.
The only thing LLMs can consistently help me with so far is typing out mindless boilerplate, and yet it still sometimes requires manual fixing (but I do admit that it still does save effort). Anything else is hit or miss. The kind of stuff it does help researching with is usually the stuff that's easy to research without it anyway. It can sometimes shine with a gold nugget among all the mud it produces, but it's rare. The best thing is being able to describe something and ask what it's called, so you can then search for it in traditional ways.
That said, search engines have gotten significantly worse for research in the last decade or so, so the bar is lower for LLMs to be useful.
TeMPOraL
7 hours ago
> So far, most of the time, my impression was "I would have been so badly misled and wouldn't even know it until too late". It would have saved me some negative time.
That was my impression with Perplexity too, which is why I mostly stopped using it, except for when I need a large search space covered fast and am willing to double-check anything that isn't obviously correct. Most of the time, it's o3. I guess this is the obligatory "are you using good enough models" part, but it really does make a difference. Even in ChatGPT, I don't use "web search" with the default model (gpt-4o) because I find it hallucinates or misinterprets results too much.
> The kind of stuff it does help researching with is usually the stuff that's easy to research without it anyway.
I disagree, but then maybe it's also a matter of attitude. I've seen co-workers do the exact same research as I did, in parallel, using the same tools (Perplexity and later o3); they tend to do it 5-10x faster than I do, but then they get bad results, and I don't.
Thing is, I have an unusually high need to own the understanding of anything I'm learning. So where some co-workers are happy to vibe-check the output of o3 and then copy-paste it to the team Notion and call their research done, I'll actually read it, chase down anything that I feel confused about, and keep digging until things start to add up and I have a consistent mental model of the topic (and know where the simplifications and unknowns are). Yes, sometimes I get lost following tangents, and the whole thing takes much longer than I feel it should, but then I don't get misled by the LLM.
I do the same with people, and sometimes they hate it, because my digging makes them feel like I don't trust them. Well, I don't - most people hallucinate way more than SOTA LLMs.
Still, the research I'm talking about would not be easy to do without LLMs, at least not for me. The models let me dig through things that would otherwise be overwhelming or too confusing for me, or not feasible in the time I have for it.
Own your understanding. That's my rule.
lazarus01
an hour ago
> I have an unusually high need to own the understanding of anything I'm learning
This is called deprivation sensitivity. It's different from intellectual curiosity: the former is a need to understand, while the latter is a need to know.
Deprivation sensitivity comes with anxiety and stress, whereas intellectual curiosity is associated with joyous exploration.
I score very high with deprivation sensitivity. I have unbridled drive to acquire and retain important information.
It's a blessing and a curse. An exhausting way to live. I love it, but sometimes I wish I was not neurodivergent.
seba_dos1
5 hours ago
> Thing is, I have an unusually high need to own the understanding of anything I'm learning.
Same here. Don't get me wrong, LLMs can be helpful, but what I mean is that they can at best aid my research rather than perform it for me. In my experience, relying on them to do that would usually be disastrous - but they do sometimes help in cases where I feel stuck and would otherwise have to find some human to ask.
I guess it's the difference between "using LLMs while thinking" and "using LLMs to do the thinking". The latter just does not work (unless all you're ever thinking about is trivial :P); the former can boost you up if you're smart about it. I don't think it's as big of a boost as many claim, and it's still far from being reliable, but it's there and it's non-negligible. It's just that being smart about it is non-optional, as otherwise you end up with slop and don't even realize it.
solumunus
9 hours ago
I’m surprised it’s only 1/3rd. 90% of my searches for information start at Perplexity or Claude at this point.
TeMPOraL
7 hours ago
Perplexity is too bulky for queries Kagi can handle[0], and I don't want to waste o3 quota[1] on trivial lookups.
--
[0] - Though I admit that almost all my Kagi searches end in "?" to trigger the AI answer, and in ~50% of the cases, I don't click on any result.
[1] - Which AFAIK still exists on the Plus plan, though I haven't hit it in ~two months.
wubrr
19 hours ago
> One thing I find frustrating is that management where I work has heard of 10x productivity gains. Some of those claims even come from early adopters at my work.
Similar situation at my work, but all of the productivity claims from internal early adopters I've seen so far are based on very narrow ways of measuring productivity, and very sketchy math, to put it mildly.
thunky
17 hours ago
> One thing I find frustrating is that management where I work has heard of 10x productivity gains.
That may also be in part because LLMs are not as big of an accelerant for junior devs as they are for seniors (juniors aren't as good at telling good output from bad).
So if you give 1 senior dev a souped-up LLM workflow, I wouldn't be too surprised if they are as productive as 10 pre-LLM juniors. Maybe even more, because a bad dev can actually produce negative productivity (stealing time from the senior), in which case it's infinityx.
Even a decent junior is mostly limited to doing the low-level grunt work, which LLMs can already do better.
Point is, I can see how jobs could be lost, legitimately.
Loughla
17 hours ago
The thing lost in all of this, though, is the pipeline of talent.
Precision machining is going through an absolute nightmare where the journeymen and master machinists are aging out of the workforce. These were people who originally learned on manual machines and upgraded to CNC over the years. The pipeline collapsed around 1997.
Now there are no apprentice machinists to replace the skills of the retiring workforce.
This will happen to software developers. Probably faster because they tend to be financially independent WAY sooner than machinists.
thunky
16 hours ago
> The thing lost in all of this, though, is the pipeline of talent.
Totally agree.
However, I think this pipeline has been taking a hit for a while already because juniors as a whole have been devaluing themselves: if we expect them to leave after one year, what's the point of hiring and training them? Only helping their next employer at that point.
Ferrus91
an hour ago
> However, I think this pipeline has been taking a hit for a while already because juniors as a whole have been devaluing themselves
I have seen the standards for junior devs in free fall as companies hired tons of bootcamp fodder over the last few years. I have lost count of the number of whinging junior devs who think SQL or regex is 'too hard' for their poor little brains. No wonder they are being replaced by a probabilistic magician's hat.
georgemcbay
14 hours ago
It's the employers who are responsible for the fact that almost everyone working in tech (across all skill levels) will have a far easier time advancing in both pay and title by jumping jobs often.
Very few companies put any real thought into meaningful retention but they are quick to complain about turnover.
thunky
5 hours ago
Yes I agree it works both ways. Employment is a transaction and both sides are trying to optimize outcomes in their own best interest. No blame.
The health of the job market is a big factor as well.
hobs
15 hours ago
That old canard? If you pay people in a way that incentivizes them to stay, they will. If you train people, treat them right, and pay them right, they won't leave. If they are leaving, try to fix one of those things; stop blaming the juniors for their massive collusion in a market where they are literally struggling to get jobs.
louthy
19 hours ago
> overall my productivity is 10-15% better. That is nothing to sneeze at, but it isn't 10x.
It is something to sneeze at if you are 10-15% more expensive to employ due to the cost of the LLM tools. The total cost of production should always be considered, not just throughput.
CharlesW
19 hours ago
> It is something to sneeze at if you are 10-15% more expensive to employ due to the cost of the LLM tools.
Claude Max is $200/month, or ~2% of the salary of an average software engineer.
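(Spelled out, as a rough sketch assuming gross monthly pay, since the exact salary figure isn't stated:)

    \[
      \frac{\$200/\text{month}}{0.02} = \$10{,}000/\text{month} \approx \$120\text{k/year}
    \]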
m4rtink
19 hours ago
Does anyone actually know what the real cost for the customers will be once the free AI money no longer floods those companies?
jppope
19 hours ago
Yeah, there was an analysis that came out on Hacker News the other day. Between low demand-side economics, virtually no impact on GDP, and corporate/VC subsidies going away soon, we're close to finding out. Sam Altman did convince SoftBank to do a $40B round though, so it might be another year or two. Current estimates are that it's cheaper than search to run, so it's probable that more search features will be swapped over. OpenAI hasn't dropped their ad platform yet though, so I'm interested to see how that goes.
wubrr
19 hours ago
I'm no LLM evangelist, far from it, but I expect models of similar quality to the current bleeding edge will be freely runnable on consumer hardware within 3 years. Future bleeding-edge models may well be more expensive than current ones, who knows.
TeMPOraL
10 hours ago
For the purpose of keeping the costs of LLM-dependent services down, you don't need to run bleeding-edge models on single consumer GPUs. Even if it takes a hundred GPUs, it still means people can start businesses around hosting those models, and compete with the large vendors.
ls612
13 hours ago
How do the best models that can run on say a single 4090 today compare to GPT 3.5?
electroglyph
12 hours ago
Qwen 2.5 32B, which is an older model at this point, clearly outperforms it:
https://llm-stats.com/models/compare/gpt-3.5-turbo-0125-vs-q...
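To fit it on a single ~24 GB card like the 4090, you'd typically load it 4-bit quantized. A rough sketch with Hugging Face transformers + bitsandbytes (the model ID is real, but the settings here are just illustrative, not what the benchmark above used):

    # Minimal sketch: load Qwen 2.5 32B Instruct in 4-bit on a single ~24 GB GPU.
    # Settings are illustrative; needs `pip install transformers accelerate bitsandbytes`.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "Qwen/Qwen2.5-32B-Instruct"

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # 4-bit weights so the model fits in VRAM
        bnb_4bit_quant_type="nf4",              # NF4 quantization
        bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/quality
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",                      # place layers on the available GPU
    )

    prompt = "Explain the trade-offs of 4-bit quantization in one paragraph."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=200)
    print(tokenizer.decode(out[0], skip_special_tokens=True))

Whether the 4-bit quant still beats GPT-3.5 on a given benchmark is a separate question.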
ls612
12 hours ago
Even when quantized down to 4 bits to fit on a 4090?
petra
12 hours ago
There's a potential for 100x+ lower cost of chips/energy for inference with compute-in-memory technology.
So they'll probably find a reasonable cost/value ratio.
TeMPOraL
11 hours ago
Too cheap to meter? Inference is cheap and there's no long-term or even mid-term moat here.
As long as the courts don't shut down Meta over IP issues with Llama training data, that is.
I can't stress that enough: "open source" models are what can stop the "real costs" for the customers from growing. Despite popular belief, inference isn't that expensive. This isn't Uber - the subsidies stopping isn't going to make LLMs infeasible; at worst, it's just going to make people pay API prices instead of subscription prices. As long as there are "open source" models that are legally available and track SOTA, anyone with access to some cloud GPUs can provide "SOTA of 6-12 months ago" for the price of inference, which puts a hard limit on how high OpenAI et al. can hike the prices.
But that's only as long as there are open models. If Meta loses and Llama goes away, the chilling effect will just let OpenAI, Microsoft, Anthropic and Google set whatever prices they want.
EDIT:
I mean Llama legally going away. Of course the cat is now out of the bag and Pandora's box has been opened; the weights are out there and you can't untrain or uninvent them. But keeping the commercial LLM offerings' prices down requires a steady supply of improved open models, and the ability for smaller companies to make a legal business out of hosting them.
bobbob27
11 hours ago
You can't just take cost of training out of the equation...
If these companies plan to stay afloat, they have to actually pay for the tens of billions they've spent at some point. That's what the parent comment meant by "free AI"
TeMPOraL
10 hours ago
Yes, you can - because of Llama.
Training is expensive, but it's not that expensive either. It takes just one of those super-rich players to pay the training costs and then release the weights to deny the other players a moat.
assuagering
6 hours ago
If your economic analysis depends on "one of those super-rich players to pay" for it to work, it isn't as much analysis as wishful thinking.
All the hundreds of billions of dollars put into the models so far were not donations. They either make it back to the investors or the show stops at some point.
And with a major chunk of proponents' arguments being "it will keep getting better", if you lose that, what have you got? "This thing can spit out boilerplate code, re-arrange documents and sometimes corrupt data silently and in hard-to-detect ways, but hey, you can run it locally and cheaply"?
TeMPOraL
5 hours ago
The economic analysis is not mine, and I thought it was pretty well known by now: Meta is not in the compute biz and doesn't want to be in it, so by releasing Llamas, it denies Google, Microsoft and Amazon the ability to build a moat around LLM inference. Commoditize your complement and all that. Meta wants to use LLMs, not sell access to them, so occasionally burning a billion dollars to train and give away an open-weight SOTA model is a good investment, because it directly and indirectly keeps inference cheap for everyone.
assuagering
5 hours ago
You understand that according to what you just said, economically the current SOTA is untenable?
Which, again, leads to a future where we're stuck with local models corrupting data about half the time.
TeMPOraL
4 hours ago
No, it just means that the big players have to keep advancing SOTA to make money; Llama lagging ~6 months behind just means there's only so much they can charge for access to the bleeding edge.
Short-term, it's a normal dynamics for a growing/evolving market. Long-term, the Sun will burn out and consume the Earth.
michaelbrave
6 hours ago
If Llama goes away, we would still get models from China that don't respect whatever laws shut down Llama; at least until China is on top, they will continue to undercut using open source models. Either way, open models will continue to exist.
adi_kurian
12 hours ago
Rapid progress in open source says otherwise.
selfhoster11
17 hours ago
In the US, maybe. Several times that by percentage in other places around the world.
GeoAtreides
11 hours ago
the average software engineer makes $10000 a month after taxes?!
votepaunchy
19 hours ago
> if you are 10-15% more expensive to employ due to the cost of the LLM tools
How is one spending anywhere close to 10% of total compensation on LLMs?
bravesoul2
19 hours ago
That's a good insight, because with perfect competition it means you need to share your old salary with an LLM!
coolKid721
18 hours ago
On my personal projects it's easily 10x faster, if not more in some circumstances. At work, where things are planned out months in advance and I'm working with 5 different teams to figure out the right way to do things for requirements that change 8 times during development? Even just the stuff with PR review and making sure other people understand it and can access it. I don't know, sometimes it's probably break-even, or that 10-15%. It just doesn't work well in some environments, and what really makes it flourish (having super high quality architectural planning/designs/standardized patterns etc.) is basically just not viable at anything but the smallest startups and solo projects.
Frankly, even just getting engineers to agree upon those super-specific standardized patterns is asking a ton, especially since lots of the things that help AI out are not what they are used to. As soon as you have stuff that starts deviating, it can confuse the AI and makes that 10x no longer accessible. Also, no one would want to review the PRs I'd make for the changes I do on my "10x" local project... Maintaining those standards is already hard enough on my side projects; AI will naturally deviate and create noise, and the challenge is constructing systems to guide it so that nothing deviates (since noise would lead to more noise).
I think it's mostly a rebalancing thing: if you have one or a couple of like-minded engineers who intend to do it, they can get that 10x. I do not see that EVER existing in any actual corporate environment, or even once you get more than like 4 people, tbh.
AI for middle management and project planning, on the other hand...
datpuz
19 hours ago
It's just another tech hype wave. Reality will be somewhere between total doom and boundless utopia. But probably neither of those.
The AI thing kind of reminds me of the big push to outsource software engineers in the early 2000's. There was a ton of hype among executives about it, and it all seemed plausible on paper. But most of those initiatives ended up being huge failures, and nearly all of those jobs came back to the US.
People tend to ignore a lot of the little things that glue it all together that software engineers do. AI lacks a lot of this. Foreigners don't necessarily lack it, but language barriers, time zone differences, cultural differences, and all sorts of other things led to similar issues. Code quality and maintainability took a nosedive and a lot of the stuff produced by those outsourced shops had to be thrown in the trash.
I can already see the AI slop accumulating in the codebases I work in. It's super hard to spot a lot of these things that manage to slip through code review, because they tend to look reasonable when you're looking at a diff. The problem is all the redundant code that you're not seeing, and the weird abstractions that make no sense at all when you look at it from a higher level.
2muchcoffeeman
19 hours ago
This was what I was saying to a friend the other day. I think anyone vaguely competent that is using LLMs will make the technology look far better than it is.
Management thinks the LLM is doing most of the work. Work is offshored. Oh, the quality sucks when someone without a clue is driving. We need to hire again.
mlinsey
19 hours ago
I don't disagree with your assessment of the world today, but just 12 months ago (before the current crop of base models and coding agents like Claude Code), even that 10X improvement of writing some-of-the-code wouldn't have been true.
majormajor
18 hours ago
> just 12 months ago (before the current crop of base models and coding agents like Claude Code), even that 10X improvement of writing some-of-the-code wouldn't have been true.
You had to paste more into your prompts back then to make the output work with the rest of your codebase, because there weren't good IDEs/"agents" for it, but you've been able to get really, really good code for 90% of "most" day-to-day SWE since at least OpenAI releasing the GPT-4 API, which was a couple years ago.
Today it's a lot easier to demo low-effort "make a whole new feature or prototype" things than doing the work to make the right API calls back then, but most day to day work isn't "one shot a new prototype web app" and probably won't ever be.
I'm personally more productive than 1 or 2 years ago because back then building the prompts was slower than my personal rate of writing code for a lot of things in my domain, but hardly 10x. It usually one-shots stuff wrong, and then there's a good chance that it'll take longer to chase down the errors than it would've taken to just write the thing - or only use it as "better autocomplete" - in the first place.
timr
19 hours ago
> I don't disagree with your assessment of the world today, but just 12 months ago (before the current crop of base models and coding agents like Claude Code), even that 10X improvement of writing some-of-the-code wouldn't have been true.
So? It sounds like you're prodding us to make an extrapolation fallacy (I don't even grant the "10x in 12 months" point, but let's just accept the premise for the sake of argument).
Honestly, 12 months ago the base models weren't substantially worse than they are right now. Some people will argue with me endlessly on this point, and maybe they're a bit better on the margin, but I think it's pretty much true. When I look at the improvements of the last year with a cold, rational eye, they've been in two major areas:
* cost & efficiency
* UI & integration
So how do we improve from here? Cost & efficiency are the obvious lever with historical precedent: GPUs kinda suck for inference, and costs are (currently) rapidly dropping. But maybe this won't continue -- algorithmic complexity is what it is, and barring some revolutionary change in the architecture, LLMs are exponential algorithms.
UI and integration is where most of the rest of the recent improvement has come from, and honestly, this is pretty close to saturation. All of the various AI products already look the same, and I'm certain that they'll continue to converge to a well-accepted local maximum. After that, huge gains in productivity from UX alone will not be possible. This will happen quickly -- probably in the next year or two.
Basically, unless we see a Moore's law of GPUs, I wouldn't bet on indefinite exponential improvement in AI. My bet is that, from here out, this looks like the adoption curve of any prior technology shift (e.g. mainframe -> PC, PC -> laptop, mobile, etc.) where there's a big boom, then a long, slow adoption for the masses.
mlinsey
18 hours ago
12 months ago, we had no reasoning models and even very basic arithmetic was outside of the models' grasp. Coding assistants mostly worked on the level of tab-completing individual functions, but now I can one-shot demo-able prototypes (albeit nothing production-ready) of webapps. I assume you consider the latter "integration", but I think coding is so key to how the base models are being trained that this is due to base model improvements too. This is testable - it would be interesting to get something like Claude Code running on top of a year-old open source model and see how it does.
If you're going to call all of that not substantial improvement, we'll have to agree to disagree. Certainly it's the most rapid rate of improvement of any tech I've personally seen since I started programming in the early '00s.
timr
18 hours ago
I consider the reasoning models to be primarily a development of efficiency/cost, and I thought the first one was about a year ago, but sure, ok. I don't think it changes the argument I'm making. The LLM ouroboros / robot centipede has been done, and is not itself a path towards exponential improvement.
To be quite honest, I’ve found very little marginal value in using reasoning models for coding. Tool usage, sure, but I almost never use “reasoning” beyond that.
Also, LLMs still cannot do basic math. They can solve math exams, sure, but you can’t trust them to do a calculation in the middle of a task.
TeMPOraL
6 hours ago
> but you can’t trust them to do a calculation in the middle of a task.
You can't trust a person either. Calculating is its own mode of thinking; if you don't pause and context switch, you're going to get it wrong. Same is the case with LLMs.
Tool usage, reasoning, and the "agentic approach" are all in part ways of allowing the LLM to do the context switch required, instead of taking the math challenge as it goes and blowing it.
timr
3 hours ago
The proper comparison is not a human, it’s a computer. Or even a human with a computer.
But my point wasn’t to judge LLMs on their (in)ability to do math - I was only responding to the parent comment’s assertion that they’ve gotten better in this area.
It’s worth noting that all of the major models still randomly decide to ignore schemas and tool calls, so even that is not a guarantee.
jorl17
17 hours ago
12 months ago, if I fed a list of ~800 poems with about ~250k tokens to an LLM and asked it to summarize this huge collection, they would be completely blind to some poems and were prone to hallucinating not simply verses but full-blown poems. I was testing this with every available model out there that could accept 250k tokens. It just wouldn't work. I also experimented with a subset that was at around ~100k tokens to try other models and results were also pretty terrible. Completely unreliable and nothing it said could be trusted.
Then Gemini 2.5 pro (the first one) came along and suddenly this was no longer the case. Nothing hallucinated, incredible pattern finding within the poems, identification of different "poetic stages", and many other rather unbelievable things — at least to me.
After that, I realized I could start sending in more of those "hard to track down" bugs to Gemini 2.5 pro than other models. It was actually starting to solve them reliably, whereas before it was mostly me doing the solving and models mostly helped if the bug didn't occur as a consequence of very complex interactions spread over multiple methods. It's not like I say "this is broken, fix it" very often! Usually I include my ideas for where the problem might be. But Gemini 2.5 pro just knows how to use these ideas better.
I have also experimented with LLMs consuming conversations, screenshots, and all kinds of ad-hoc documentation (e-mails, summaries, chat logs, etc) to produce accurate PRDs and even full-on development estimates. The first one that actually started to give good results (as in: it is now a part of my process) was, you guessed it, Gemini 2.5 pro. I'll admit I haven't tried o3 or o4-mini-high too much on this, but that's because they're SLOOOOOOOOW. And, when I did try, o4-mini-high was inferior and o3 felt somewhat closer to 2.5 pro, though, like I said, much much slower and...how do I put this....rude ("colder")?
All this to say: while I agree that perhaps the models don't feel like they're particularly better at some tasks which involve coding, I think 2.5 pro has represented a monumental step forward, not just in coding, but definitely overall (the poetry example, to this day, still completely blows my mind. It is still so good it's unbelievable).
airstrike
16 hours ago
Your comment warrants a longer, more insightful reply than I can provide, but I still feel compelled to say that I get the same feeling from o3. Colder, somewhat robotic and unhelpful. It's like the extreme opposite of 4o, and I like neither.
My weapon of choice these days is Claude 4 Opus, but it's slow, expensive and still not massively better than good old 3.5 Sonnet.
jorl17
15 hours ago
Exactly! Here's my take:
4o tends to be, as they say, sycophantic. It's an AI masking as a helpful human, a personal assistant, a therapist, a friend, a fan, or someone on the other end of a support call. They sometimes embellish things, and will sometimes take a longer way getting to the destination if it makes for what may be a more enjoyable conversation — they make conversations feel somewhat human.
OpenAI's reasoning models, though, feel more like an AI masking as a code slave. It is not meant to embellish, to beat around the bush or even to be nice. Its job is to give you the damn answer.
This is why the o* models are terrible for creative writing, for "therapy" or pretty much anything that isn't solving logical problems. They are built for problem solving, coding, breaking down tasks, getting to the "end" of it. You present them a problem you need solved and they give you the solution, sometimes even omitting the intermediate steps because that's not what you asked for. (Note that I don't get this same vibe from 2.5 at all)
Ultimately, it's this "no-bullshit" approach that feels incredibly cold. It often won't even offer alternative suggestions, and it certainly doesn't bother about feelings because feelings don't really matter when solving problems. You may often hear 4o say it's "sorry to hear" about something going wrong in your life, whereas o* models have a much higher threshold for deciding that maybe they ought to act like a feeling machine, rather than a solving machine.
I think this is likely pretty deliberate of OpenAI. They must for some reason believe that if the model is more concise in its final answers (though not necessarily in the reasoning process, which we can't really see), then it produces better results. Or perhaps they lose less money on it, I don't know.
Claude is usually my go-to model if I want to "feel" like I'm talking to more of a human, one capable of empathy. 2.5 pro has been closing the gap, though. Also, Claude used to be by far better than all other models at European Portuguese (+ Portuguese culture and references in general), but, again, 2.5 pro seems just as good nowadays.
On another note, this is also why I completely understand OpenAI's need for the two kinds of models. 4o is the model I'll use to review an e-mail, because it won't just try to remove all the humanity from it and make it the most succinct, bland, "objective" thing — which is what the o* models would do.
In other words, I think: (i) o* models are supposed to be tools, and (ii) 4o-like models are supposed to be "human".
troupo
12 hours ago
> 12 months ago, if I fed a list of ~800 poems with about ~250k tokens to an LLM and asked it to summarize this huge collection, they would be completely blind to some poems and were prone to hallucinating not simply verses but full-blown poems.
For the past week, Claude Code has been routinely ignoring CLAUDE.md and every single instruction in it. I have to manually prompt it every time.
As I was vibe coding the notes MCP mentioned in the article [1], I was also testing it with Claude. At one point it just forgot that MCPs exist. It was literally this:
> add note to mcp
Calling mcp:add_note_to_project
> add note to mcp
Running find mcp.ex
... Interrupted by user ...
> add note to mcp
Running <convoluted code generation command with mcp in it>
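(For context, the "add note" tool it keeps forgetting about is nothing exotic — just a plain MCP tool. A rough sketch of the same idea using the Python MCP SDK's FastMCP, with hypothetical tool and argument names; the one in the article looks to be Elixir, going by mcp.ex:)

    # Rough sketch of a minimal "notes" MCP server using the Python MCP SDK (FastMCP).
    # Tool and argument names are hypothetical; the server discussed above is not this one.
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("notes")

    # naive in-memory store: project name -> list of notes
    notes: dict[str, list[str]] = {}

    @mcp.tool()
    def add_note_to_project(project: str, note: str) -> str:
        """Append a note to the named project and confirm."""
        notes.setdefault(project, []).append(note)
        return f"Added note to {project} ({len(notes[project])} notes total)."

    @mcp.tool()
    def list_notes(project: str) -> list[str]:
        """Return all notes stored for the named project."""
        return notes.get(project, [])

    if __name__ == "__main__":
        mcp.run()  # stdio transport, so a client like Claude Code can call the tools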
We have no objective way of measuring performance and behavior of LLMs.
ssk42
18 hours ago
What exactly are you basing any of your assertions off of?
timr
18 hours ago
The same sort of rigorous analysis that the parent comment used (that’s a joke, btw).
But seriously: If you find yourself agreeing with one and not the other because of sourcing, check your biases.
__loam
19 hours ago
It still isn't.
ericmcer
15 hours ago
It's great when they use AI to write a small app “without coding at all” over the weekend and then come in on Monday to brag about it and act baffled that tasks take engineers any time at all.
doug_durham
17 hours ago
How much of the communication and meetings exist because traditionally code was very expensive and slow to create? How many of those meetings might be streamlined or disappear entirely in the future? In my experience there is a lot of process around making sure that software is on track and that it's doing what it is supposed to do. I think the software lifecycle is about to be reinvented.
jppope
19 hours ago
The reports from analysis of open source projects are that it's something in the range of 10-15% productivity gains... so it sounds like you're spot on.
smcleod
19 hours ago
That's about right for copilots. It's much higher for agentic coding.
estomagordo
19 hours ago
[citation needed]
swader999
17 hours ago
Agentic coding has really only taken off in the last few weeks due to better pricing.
deadbabe
19 hours ago
Wait till they hear about the productivity gains from using vim/neovim.
Your developers still push a mouse around to get work done? Fire them.
ghuntley
19 hours ago
Canva has seen a 30% productivity uplift - https://fortune.com/2025/06/25/canva-cto-encourages-all-5000...
AI is the new uplift. Embrace and adapt: a rift is forming in what employers seek in terms of skills from employees (see my talk at https://ghuntley.com/six-month-recap/).
I'm happy to answer any questions folks may have. Currently AFK [2] vibecoding a brand new programming language [1].
[1] https://x.com/GeoffreyHuntley/status/1940964118565212606 [2] https://youtu.be/e7i4JEi_8sk?t=29722
ofjcihen
14 hours ago
There’s something hilariously Portlandia about making outlandish claims with complete confidence and then plugging your own talk.
ghuntley
9 hours ago
There are citations to the facts in the links.
CuriouslyC
18 hours ago
And that's with 50% adoption and probably a broad distribution of tool use skill.
eviks
11 hours ago
> The productivity for software engineers is at around 30%
That would be a 70% descent?
abletonlive
18 hours ago
I’m a tech lead and I have maybe 5x the output now compared to everybody else under me, quantified by scoring tickets at a team level. I also have more responsibilities outside of IC work compared to the people under me.
At this point I’m asking my manager to fire people that still think LLMs are just toys, because I’m tired of working with people with this poor mindset. A pragmatic engineer continually reevaluates what they think they know. We are at a tipping point now. I’m done arguing with people that have a poor model of reality. The rest of us are trying to compete and get shit done. This isn’t an opinion or a game. It’s business, with real-life consequences if you fall behind.
I’ve offered to share my workflows, prompts, setup. Guess how many of these engineers have taken me up on my offer: 1-2, and the juniors or the ones that are very far behind have not.
ofjcihen
14 hours ago
It’s funny. We fired someone with this attitude Thursday. And by this attitude I mean yours.
Not necessarily because of their attitude, but because it turns out the software they were shipping was rife with security issues. Security managed to quickly detect and handle the resulting incident. I can’t say his team were sad to see him go.
blibble
18 hours ago
> I’m done arguing with people that have a poor model of reality.
isn't this the entire LLM experience?
nasduia
16 hours ago
A new copypasta is born.
abletonlive
15 hours ago
Go back to reddit
swader999
17 hours ago
"I’ve offered to share my workflows, prompts" That should all be checked in.
abletonlive
17 hours ago
It’s checked in; they have just written off LLMs.
dgfitz
18 hours ago
I will thank God every day I don’t work with you or for you. How toxic.
abletonlive
18 hours ago
I’m glad I don’t have to work with you too, lol.
It’s not toxic for me to expect someone to get their work done in a reasonable amount of time with the tools available to them. If you’re an accountant and you take 5x the time to do something because you have beef with Excel, you’re the problem. It’s not toxicity to tell you that you are a bad accountant.
flextheruler
18 hours ago
You believe the cost of firing and rehiring to be cheaper than simple empirical persuasion?
You don't sound like a great lead to me, but I suppose you could be working with absolutely incompetent individuals, or perhaps your soft skills need work.
My apologies but I see only two possibilities for others not to take the time to follow your example given such strong evidence. They either actively dislike you or are totally incompetent. I find the former more often true than the latter.
abletonlive
17 hours ago
You have about 50% of HN thinking LLMs are useless and you’re commenting on an article about how it’s still magical and wishful thinking, and that this is crypto all over again. But sure, the problem is me, not the people with a poor model of reality
troupo
12 hours ago
> You have about 50% of HN thinking LLMs are useless and you’re commenting on an article about how it’s still magical and wishful thinking,
Perhaps you should try reading the article again (or maybe let some LLM summarize it for you)
> But sure, the problem is me, not the people with a poor model of reality
It's amazing how you almost literally use crypto-talk.
flextheruler
18 hours ago
You believe the cost of firing and rehiring to be cheaper than simple empirical persuasion?
My apologies but that does not sound like good leadership to me. It actually sounds like you may have deficiencies in your skills as it relates to leadership. Perhaps in a few years we will have an LLM who can provide better leadership.
gabrieledarrigo
17 hours ago
Dude, if you are a tech lead, and you measure productivity by scoring tickets, you are doing it pretty badly. I would fire you instead.
Applejinx
18 hours ago
Are you the one at Ableton responsible for it ignoring the renaming of parameter names during the setState part of a Live program? Some of us are already jumping through ridiculous hoops to cover for your… mindset. There's stuff coming up that used to work and doesn't now, like in Live 12. From your response I would guess this is a trend that will hold.
We should not be having to code special 'host is Ableton Live' cases in JUCE just to get your host to work like the others.
Can you please not fire any people who are still holding your operation together?
omnimus
9 hours ago
Why do you think this person works at Ableton? From their comments it doesn't seem that they would be a fit for a small, cool Berlin company making tools for techno.
mattmanser
18 hours ago
You've been doing the big "I am" about LLMs on HN in most of your recent comments.
Everyone else who raises any doubts about LLMs is an idiot and you're 10,000x better than everyone else and all your co-workers should be fired.
But what's absent from all your comments is what you make. Can you tell us what you actually do in your >500k job?
Are you, by any chance, a front-end developer?
Also, a team-lead that can't fire their subordinates isn't a team-lead, they're a number two.
solumunus
9 hours ago
You seem completely insufferable and incredibly cringeworthy.