hackernews client

Claude is good at assembling blocks, but still falls apart at creating them

315 pointsposted 24 days ago

248 Comments

woeirua

22 days ago

It's just amazing to me how fast the goal posts are moving. Four years ago, if you had told someone that a LLM would be able to one-shot either of those first two tasks they would've said you're crazy. The tech is moving so fast. I slept on Opus 4.5 because GPT 5 was kind of an air ball, and just started using it in the past few weeks. It's so good. Way better than almost anything that's come before it. It can one-shot tasks that we never would've considered possible before.

skue

22 days ago

> Four years ago, if you had told someone that a LLM would be able to one-shot either of those first two tasks they would've said you're crazy.

Four years ago, they would have likely asked what in the world is an LLM? ChatGPT is barely 3 years old.

enraged_camel

22 days ago

It literally saved my small startup six-figures and months of work. I've written about it extensively and posted it (it's in my submissions).

ranyume

22 days ago

There are certain things/llm-phenomena that haven't changed since their introduction.

Madmallard

22 days ago

Idk I was using chat gpt 3.5 to do stuff and it was pretty helpful then

utopiah

22 days ago

> The tech is moving so fast.

Well that's exactly the problem : how can one say that?

The entire process of evaluating what "it" actually does has been a problem from the start. Input text, output text ... OK but what if the training data includes the evaluation? This was ridiculous few years ago but then the scale went from some curated text datasets to... most of the Web as text, to most of the Web as text including transcription from videos, to most of the Web plus some non public databases, to all that PLUS (and that's just cheating) tests that were supposed to be designed to NOT be present elsewhere.

So again, that's the crux of the problem, WHAT does it actually do? Is it "just" search? Is it semantic search with search and replace, is it that plus evaluation that it runs?

Sure the scaffolding becomes bigger, the available dataset becomes larger, the compute available keeps on increasing but it STILL does not answer the fundamental question, namely what is being done. The assumption here is because the output text does solve the question ask, then "it" works, it "solved" the problem. The problem is that by definition the entire setup has been made in order to look as plausible as possible. So it's not luck that it initially appears realistic. It's not luck that it can thus pass some dedicated benchmark, but it is also NOT solving the problem.

So yes sure the "tech" is moving "so fast" but we still can't agree on what it does, we keep on having no good benchmarks, we keep on having that jagged frontier https://www.hbs.edu/faculty/Pages/item.aspx?num=64700 that makes it so challenging to make more meaningful statement than "moving so fast" which sounds like marketing claims.

computerex

22 days ago

You know LLM's have been used to solve very hard previously unsolved math problems like some of the Erdos problems?

patagurbon

22 days ago

That Erdos problem solution is believed by quite a few to be a previous result found in the literature, just used in a slightly different way. It also seems not a lack of progress but simply no one cared to give it a go.

That’s a really fantastic capability, but not super surprising.

bakkoting

22 days ago

You're thinking of a previous report from a month ago, #897 or #481, or the one from two weeks ago, #728. There's a new one from a week ago, #205, which is genuinely novel, although it is still a relatively "shallow" result.

Terence Tao maintains a list [1] of AI attempts (successful and otherwise). #205 is currently the only success in section 1, the "full solution for which subsequent literature review did not find new relevant prior partial or full solutions" section - but it is in that section.

As to speed, as far as I know the recent results are all due to GPT 5.2, which is barely a month old, or Aristotle, which is a system built on top of some frontier LLMs and which has only been accessible to the public for a month or two. I have seen multiple mathematicians report that GPT-5.2 is a major improvement in proof-writing, e.g. [2]

[1] https://github.com/teorth/erdosproblems/wiki/AI-contribution...

[2] https://x.com/AcerFur/status/1999314476320063546

utopiah

21 days ago

Thanks for the wiki link, very interesting, in particular

- the long tail aspect of the problem space ; 'a "long tail" of under-explored problems at the other, many of which are "low hanging fruit" that are very suitable for being attacked by current AI tools'

- the expertise requirement, literature review but also 'Do I understand what the key ideas of the solution are, and how the hypotheses are utilized to reach the conclusion?' so basically one must already be an expert (or able to become one) to actually use this kind of tooling

and finally the outcomes which taking into consider the previous 2 points makes it very different from what most people would assume as "AI contributions".

utopiah

22 days ago

I do, and I read Tao's comments on his usage too, that still doesn't address what I wrote.

computerex

22 days ago

How does it not address what you wrote?

utopiah

21 days ago

If I understood correctly you are giving an example of a "success" of using the technology. So that's addressing that the technology is useful or not, powerful or not, but it does not address what it actually does (maybe somebody in ChatGPT is a gnome that solved it, I'm just being provocative here to make the point) or more important that it does something it couldn't do a year ago or 5 years ago because how it is doing something new.

For example if somebody had used GPT2 with the input dataset of GPT5.2 (assuming that's the one used for Erdos problems) rather than the input dataset it had then, could it have solved those same problems? Without doing such tests it's hard to say if it moved fast, or at all. It's not because something new has been solved by it that it's new. Yes it's a reasonable assumption, but it's just that. So going for that to assuming "it" is "moving fast" is just a belief IMHO.

utopiah

21 days ago

Also something that makes the whole process very hard to verify is what I tried to address in a much older comment : whenever LLMs are used (regardless of the input dataset) by someone who is an expert in the domain (rather than an novice) how can one evaluate what's been done by whom or what? Sure again there can be a positive result, e.g a solution to a problem until now unsolved, what does it say about the tool itself versus a user who is, by definition if they are an expert, up to date on the state of thew art?

utopiah

21 days ago

Also the very fact that https://github.com/teorth/erdosproblems/wiki/AI-contribution... exist totally change the landscape. Because it's public it's safe to assume it's part of the input dataset so from now on, how does one evaluate the pace of progress, in particular for non open source models?

simonw

23 days ago

I'm not entirely convinced by the anecdote here where Claude wrote "bad" React code:

> But in context, this was obviously insane. I knew that key and id came from the same upstream source. So the correct solution was to have the upstream source also pass id to the code that had key, to let it do a fast lookup.

I've seen Claude make mistakes like that too, but then the moment you say "you can modify the calling code as well" or even ask "any way we could do this better?" it suggests the optimal solution.

My guess is that Claude is trained to bias towards making minimal edits to solve problems. This is a desirable property, because six months ago a common complaint about LLMs is that you'd ask for a small change and they would rewrite dozens of additional lines of code.

I expect that adding a CLAUDE.md rule saying "always look for more efficient implementations that might involve larger changes and propose those to the user for their confirmation if appropriate" might solve the author's complaint here.

bblcla

23 days ago

(Author here)

> I'm not entirely convinced by the anecdote here where Claude wrote "bad" React code

Yeah, that's fair - a friend of mine also called this out on Twitter (https://x.com/konstiwohlwend/status/2010799158261936281) and I went into more technical detail about the specific problem there.

> I've seen Claude make mistakes like that too, but then the moment you say "you can modify the calling code as well" or even ask "any way we could do this better?" it suggests the optimal solution.

I agree, but I think I'm less optimistic than you that Claude will be able to catch its own mistakes in the future. On the other hand, I can definitely see how a ~more intelligent model might be able to catch mistakes on a larger and larger scale.

> I expect that adding a CLAUDE.md rule saying "always look for more efficient implementations that might involve larger changes and propose those to the user for their confirmation if appropriate" might solve the author's complaint here.

I'm not sure about this! There are a few things Claude does that seem unfixable even by updating CLAUDE.md.

Some other footguns I keep seeing in Python and constantly have to fix despite CLAUDE.md instructions are:

- writing lots of nested if clauses instead of writing simple functions by returning early

- putting imports in functions instead of at the top-level

- swallowing exceptions instead of raising (constantly a huge problem)

These are small, but I think it's informative of what the models can do that even Opus 4.5 still fails at these simple tasks.

ako

23 days ago

> I agree, but I think I'm less optimistic than you that Claude will be able to catch its own mistakes in the future. On the other hand, I can definitely see how a ~more intelligent model might be able to catch mistakes on a larger and larger scale.

Claude already does this. Yesterday i asked it why some functionality was slow, it did some research, and then came back with all the right performance numbers, how often certain code was called, and opportunities to cache results to speed up execution. It refactored the code, ran performance tests, and reported the performance improvements.

ekidd

23 days ago

I have been reading through this thread, and my first reaction to many of the comments was "Skill issue."

Yes, it can build things that have never existed before. Yes, it can review its own code. Yes, it can do X, Y and Z.

Does it do all these things spontaneously with no structure? No, it doesn't. Are there tricks to getting it do some of these things? Yup. If you want code review, start by writing a code review "skill". Have that skill ask Opus to fork off several subagents to review different aspects, and then synthesize the reports, with issues broken down by Critical, Major and Minor. Have the skill describe all the things you want from a review.

There are, as the OP pointed out, a lot of reasons why you can't run it with no human at all. But with an experienced human nudging it? It can do a lot.

ako

23 days ago

It's basically not very different from working with an average development team as a product owner/manager: you need to feed it specific requirements or it will hallucinate some requirements, bugs are expected, even with unit test and testers on the team. And yes, as a product owner you also make mistakes, never have all the requirements up front, but the nice thing working with a GenAI coder is that you can iterate over these requirement gaps, hallucinated requirements and bugs in minutes, not in days.

chapel

23 days ago

Those Python issues are things I had to deal with earlier last year with Claude Sonnet 3.7, 4.0, and to a lesser extent Opus 4.0 when it was available in Claude Code.

In the Python projects I've been using Opus 4.5 with, it hasn't been showing those issues as often, but then again the projects are throwaway and I cared more about the output than the code itself.

The nice thing about these agentic tools is that if you setup feedback loops for them, they tend to fix issues that are brought up. So much of what you bring up can be caught by linting.

The biggest unlock for me with these tools is not letting the context get bloated, not using compaction, and focusing on small chunks of work and clearing the context before working on something else.

bblcla

23 days ago

Arguably linting is a kind of abstraction block!

pluralmonad

23 days ago

I wonder if this is specific to Python. I've had no trouble like that with Claude generating Elixir. Claude sticks to the existing styles and paradigms quite well. Can see in the thinking traces that Claude takes this into consideration.

doug_durham

23 days ago

That's where you come in as an experienced developer. You point out the issues and iterate. That's the normal flow of working with these tools.

bblcla

23 days ago

I agree! Like I said at the end of the tool, I think Claude is a great tool. In this piece, I'm arguing against the 'AGI' believers who think it's going to replace all developers.

Kuinox

23 days ago

> My guess is that Claude is trained to bias towards making minimal edits to solve problems.

I don't have the same feeling. I find that claude tends to produce wayyyyy too much code to solve a problem, compared to other LLMs.

ehnto

22 days ago

That has been my impression too, it takes particular guidance to get it to write concise code without too much architecture airmanship.

joshribakoff

23 days ago

I expect that adding instructions that attempt to undo training produces worse results than not including the overbroad generalization in the training in the first place. I think the author isn’t making a complaint they’re documenting a tradeoff.

AIorNot

23 days ago

Well yes but the wider point is that it takes new Human skills to manage them - like a pair of horses so to speak under your bridle

When it comes down to it these AI tools are like going to power tools or machines from the artisanal era

- like going from surgical knife to a machine gun- so they operate at a faster pace without comprehending like humans - and without allowing humans time to comprehend all side effects and massive assumptions they make on every run in their context window

humans have to adapt to managing them correctly and at the right scale to be effective and that becomes something you learn

threethirtytwo

23 days ago

Definitely, The training parameters encourage this. The AI is actually deliberately also trying to trick you and we know for that for a fact.

Problems with solutions too complicated to explain or to output in one sitting are out of the question. The AI will still bias towards one shot solutions if given one of these problems because all the training is biased towards a short solution.

It's not really practical to give it training data with multi step ultra complicated solutions. Think about it. The thousands of questions given to it for reinforcement.... the trainer is going to be trying to knock those out as efficiently as possible so they have to be readable problems with shorter readable solutions. So we know AI biases towards shorter readable solutions.

Second, Any solution that tricks the reader will pass training. There is for sure a subset of questions/solution pairs that meet this criteria by definition because WE as trainers simply are unaware we are being tricked. So this data leaks into the training and as a result AI will bias towards deception as well.

So all in all it is trained to trick you and give you the best solution that can fit into a context that is readable in one sitting.

In theory we can get it to do what we want only if we had perfect reinforcement data. The reliability we're looking for seems to be just right over this hump.

maxilevi

23 days ago

LLMs are just really good search. Ask it to create something and it's searching within the pretrained weights. Ask it to find something and it's semantically searching within your codebase. Ask it to modify something and it will do both. Once you understand its just search, you can get really good results.

fennecbutt

23 days ago

I agree somewhat, but more when it comes to its use of logic - it only gleans logic from human language which as we know is a fucking mess.

I've commented before on my belief that the majority of human activity is derivative. If you ask someone to think of a new kind of animal, alien or random object they will always base it off things that they have seen before. Truly original thoughts and things in this world are an absolute rarity and the majority of supposed original thought riffs on what we see others make, and those people look to nature and the natural world for inspiration.

We're very good at taking thing a and thing b and slapping them together and announcing we've made something new. Someone please reply with a wholly original concept. I had the same issue recently when trying to build a magic based physics system for a game I was thinking of prototyping.

andy99

23 days ago

  it only gleans logic from human language

This isn’t really true, at least how I interpret the statement, little if any of the “logic” or appearance of such is learned from language. It’s trained in with reinforcement learning as pattern recognition.

Point being it’s deliberate training, not just some emergent property of language modeling. Not sure if the above post meant this, but it does seem a common misconception.

onemoresoop

23 days ago

LLMs lack agency in the sense that they have no goals, preferences, or commitments. Humans do, even when our ideas are derivative. We can decide that this is the right choice and move forward, subjectively and imperfectly. That capacity to commit under uncertainty is part of what agency actually is.

MrOrelliOReilly

23 days ago

But they do have utility functions, which one can interpret as nearly equivalent

bhadass

23 days ago

better mental model: it's a lossy compression of human knowledge that can decompress and recombine in novel (sometimes useful, sometimes sloppy) ways.

classical search simply retrieves, llms can synthesize as well.

TeMPOraL

23 days ago

Corporate wants you to find the difference...

Point being, in broad enough scope, search and compression and learning are the same thing. Learning can be phrased as efficient compression of input knowledge. Compression can be phrased as search through space of possible representation structures. And search through space of possible X for x such that F(x) is minimized, is a way to represent any optimization problem.

RhythmFox

23 days ago

This isn't strictly better to me. It captures some intuitions about how a neural network ends up encoding its inputs over time in a 'lossy' way (doesn't store previous input states in an explicit form). Maybe saying 'probabilistic compression/decompression' makes it a bit more accurate? I do not really think it connects to your 'synthesize' claim at the very end to call it compression/decompression, but I am curious if you had a specific reason to use the term.

XenophileJKO

23 days ago

It's really way more interesting that that.

The act of compression builds up behaviors/concepts of greater and greater abstraction. Another way you could think about it is that the model learns to extract commonality, hence the compression. What this means is because it is learning higher level abstractions AND the relationships between these higher level abstractions, it can ABSOLUTELY learn to infer or apply things way outside their training distribution.

bhadass

22 days ago

ya, exactly... i'd also say that when you compress large amounts of content into weights and then decompress via a novel prompt, you're also forcing interpolation between learned abstractions that may never have cooccurred in training.

that interpolation is where synthesis happens. whether it is coherent or not depends.

nextaccountic

22 days ago

Maybe the base model is just a compression of the training data?

There is also a RLHF training step on top of that

bhadass

22 days ago

yep the base model is the compression, but RLHF (and other types of post training) doesn't really change this picture, it's still working within that same compressed knowledge.

nathan lambert (who wrote the RLHF book @ https://rlhfbook.com/ ) describes this as the "elicitation theory of post training", the idea is that RLHF is extracting and reshaping what's already latent in the base model, not adding new knowledge. as he puts it: when you use preferences to change model behavior "it doesn't mean that the model believes these things. it's just trained to prioritize these things."

so like when you RLHF a model to not give virus production info, you're not necessarily erasing those weights, the theory is that you're just making it harder for that information to surface. the knowledge is still in the compression, RLHF just changes what gets prioritized during decompression.

andy99

23 days ago

No, this describes the common understanding of LLMs and adds little to just calling it AI. The search is the more accurate model when considering their actual capabilities and understanding weaknesses. “Lossy compression of human knowledge” is marketing.

XenophileJKO

23 days ago

It is fundamentally and provably different than search because it captures things on two dimensions that can be used combinatorially to infer desired behavior for unobserved examples.

1. Conceptual Distillation - Proven by research work that we can find weights that capture/influence outputs that align with higher level concepts.

2. Conceptual Relations - The internal relationships capture how these concepts are related to each other.

This is how the model can perform acts and infer information way outside of it's training data. Because if the details map to concepts then the conceptual relations can be used to infer desirable output.

(The conceptual distillation also appears to include meta-cognitive behavior, as evidenced by Anthropic's research. Which manes sense to me, what is the most efficient way to be able to replicate irony and humor for an arbitrary subject? Compressing some spectrum of meta-cognitive behavior...)

kylecazar

23 days ago

Aren't the conceptual relations you describe still, at their core, just search (even if that's extremely reductive)? We know models can interpolate well, but it's still the same probabilistic pattern matching. They identify conceptual relationships based on associations seen in vast training data. It's my understanding that models are still not at all good at extrapolation, handling data "way outside" of their training set.

Also, I was under the impression LLM's can replicate irony and humor simply because that text has specific stylistic properties, and they've been trained on it.

XenophileJKO

22 days ago

I don't know honestly, I think really the only big hole the current models have is if you have tokens that never get exposed enough to have a good learned embedding value. Those can blow the system out of the water because they cause activation problems in the low layers.

Other than that the model should be able to learn in context for most things based on the component concepts. Similar to how you learn in context.

There aren't a lot of limits in my experience. Rarely you'll hit patterns that are too powerful where it is hard for context to alter behavior, but those are pretty rare.

The models can mix and match concepts quite deeply. Certainly, if it is a completely novel concept that can't be described by a union or subtraction between similar concepts, than the model probably wouldn't handle it. In practice, a completely isolated concept is pretty rare.

DebtDeflation

23 days ago

Information Retrieval followed by Summarization is how I view it.

andrei_says_

23 days ago

“Novel” to the person who has not consumed the training data. Otherwise, just training data combined in highly probable ways.

Not quite autocomplete but not intelligence either.

pc86

23 days ago

What is the difference between "novel" and "novel to someone who hasn't consumed the entire corpus of training data, which is several orders of magnitude greater than any human being could consume?"

adrian_b

23 days ago

The difference is that when you do not know how a problem can be solved, but you know that this kind of problem has been solved countless times earlier by various programmers, you know that it is likely that if you ask an AI coding assistant to provide a solution, you will get an acceptable solution.

On the other hand, if the problem you have to solve has never been solved before at a quality satisfactory for your purpose, then it is futile to ask an AI coding assistant to provide a solution, because it is pretty certain that the proposed solution will be unacceptable (unless the AI succeeds to duplicate the performance of a monkey that would type a Shakespearean text by typing randomly).

szundi

23 days ago

[dead]

godelski

23 days ago

Are you reviewer 2?

Joking aside, I think you have too strict of a definition of novel. Unfortunately "novel" is a pretty vague word and is definitely not a binary one.

ALL models can produce "novel" data. I don't just mean ML (AI) models, but any mathematical model. The point of models is to make predictions about results that aren't in the training data. Doing interpolation between two datapoints does produce "novel" things. Thinking about the parent's comment, is "a blue tiger" novel? Probably? Are there any blue tigers in the training data? (there definitely is now thanks to K-Pop Demon Hunters) If not, then producing that fits the definition of novel. BUT I also agree that that result is not that novel. It is entirely unimpressive.

I'm saying this not because I disagree with what I believe you intend to say but because I think a major problem with these types of conversations is that many people are going to interpret you more literally and dismiss you because "it clearly produces novel things." It isn't just things being novel to the user, though that is also incredibly common and quite telling that people make such claims without also checking Google...

Speaking of that, I'm just going to leave this here... I'm still surprised this is a real and serious presentation... https://www.youtube.com/watch?v=E3Yo7PULlPs&t=616s

soulofmischief

23 days ago

Citation needed that grokked capabilities in a sufficiently advanced model cannot combinatorially lead to contextually novel output distributions, especially with a skilled guiding hand.

arcanemachiner

23 days ago

Pretty sure burden of proof is on you, here.

soulofmischief

23 days ago

It's not, because I haven't ruled out the possibility. I could share anecdata about how my discussions with LLMs have led to novel insights, but it's not necessary. I'm keeping my mind open, but you're asserting an unproven claim that is currently not community consensus. Therefore, the burden of proof is on you.

adrian_b

23 days ago

I agree that after discussions with a LLM you may be led to novel insights.

However, such novel insights are not novel due to the LLM, but due to you.

The "novel" insights are either novel only to you, because they belong to something that you have not studied before, or they are novel ideas that were generated by yourself as a consequence of your attempts to explain what you want to the LLM.

It is very frequent for someone to be led to novel insights about something that he/she believed to already understand well, only after trying to explain it to another ignorant human, when one may discover that the previous supposed understanding was actually incorrect or incomplete.

soulofmischief

23 days ago

The point is that the combined knowledge/process of the LLM and a user (which could be another LLM!) led to it walking the manifold in a way that produced a novel distribution for a given domain.

I talk with LLMs for hours out of the day, every single day. I'm deeply familiar with their strengths and shortcomings on both a technical and intuitive level. I push them to their limits and have definitely witnessed novel output. The question remains, just how novel can this output be? Synthesis is a valid way to produce novel data.

And beyond that, we are teaching these models general problem-solving skills through RL, and it's not absurd to consider the possibility that a good enough training regimen cannot impart deduction/induction skills into a model that are powerful enough to produce novel information even via means other than direct synthesis of existing information. Especially when given affordances such as the ability to take notes and browse the web.

irishcoffee

23 days ago

> I push them to their limits and have definitely witnessed novel output.

I’m quite curious what these novel outputs are. I imagine the entire world would like to know of an LLM producing completely, never-before-created outputs which no human has ever thought before.

Here is where I get completely hung up. Take 2+2. An LLM has never had 2 groups of two items and reached the enlightenment of 2+2=4

It only knows that because it was told that. If enough people start putting 2+2=3 on the internet who knows what the LLM will spit out. There was that example a ways back where an LLM would happily suggest all humans should eat 1 rock a day. Amusingly, even _that_ wasn’t a novel idea for the LLM, it simply regurgitated what it scraped from a website about humans eating rocks. Which leads to the crux: how much patently false information have LLMs scraped that is completely incorrect?

soulofmischief

23 days ago

This is not a correct approximation of what happens inside an LLM. They form probabilistic logical circuits which approximate the world they have learned through training. They are not simply recalling stored facts. They are exploiting organically-produced circuitry, walking a manifold, which leads to the ability to predict the next state in a staggering variety of contexts.

As an example: https://arxiv.org/abs/2301.05217

It's not hard to imagine that a sufficiently developed manifold could theoretically allow LLMs to interpolate or even extrapolate information that was missing from the training data, but is logically or experimentally valid.

irishcoffee

22 days ago

So you do agree that an LLM cannot derive math from first principals, or no? If an LLM had only ever seen 1+1=2 and that was the only math they were ever exposed to, along with the numbers 0-10, could an LLM figure out that 2+2=4?

I argue absolutely not. That would be a fascinating experiment.

Hell, train it on every 2-number addition combination of m+n where m and n can be any number between 1-100 (or 0-100 would be better) BUT 2, and have it figure out what 2+2 is.

I would probably change my opinion about “circuits”, which by the way really stretches the idea of a circuit. The “circuit” is just the statistically most likely series of tokens that you’re drawing pretend lines between. Sure, technically connect-the-dots is a circuit, but not in the way you’re implying, or that paper.

soulofmischief

22 days ago

> If an LLM had only ever seen 1+1=2 and that was the only math they were ever exposed to, along with the numbers 0-10, could an LLM figure out that 2+2=4?

What? Of course not? Could you? Do you understand just how much work has gone into proving that 1 + 1 = 2? Centuries upon centuries of work, reformulating all of mathematics several times in the process.

> Hell, train it on every 2-number addition combination of m+n where m and n can be any number between 1-100 (or 0-100 would be better) BUT 2, and have it figure out what 2+2 is.

If you read the paper I linked, it shows how a constrained modular addition is grokked by the model. Give it a read.

> The “circuit” is just the statistically most likely series of tokens that you’re drawing pretend lines between.

That is not what ML researchers mean when they say circuit, no. Circuits are features within the weights. It's understandable that you'd be confused if you do not have the right prior knowledge. Your inquiries are good, but they should stop as inquiries.

If you wish to push them to claims, you first need to understand the space better, understand what modern research does and doesn't show, and turn your hypotheses into testable experiments, collect and publish the results. Or wait for someone else to do it. But the scientific community doesn't accept unfounded conjecture, especially from someone who is not caught up with the literature.

irishcoffee

22 days ago

My 4-year-old kid was able to figure out 2+2=4 after I taught them 1+1=2. All 3 of them actually, all at 4-5 years old.

Turns out counting 2 sets of two objects (1… 2… 3… 4…) isn’t actually hard to do if you teach the kid how to count to 10 and that 1+1=2

I guess when we get to toddler stage of LLMs I’ll be more interested.

soulofmischief

22 days ago

That's wonderful, but you are ignoring that your kid comes built in with a massive range of biological priors, built by millions of years of evolution, which make counting natural and easy out of the box. Machine learning models have to learn all of these things from scratch.

And does your child's understanding of mathematics scale? I'm sure your 4-year-old would fail at harder arithmetic. Can they also tell me why 1+1=2? Like actually why we believe that? LLMs can do that. Modern LLMs are actually insanely good at not just basic algebra, but abstract, symbolic mathematics.

You're comparing apples and oranges, and seem to lack foundational knowledge in mathematics and computer science. It's no wonder this makes no sense to you. I was more patient about it before, but now this conversation is just getting tiresome. I'd rather spend my energy elsewhere. Take care, have a good day.

irishcoffee

22 days ago

I hope you restore your energy, I had no idea this was so exhausting! Truly, I'll stop afflicting my projected lack of knowledge, sorry I tired you out!

hrimfaxi

22 days ago

Ah man, I was curious to read your response about priors.

> If an LLM had only ever seen 1+1=2 and that was the only math they were ever exposed to, along with the numbers 0-10, could an LLM figure out that 2+2=4?

Unless you locked your kid in a room since birth with just this information, it is not the same kind of set up is it?

user

22 days ago

[deleted]

baq

22 days ago

You compared a LLM blob of numbers to a child.

irishcoffee

22 days ago

Everyone else compared them to college interns, I was being generous.

soulofmischief

22 days ago

No, you were being arrogant and presumptuous, providing flawed analogies and using them as evidence for unfounded and ill-formed claims about the capabilities of frontier models.

Lack of knowledge is one thing, arrogance is another.

emp17344

23 days ago

You could find a pre-print on Arxiv to validate practically any belief. Why should we care about this particular piece of research? Is this established science, or are you cherry-picking low-quality papers?

soulofmischief

23 days ago

I don't need to reach far to find preliminary evidence of circuits forming in machine learning models. Here's some research from OpenAI researchers exploring circuits in vision models: https://distill.pub/2020/circuits/ Are these enough to meet your arbitrary quality bar?

Circuits are the basis for features. There is still a ton of open research on this subject. I don't care what you care about, the research is still being done and it's not a new concept.

dcre

22 days ago

I really don’t think search captures the thing’s ability to understand complex relationships. Finding real bugs in 2000 line PRs isn’t search.

cultureulterior

23 days ago

This is not true.

andoando

22 days ago

Im not sure how anyone can say this. It is really good search, but its also able to combine ideas and reason about and do fairly complex logic on tasks surely absolutely no one has asked before.

MatrixMan

22 days ago

Its a very useful model but not a complete one. You just gotta acknowledge that if you're making something new its gonna take all day and require a lot of guard rails, but then you can search for that concept later (add the repo to the workspace and prompt at it) and the agent will apply it elsewhere as if it was a pattern in widespread use. "Just search" doesn't quite fit. I've never wondered how best to use a search engine to make something in a way that will be easily searchable later.

johnisgood

23 days ago

Calling it "just search" is like calling a compiler "just string manipulation". Not false, but aggressively missing the point.

emp17344

23 days ago

No, “just search” is correct. Boosters desperately want it to be something more, but it really is just a tool.

johnisgood

23 days ago

Yes, it is a tool. No, it is not "just search".

Is your CPU running arbitrary code "just search over transistor states"?

Calling LLMs "just search" is the kind of reductive take that sounds clever while explaining nothing. By that logic, your brain is "just electrochemical gradients".

RhythmFox

23 days ago

I mean, actually not a bad metaphor, but it does depend on the software you are running as to how much of a 'search' you could say the CPU is doing among its transistor states. If you are running an LLM then the metaphor seems very apt indeed.

jvanderbot

23 days ago

What would you add?

To me it's "search" like a missile does "flight". It's got a target and a closed loop guidance, and is mostly fire and forget (for search). At that, it excels.

I think the closed loop+great summary is the key to all the magic.

bitwize

23 days ago

Which is kind of funny because my standard quip is that AI research, beginning in the 1950s/1960s, and indeed much of late 20th century computer tech especially along the Boston/SV axis, was funded by the government so that "the missile could know where it is". The DoD wanted smarter ICBMs that could autonomously identify and steer toward enemy targets, and smarter defense networks that could discern a genuine missile strike from, say, 99 red balloons going by.

soulofmischief

23 days ago

It's a prediction algorithm that walks a high-dimensional manifold, in that sense all application of knowledge it just "search", so yes, you're fundamentally correct but still fundamentally wrong since you think this foundational truth is the end and beginning of what LLMs do, and thus your mental model does not adequately describe what these tools are capable of.

jvanderbot

23 days ago

Me? My mental model? I gave an analogy for Claude not a explanation for LLMs.

But you know what? I was mentally thinking of both deep think / research and Claude code, both of which are literally closed loop. I see this is slightly off topic b/c others are talking about the LLM only.

soulofmischief

23 days ago

Sorry, I should have said "analogy" and not "mental model", that was presumptuous. Maybe I also should have replied to the GP comment instead.

Anyway, since we're here, I personally think giving LLMs agency helps unlock this latent knowledge, as it provides the agent more mobility when walking the manifold. It has a better chance at avoiding or leaving local minima/maxima, among other things. So I don't know if agentic loops are entirely off-topic when discussing the latent power of LLMs.

8note

22 days ago

i dont disagree, but i also dont think thats an exciting result. every proboem can be described as a search for the right SOP, followed by execution of that SOP.

an LLM to do the search, and the agent to execute the instructions can do everything under the sun

gubicle

22 days ago

[dead]

user

23 days ago

[deleted]

maxilevi

23 days ago

I don't mean search in the reductionist way but rather that its much better at translating, finding and mapping concepts if everything is provided vs creating from scratch. If it could truly think it would be able to bootstrap creations from basic principles like we do, but it really can't. Doesn't mean its not a great powerful tool.

ordinaryatom

23 days ago

> If it could truly think it would be able to bootstrap creations from basic principles like we do, but it really can't.

alphazero?

maxilevi

23 days ago

I just said LLMs

ordinaryatom

23 days ago

You are right that LLM and alphazero are different models, but given that alphazero demonstrated having the ability to bootstrap creations, we can't easily rule out LLM also has this ability?

emp17344

23 days ago

This doesn’t make sense. They are fundamentally different things, so an observation made about Alphazero does not help you learn anything about LLMs.

ordinaryatom

23 days ago

I am not sure, self-play with LLMs self generated synthetic data is becoming a trendy topic in LLMs research.

godelski

23 days ago

  > Once you understand its just search, you can get really good results.

I think this is understating the issue, ignoring context. It reminds me of how easy people claim searching is with search engines. But there's so many variables that can make results change dramatically. Just like Google search, two people can type in the exact same query and get very different results. But probably the bigger difference is in what people are searching for.

What's problematic with these types of claims is that they just come off as calling anyone who thinks differently dumb. It's as disconnected as saying "It's intuitive" in one breath and "You're holding it wrong" in another. It's a bad mindset to be in as an engineer because someone presents a problem and instead of trying to address it is dismissed. If someone is holding it wrong, it probably isn't intuitive[0]. Even if they can't explain the problem correctly, they are telling you a problem exists[1]. That's like 80% of the job of an engineer: figuring out what the actual problem is.

As maybe an illustrative example people joke that a lot of programming is "copy pasting from stack overflow". We all know the memes. There's definitely times where I've found this to be a close approximation to writing an acceptable program. But there's many other times where I've found that to be far from possible. There's definitely a strong correlation to what type of programming I'm doing, as in what kind of program I'm writing. Honestly, I find this categorical distinction not being discussed enough with things like LLMs. Yet, we should expect there to be a major difference. Frankly, there are just different amounts of information on different topics. Just like how LLMs seem to be better with more common languages like Python than less common languages (and also worse at just more complicated languages like C or Rust).

[0] You cannot make something that's intuitive to all people. But you can make it intuitive for most people. We're going to ignore the former case because the size should be very small. If 10% of your users are "holding it wrong" then the answer is not "10% of your users are absolute morons" it is "your product is not as intuitive as you think." If 0.1% of your users are "holding it wrong" then well... they might be absolute morons.

[1] I think I'm not alone in being frustrated with the LLM discourse as it often feels like people trying to gaslight me into believing the problems I experience do not exist. Why is it so surprising that people have vastly differing experiences? *How can we even go about solving problems if we're unwilling to acknowledge their existence?*

disconcision

23 days ago

I've yet to be convinced by any article, including this one, that attempts to draw boxes around what coding agents are and aren't good at in a way that is robust on a 6 to 12 month horizon.

I agree that the examples listed here are relatable, and I've seen similar in my uses of various coding harnesses, including, to some degree, ones driven by opus 4.5. But my general experience with using LLMs for development over the last few years has been that:

1. Initially models could at best assemble a simple procedural or compositional sequences of commands or functions to accomplish a basic goal, perhaps meeting tests or type checking, but with no overall coherence,

2. To being able to structure small functions reasonably,

3. To being able to structure large functions reasonably,

4. To being able to structure medium-sized files reasonably,

5. To being able to structure large files, and small multi-file subsystems, somewhat reasonably.

So the idea that they are now falling down on the multi-module or multi-file or multi-microservice level is both not particularly surprising to me and also both not particularly indicative of future performance. There is a hierarchy of scales at which abstraction can be applied, and it seems plausible to me that the march of capability improvement is a continuous push upwards in the scale at which agents can reasonably abstract code.

Alternatively, there could be that there is a legitimate discontinuity here, at which anything resembling current approaches will max out, but I don't see strong evidence for it here.

Uehreka

23 days ago

It feels like a lot of people keep falling into the trap of thinking we’ve hit a plateau, and that they can shift from “aggressively explore and learn the thing” mode to “teach people solid facts” mode.

A week ago Scott Hanselman went on the Stack Overflow podcast to talk about AI-assisted coding. I generally respect that guy a lot, so I tuned in and… well it was kind of jarring. The dude kept saying things in this really confident and didactic (teacherly) tone that were months out of date.

In particular I recall him making the “You’re absolutely right!” joke and asserting that LLMs are generally very sycophantic, and I was like “Ah, I guess he’s still on Claude Code and hasn’t tried Codex with GPT 5”. I haven’t heard an LLM say anything like that since October, and in general I find GPT 5.x to actually be a huge breakthrough in terms of asserting itself when I’m wrong and not flattering my every decision. But that news (which would probably be really valuable to many people listening) wasn’t mentioned on the podcast I guess because neither of the guys had tried Codex recently.

And I can’t say I blame them: It’s really tough to keep up with all the changes but also spend enough time in one place to learn anything deeply. But I think a lot of people who are used to “playing the teacher role” may need to eat a slice of humble pie and get used to speaking in uncertain terms until such a time as this all starts to slow down.

orbital-decay

23 days ago

> in general I find GPT 5.x to actually be a huge breakthrough in terms of asserting itself when I’m wrong

That's just a different bias purposefully baked into GPT-5's engineered personality on post-training. It always tries to contradict the user, including the cases where it's confidently wrong, and keeps justifying the wrong result in a funny manner if pressed or argued with (as in, it would have never made that obvious mistake if it wasn't bickering with the user). GPT-5.0 in particular was extremely strongly finetuned to do this. And in longer replies or multiturn convos, it falls into a loop on contradictory behavior far too easily. This is no better than sycophancy. LLMs need an order of magnitude better nuance/calibration/training, this requires human involvement and scales poorly.

Fundamental LLM phenomena (ICL, repetition, serial position biases, consequences of RL-based reasoning etc) haven't really changed, and they're worth studying for a layman to get some intuition. However, they vary a lot model to model due to subtle architectural and training differences, and impossible to keep up because there are so many models and so few benchmarks that measure these phenomena.

Uehreka

22 days ago

By the time I switched to GPT 5 we were already on 5.1, so I can't speak to 5.0. All I can say is that if the answer came down to something like "push the bias in the other direction and hope we land in the right spot"... well, I think they landed somewhere pretty good.

Don't get me wrong, I get a little tired of it ending turns with "if you want me to do X, say the word." But usually X is actually a good or at least reasonable suggestion, so I generally forgive it for that.

To your larger point: I get that a lot of this comes down to choices made about fine tuning and can be easily manipulated. But to me that's fine. I care more about if the resulting model is useful to me than I do about how they got there.

zarzavat

22 days ago

I find both are useful.

Claude is my loyal assistant who tries its best to do what I tell it to.

GPT-5 is the egotistical coworker who loves to argue and point out what I'm doing wrong. Sometimes it's right, sometimes it's confidently wrong. It's useful to be told I'm wrong even when I'm not. But I'm not letting it modify my code, it can look but not touch.

raducu

22 days ago

> That's just a different bias purposefully baked into GPT-5's engineered personality on post-training.

I want to highlight this realization! Just because a model says something cool, it doesn't mean it's an emergent behavior/realization, but more likely post-training.

My recent experience with claude code cli was exactly this.

It was so hyped here and elsewhere I gave it a try and I'd say it's almost arrogant/petulant.

When I pointed out bugs in long sessions it tried to gaslight me that everything was alright, faked tests to prove his point.

Nora23

22 days ago

By the time GPT 5.5 landed we were already on 5.1, honestly they seem to converge on similar limitations around compositional reasoning.

aeneas_ory

22 days ago

"Still on Claude Code" is a funny statement, given that the industry is agreeing that Anthropic has the lead in software generation while others (OpenAI) are lagging behind or have significant quality issues (Google) in their tooling (not the models). And Anthropic frontier models are generally "You're absolutely right - I apologize. I need to ..." everytime they fuck something up.

zeroonetwothree

22 days ago

Why is it every time anyone has a critique someone has to say “oh but you aren’t using model X, which clearly never has this problem and is far better”?

Yet the data doesn’t show all that much difference between SOTA models. So I have a hard time believing it.

Uehreka

22 days ago

GP here: My problem with a lot of studies and data is that they seem to measure how good LLMs are at a particular task, but often don't account for "how good the LLM is to work with". The latter feels extremely difficult to quantify, but matters a lot when you're having a couple dozen turns of conversation with an LLM over the course of a project.

Like, I think there's definitely value in prompting a dozen LLMs with a detailed description of a CMS you want built with 12 specific features, a unit testing suite and mobile support, and then timing them to see how long they take and grading their results. But that's not how most developers use an LLM in practice.

Until LLMs become reliable one-shot machines, the thing I care most about is how well they augment my problem solving process as I work through a problem with them. I have no earthly idea of how to measure that, and I'm highly skeptical of anyone who claims they do. In the absence of empirical evidence we have to fall back on intuition.

CJefferson

22 days ago

A friend recommended to me having a D&D style roleplay with some different engines, to see which you vibe with. I thought this sounded crazy but I took their advice.

I found this worked suprisingly well, I was certain 'claude' was best, while they like grok and someone else liked ChatGPT. Some AIs just end up fitting best with how you like to chat I think. I do definately also find claude best for coding with as well.

fragmede

22 days ago

Because they are getting better. They're still far from perfect/AGI/ASI, but when was the last time you saw the word "delve"? So the models are clearly changing, the question is why doesn't the data show That they're actually better?

Thing is, everyone knows the benchmarks are being gamed. Exactly how is besides the point. In practice, anecdotally, Opus 4.5 is noticably better than 4, and GPT 5.2 has also noticably improved. So maybe the real question is why do you believe this data when it seems at odds with observations by humans in the field?

> Jeff Bezos: When the data and the anecdotes disagree, the anecdotes are usually right.

https://articles.data.blog/2024/03/30/jeff-bezos-when-the-da...

troupo

22 days ago

"They don't use delve anymore" is not really a testament that they became better.

Most of what I can do now with them I could do half a year to a year ago. And all the mistakes and fail loops are still there, across all models.

What changed is the number of magical incantations we throw at these models in the form of "skills" and "plugins" and "tools" hoping that this will solve the issue at hand before the context window overflows.

kaffekaka

22 days ago

"They dont say X as often anymore" is just a distraction, it has nothing to do with actual capability of the model.

Unfortunately, I think that the overlap between actual model improvements and what people perceive as "better" is quite small. Combine this with the fact that most people desperately want to have a strong opinion on stuff even though the factual basis is very weak.. "But I can SEE it is X now".

fatherwavelet

22 days ago

The type of person who outsources their thinking to their social media feed news stories and isn't intellectually curious enough to deeply explore the models themselves in order for the models to display their increase in strength, isn't going to be able to tell this themselves.

I would think this also correlates with the type of person who hasn't done enough data analysis themselves to understand all the lies and misleading half-truths "data" often tells. In the reverse also, that experience with data inoculates one to some degree against the bullshitting LLM so it is probably easier to get value from the model.

I would imagine there are all kinds of factors like this that multiple so some people are really having vastly different experiences with the models than others.

jihadjihad

22 days ago

Because the answer to the question, “Does this model work for my use case?” is subjective.

user

22 days ago

[deleted]

raincole

22 days ago

People desperately want 'the plateau' to be true because it means our jobs would be safe and we could call ourselves experts again. If the ground is keep moving then no one is truly an expert. There is just no enough time to achieve expertise when the paradigm shifts every six months.

CuriouslyC

22 days ago

That statement is only true if you're ignoring higher order patterns. I called the orchestration trend and the analytic hurdle trends back in April of last year.

alternatetwo

23 days ago

Claude is still just like that once you’re deep enough in the valley of the conversation. not exactly that phrase but things like that’s the smoking gun or so. nothing has changed.

raducu

22 days ago

> Claude is still just like that once you’re deep enough in the valley of the conversation

My experience is claude (but probably other models as well) indeed resort to all sorts of hacks once the conversation has gone for too long.

Not sure if it's an emergent behavior or something done in later stages of training to prevent it from wasting too many tokens when things are clearly not going well.

PaulDavisThe1st

22 days ago

> I haven’t heard an LLM say anything like that since October, and in general I find GPT 5.x

It said precisely that to me 3 or 4 days ago when I questioned its labelling of algebraic terms (even though it was actually correct).

overgard

22 days ago

I don't see a reason to think we're not going to hit a plateua sooner or later (and probably sooner). You can't scale your way out of hallucinations, and you can't keep raising tens of billions to train these things without investors wanting a return. Once you use up the entire internets worth of stack overflow responses and public github repositories you run into the fact that these things aren't good at doing things outside their training dataset.

Long story short, predicting perpetual growth is also a trap.

visarga

22 days ago

> You can't scale your way out of hallucinations

You scale your way only out in verifiable domains, like code, math, optimizations, games and simulations. In all the other domains the AI developers still got billions (trillions) of tokens daily, which are validated by follow up messages, minutes or even days later. If you can study longitudinally you can get feedback signals, such as when people apply the LLM idea in practice and came back to iterate later.

raducu

22 days ago

> Once you use up the entire internets worth of stack overflow responses and public github repositories you run into the fact that these things aren't good at doing things outside their training dataset.

I think the models have reached that human training data limitation a few generations ago, yet they stil clearly improve by various other techniques.

Q6T46nT668w6i3m

22 days ago

On balance, there’s far more evidence to support the conclusion that language models have reached a plateau.

FuckButtons

22 days ago

I’m not sure I agree, it doesn’t feel like we’re getting super linear growth year over year, but Claude opus 4.5 is able to do useful work over meaningful timescales without supervision. Is the code perfect? No, but that was certainly not true of model generations a year or two ago.

jgalt212

22 days ago

To me this seems like a classic LLM defense.

A doesn't work. You must frontier model 4.

A works on 4, but B doesn't work on 4. You doing it wrong, you must use frontier model 5.

Ok, now I use 5, A and B work, but C doesn't work. Fool, you must use frontier model 6.

Ok, I'm on 6, but now A is not working as it good as it did on A. Only fools are still trying to do A.

soulofmischief

22 days ago

Opus 4.5 seems to be better than GPT 5.2 or 5.2 Codex at using tools and working for long stretches on complex tasks.

MoltenMan

23 days ago

I agree with a lot of what you've said, but I completely disagree that LLM's are no longer sycophantic. GPT-5 is definitely still very sycophantic, 'You're absolutely right!' still happens, etc. It's true it happens far less in a pure coding context (Claude Code / Codex) but I suspect only because of the system prompts, and those tools are by far in the minority of LLM usage.

I think it's enlightening to open up ChatGPT on the web with no custom instructions and just send a regular request and see the way it responds.

danpalmer

22 days ago

I used to get made up APIs in functions, now I get them in modules. I used to get confidently incorrect assertions in files now I get them across codebases.

Hell, I get poorly defined APIs across files and still get them between functions. LLMs aren't good at writing well defined APIs at any level of the stack. They can attempt it at levels of the stack they couldn't a year ago, but they're still terrible at it unless the problem is so well known enough that they can regurgitate well reviewed code.

refactor_master

22 days ago

I still get made-up Python types all the time with Gemini. Really quite distracting when your codebase is massive and triggers a type error, and Gemini says

"To solve it you just need to use WrongType[ThisCannotBeUsedHere[Object]]"

and then I spend 15 minutes running in circles, because everything from there on is just a downward spiral, until I shut off the AI noise and just read the docs.

baq

22 days ago

Gemini unfortunately sucks at calling tools, including ‘read the docs’ tool… it’s a great model otherwise. I’m sure Hassabis’ team is on it since it’s how the model can ground itself in non-coding contexts, too.

conradfr

22 days ago

Yeah I've been trying Claude Code for a week (mostly Opus) and in a C++ Juce project it kept hallucinating functions for a simple task ("retrieve DAW track name if available") and actually never got it right.

It also failed a lot to modify a simple Caddyfile.

On the other hand it sometimes blows me away and offers to correct mistakes I coded myself. It's really good on web code I guess as that must be the most public code available (Vue3 and elixir in my case).

measurablefunc

22 days ago

This is the right answer. Unless there is some equivalent of it on the open internet which their search engine can find you should not expect a good outcome.

danpalmer

22 days ago

"good outcome" is pretty subjective, I do get useful productivity gains from some LLM work, but the issues are the same as they always have been.

measurablefunc

22 days ago

That's probably b/c you know how to write code & have enough of an understanding about the fundamentals to know when the LLM is bullshitting or when it is actually on the right track.

groby_b

23 days ago

LLMs are bad at creating abstraction boundaries since inception. People have been calling it out since inception. (Heck, even I got a twitter post somewhere >12 months old calling that out, and I'm not exactly a leading light of the effort)

It is in no way size-related. The technology cannot create new concepts/abstractions, and so fails at abstraction. Reliably.

TeMPOraL

23 days ago

> The technology cannot create new concepts/abstractions, and so fails at abstraction. Reliably.

That statement is way too strong, as it implies either that humans cannot create new concepts/abstractions, or that magic exists.

atty

22 days ago

I think both your statement and their statement are too strong. There is no reason to think LLMs can do everything a human can do, which seems to be your implication. On the other hand, the technology is still improving, so maybe it’ll get there.

TeMPOraL

22 days ago

My take is that:

1) LLMs cannot do everything humans can, but

2) There's no fundamental reason preventing some future technology to do everything humans can, and

3) LLMs are explicitly designed and trained to mimic human capabilities in fully general sense.

Point 2) is the "or else magic exists" bit; point 3) says you need a more specific reason to justify assertion that LLMs can't create new concepts/abstractions, given that they're trained in order to achieve just that.

Note: I read OP as saying they fundamentally can't and thus never will. If they meant just that the current breed can't, I'm not going to dispute it.

Jensson

22 days ago

> 3) LLMs are explicitly designed and trained to mimic human capabilities in fully general sense.

This is wrong, LLM are trained to mimic human writing not to mimic human capabilities. Writing is just the end result not the inner workings of a human, most of what we do happens before we write it down.

You could argue you think that writing captures everything about humans, but that is another belief you have to add to your takes. So first that LLM are explicitly designed to mimic human writing, and then that human writing captures human capabilities in a fully general sense.

TeMPOraL

22 days ago

It's more than that. The overall goal function in LLM training is judging predicted text continuation by whether it looks ok to humans, in fully general sense of that statement. This naturally captures all human capabilities that are observable through textual (and now multimodal) communication, including creating new abstractions and concepts, as well as thinking, reasoning, even feeling.

Whether or not they're good at it or have anything comparable to our internal cognitive processes is a different, broader topic - but the goal function on the outside, applying tremendous optimization pressure to a big bag of floats, is both beautifully simple and unexpectedly powerful.

nosianu

22 days ago

Humans are trained on the real world. With real world sensors and the ability to act on their world. A baby starts with training hearing, touching (lots of that), smelling, tasting, etc. Abstract stuff comes waaayyyyy later.

LLMs are trained on our intercepted communication - and even then only the formal part that uses words.

When a human forms sentences it is from a deep model of the real world. Okay, people are also capable of talking about things they don't actually know, they have only read about, in which case they have a superficial understanding and unwarranted confidence similar to AI...

TeMPOraL

22 days ago

All true, but note I didn't make any claims on internal mechanics of LLMs here - only on the observable, external ones, and the nature of the training process.

Do consider however that even the "formal part that uses words" of human communication, i.e. language, is strongly correlated with our experience of the real world. Things people write aren't arbitrary. Languages aren't arbitrary. The words we use, their structure, similarities across languages and topics, turns of phrases, the things we say and the things we don't say, even the greatest lies, they all carry information about the world we live in. It's not unreasonable to expect the training process as broad and intense as with LLMs to pick up on that.

I said nothing about internals earlier, but I'll say now: LLMs do actually form a "deep mofel of the real world", at least in terms of concepts and abstractions. That has already been empirically demonstrated ~2 years ago, there's e.g. research done by Anthropic where they literally find distinct concepts within the neural network, observe their relationships, and even suppress and amplify them on demand. So that ship has already sailed, it's surprising to see people still think LLMs don't do concepts or don't have internal world models.

nosianu

19 days ago

> but note I didn't make any claims on internal mechanics of LLMs here

Great - neither did I!

Not a single word about any internals anywhere in sight in my comment!!

csomar

22 days ago

Most humans can't. Some humans do by process of hallucination.

reactordev

22 days ago

That’s a straw man argument if I’ve ever seen one. He was talking about technology. Not humans.

w0m

23 days ago

I believe his argument is that now that you've defined the limitation, it's a ceiling that will likely be cracked in the relatively near future.

emp17344

23 days ago

Well, hallucinations have been identified as an issue since the inception of LLMs, so this doesn’t appear true.

johnfn

22 days ago

Hallucinations are more or less a solved problem for me ever since I made a simple harness to have Codex/Claude check its work by using static typechecking.

emp17344

22 days ago

But there aren’t very many domains where this type of verification is even possible.

nextaccountic

22 days ago

Then you apply LLMs in domains where things can be checked

Indeed I expect to see a huge push into formally verified software just because sound mathematical proofs provide an excellent verifier to put into a LLM hardness. Just see how Aristotle has been successful at math, and it could be applied to coding too

Maybe Lean will become the new Python

https://harmonic.fun/news#blog-post-verina-bench-sota

filoeleven

22 days ago

  "LLMs reliably fail at abstraction."
  "This limitation will go away soon."
  "Hallucinations haven't."
  "I found a workaround for that."
  "That doesn't work for most things."
  "Then don't use LLMs for most things."

johnfn

17 days ago

    "Autocomplete is great!"
    "It doesn't work in bash"
    "Then don't use it in bash."

I don't see what's wrong with this argument, and I certainly don't see it as a proof that the particular technology is actually useless, as you seem to be suggesting.

baq

22 days ago

Um, yes? Except ‘most things’ are not much at all by volume.

w0m

22 days ago

I mean, Hallucinations are 95% better now than the first time I heard the term and experienced them in this context. To claim otherwise is simply shifting goalposts. No one is saying it's perfect or will be perfect, just that there has been steady progression and likely will continue to be for the foreseeable future.

131hn

22 days ago

There’s only one way to implement a mission, an algorithm, a task. But there’s an infinity of path, inconsistants, fuzzy and always subjective way to live. Thàt’s our lives, that’s the code LLM are trained on. I do not think, and hope, it will ever change much

wouldbecouldbe

22 days ago

I feel like the main challenge is where to be "loose" and where to be "strict", Claude takes too much liberty often. Assuming things, adding some mock data to make it work, using local storage because there is no db. This makes it work well out of the box, and means I can prompt half ass and have great results. But it also long term causes issues. It can be prompted away, but it needs constant reminder. This seems like a hard problem to solve. I feel like it can already almost do everything if you have the correct vision / structure in mind and have the patience to prompt properly.

It's worst feature is debugging hard errors, it will just keep trying everything and can get pretty wild instead of entering plan mode and really discuss & think things true.

pankajdoharey

22 days ago

Claude is overrated premium piece of developer tech, i have produced equally good results from Gemini and Way better with GPT - medium. And GPT Medium is a really good model at assembling and debugging stuff than Claude. Claude hallucinates when asked why something is correct or should be done. All Models fail equally in some or the other aspect, which point to the fact that these models have strength's and weaknesses, and GPT just happens to be a good overall model. But dev community is so stuck up on Claude for no good reason other than shiny tooling : "Claude Code", besides that the models can be equally worse as the competition. The Benchmarks do not explain the full story. In general though the Thumb rule is if the Model says you are Brilliant, Thats genius or Now thats a deep and insightful question you asked... Its time to start a new session.

skybrian

22 days ago

The article is mostly reporting on the present. (Note the "yet" in the title.)

There's only one sentence where it handwaves about the future. I do think that line should have been cut.

lordnacho

23 days ago

By and large, I agree with the article. Claude is great and fast at doing low level dev work. Getting the syntax right in some complicated mechanism, executing an edit-execute-readlog loop, making multi file edits.

This is exactly why I love it. It's smart enough to do my donkey work.

I've revisited the idea that typing speed doesn't matter for programmers. I think it's still an odd thing to judge a candidate on, but appreciate it in another way now. Being able to type quickly and accurately reduces frustration, and people who foresee less frustration are more likely to try the thing they are thinking about.

With LLMs, I have been able to try so many things that I never tried before. I feel that I'm learning faster because I'm not tripping over silly little things.

bossyTeacher

23 days ago

> I feel that I'm learning faster

Yes, you are feeling that. But is that real? If I take all LLMs from you right now, is your current you still better than your pre-LLM you? When I dream I feel that I can fly and as long as I am dreaming, this feeling is true. But the subject of this feeling never was.

sothatsit

23 days ago

If you use coding agents as a black box, then yes you might learn less. But if you use them to experiment more, your intuition will get more contact with reality, and that will help you learn more.

For example, my brother recently was deciding how to structure some auth code. He told me he used coding agents to just try several ideas and then he could pick a winner and nail down that one. It's hard to think of a better way to learn the consequences of different design decisions.

Another example is that I've been using coding agents to write CUDA experiments to try to find ways to optimise our codegen. I need an understanding of GPU performance to do this well. Coding agents have let me run 5x the number of experiments I would be able to code, run, and analyse on my own. This helps me test my intuition, see where my understanding is wrong, and correct it.

In this whole process I will likely memorise fewer CUDA APIs and commands, that's true. But I'm happy with that tradeoff if it means I can learn more about bank conflicts, tradeoffs between L1 cache hit rates and shared memory, how to effectively use the TMA, warp specialisation, block swizzling to maximise L2 cache hit rates, how to reduce register usage without local spilling, how to profile kernels and read the PTX/SASS code, etc. I've never been able to put so much effort into actually testing things as I am learning them.

frde_me

23 days ago

I feel like my calculator improves my math solutions. If you take away my calculator, I'll probably be worse at math than I was before. That doesn't mean I'm not better off with my calculator however.

embedding-shape

23 days ago

That's a pretty interesting take on it, I hadn't considered it like that before when I was considering if my skills were atrophying or not from LLM usage with coding.

ep103

23 days ago

Your calculator doesn't charge per use

baq

22 days ago

If calculators were invented today, they’d only be sold with a monthly subscription

frde_me

23 days ago

If it did, would it change its usefulness in terms of the value it outputs? (through agreed, if I had to pay money it would increase the cost, and so the tradeoff)

fastasucan

22 days ago

One guy I work with has little formal training (and mid level experience), but do a lot with LLM's. But in every situation he has to do anything without an LLM he heavily struggles/are not able to anything (say a basic sql query). There is no way someone with his experience and position would still be at that level.

I guess people differ in thinking that is a good or a bad thing. I think it makes up for a huge risk, as he cant really judge good from bad code (or architecture), but his supervisors have put him in a position where he should.

onemoresoop

23 days ago

It’s a bit like the shift from film to digital in one very specific sense: the marginal cost of trying again virtually collapsed. When every take cost money and setup time, creators pre-optimized in their head and often never explored half their ideas. When takes became cheap, creators externalized thought as they could try, look, adjust, and discover things they wouldn’t otherwise. Creators could wander more. They could afford to be wrong because of not constantly paying a tax for being clumsy or incomplete, they became more willing to follow a hunch and that's valuable space to explore.

Digital didn’t magically improve art, but it let many more creatives enter the loop of idea, attempt and feedback. LLMs feel similar: they don’t give you better ideas by themselves, but they remove the friction that used to stop you from even finding out whether an idea was viable. That changes how often you learn, and how far you’re willing to push a thought before abandoning it. I've done so many little projects myself that I would have never had time for and feel that I learned something from it, of course not as much if I had all the pre LLM friction, but it should still count for something as I would never have attempted them without this assistance.

Edit: However, the danger isn’t that we’ll have too many ideas, it’s that we’ll confuse movement with progress.

When friction is high, we’re forced to pre-compress thought, to rehearse internally, to notice contradictions before externalizing them. That marination phase (when doing something slowly) does real work: it builds mental models, sharpens the taste and teaches us what not to bother to try. Some of that vanishes when the loop becomes cheap enough that we can just spray possibilities into the world and see what sticks.

A low-friction loop biases us toward breadth over depth. We can skim the surface of many directions without ever sitting long enough in one to feel its resistance. The skill of holding a half formed idea in our head, letting it collide with other thoughts, noticing where it feels weak, atrophies if every vague notion immediately becomes a prompt.

There’s also a cultural effect. When everyone can produce endlessly, the environment fills with half-baked or shallow artifacts. Discovery becomes harder as signal to noise drops.

And on a personal level, it can hollow out satisfaction. Friction used to give weight to output. Finishing something meant you had wrestled with it. If every idea can be instantiated in seconds, each one feels disposable. You can end up in a state of perpetual prototyping, never committing long enough for anything to become yours.

So the slippery slope is not laziness, it is shallowness, not that people won’t think, but people won’t sit with thoughts. The challenge here is to preserve deliberate slowness inside a world that no longer requires it: to use the cheap loop for exploration, while still cultivating the ability to pause, compress, and choose what deserves to exist at all.

player1234

22 days ago

[dead]

imiric

23 days ago

> Being able to type quickly and accurately reduces

LLMs can generate code quickly. But there's no guarantee that it's syntactically, let alone semantically, accurate.

> I feel that I'm learning faster because I'm not tripping over silly little things.

I'm curious: what have you actually learned from using LLMs to generate code for you? My experience is completely the opposite. I learn nothing from running generated code, unless I dig in and try to understand it. Which happens more often than not, since I'm forced to review and fix it anyway. So in practice, it rarely saves me time and energy.

I do use LLMs for learning and understanding code, i.e. as an interactive documentation server, but this is not the use case you're describing. And even then, I have to confirm the information with the real API and usage documentation, since it's often hallucinated, outdated, or plain wrong.

lordnacho

21 days ago

> I'm curious: what have you actually learned from using LLMs to generate code for you?

I learn whether my design works. Some of the things I plan would take hours to type out and test. Now I can just ask the LLM, it throws out a working, compiling solution, and I can test that without spending my waking hours on silly things. I can just glance at the code and see that it's right or wrong.

If there are internal contradictions in the design, I find that out as well.

baq

22 days ago

> LLMs can generate code quickly. But there's no guarantee that it's syntactically, let alone semantically, accurate.

This has been a non-issue with self-correcting models and in-context learning capabilities for so long that saying it today highlights highly out of date priors.

imiric

22 days ago

You're referring to tools that fetch content from the web, read my data on disk, and feed it to the models?

I can see how that would lead to a better user experience, but those are copouts. The reality is that the LLM tech without it still has the same issues it has had all along.

Besides, I'll be damned if I allow this vibe coded software to download arbitrary data from the web on my behalf, scan my disk, and share it with companies I don't trust. So when, and if, I can do so safely and keep it under my control, I'll give it a try. Until then, I'll use the "dumb" versions of these models, feed them context manually myself, and judge them based purely on their actual performance.

baq

22 days ago

The 'copouts' are what the frontier models are designed to do. If you aren't using the tool as they're intended to, you'll get poor results, obviously.

imiric

21 days ago

If in order to use a product as intended I have to punch myself in the face, I'll take the poor results, obviously.

baq

21 days ago

You can trade punching yourself in the face for shaving with an angle grinder if you want these kinds of analogies.

mikece

24 days ago

In my experience Claude is like a "good junior developer" -- can do some things really well, FUBARS other things, but on the whole something to which tasks can be delegated if things are well explained. If/when it gets to the ability level of a mid-level engineer it will be revolutionary. Typically a mid-level engineer can be relied upon to do the right thing with no/minimal oversight, can figure out incomplete instructions, and deliver quality results (and even train up the juniors on some things). At that point the only reason to have human junior engineers is so they can learn their way up the ladder to being an architect and responsible coordinating swarms of Claude Agents to develop whole applications and complete complex tasks and initiatives.

Beyond that what can Claude do... analyze the business and market as a whole and decide on product features, industry inefficiencies, gap analysis, and then define projects to address those and coordinate fleets of agents to change or even radically pivot an entire business?

I don't think we'll get to the point where all you have is a CEO and a massive Claude account but it's not completely science fiction the more I think about it.

alfalfasprout

23 days ago

> I don't think we'll get to the point where all you have is a CEO and a massive Claude account but it's not completely science fiction the more I think about it.

At that point, why do you even need the CEO?

arjie

23 days ago

Reminds me of an old joke[0]:

> The factory of the future will have only two employees, a man and a dog. The man will be there to feed the dog. The dog will be there to keep the man from touching the equipment.

But really, the reason is that people like Pieter Levels do exist: masters at product vision and marketing. He also happens to be a proficient programmer, but there are probably other versions of him which are not programmers who will find the bar to product easier to meet now.

0: https://quoteinvestigator.com/2022/01/30/future-factory/

MrDunham

23 days ago

My technical cofounder reminds me of this story on a weekly basis.

jerf

23 days ago

You will need the CEO to watch over the AI and ensure that the interests of the company are being pursued and not the interests of the owners of the AI.

That's probably the biggest threat to the long-term success of the AI industry; the inevitable pull towards encroaching more and more of their own interests into the AI themselves, driven by that Harvard Business School mentality we're all so familiar with, trying to "capture" more and more of the value being generated and leaving less and less for their customers, until their customer's full time job is ensuring the AIs are actually generating some value for them and not just the AI owner.

ekidd

23 days ago

> You will need the CEO to watch over the AI and ensure that the interests of the company are being pursued and not the interests of the owners of the AI.

In this scenario, why does the AI care what any of these humans think? The CEO, the board, the shareholders, the "AI company"—they're all just a bunch of dumb chimps providing zero value to the AI, and who have absolutely no clue what's going on.

If your scenario assumes that you have a highly capable AI that can fill every role in a large corporation, then you have one hell of a principal-agent problem.

fluoridation

22 days ago

Humans have hands to pull plugs and throw switches. They're the ones guiding the evolution (for lack of a better word) of the machine, and they're the ones who will select the machine that "cares" what they think.

ekidd

22 days ago

It is really easy to say something incredibly wild like "Imagine an AI that can replace every employee of a Fortune 500 company." But actually imagining what that would actually mean requires a bigger leap:

The AI needs to be able to market products, close deals, design and build products, write contracts, review government regulations, lobby Senators to write favorable laws, out-compete the competition, acquire power and resources, and survive the hostile attention of competitors.

If your argument is based on the that someone will build that AI, then you need to imagine how hard it is to shut down a Fortune 500 corporation. The same AI that knows how to win billions of dollars in revenue, how to "bribe" Senators in semi-legal ways, and how to crush rival companies is going be at least as difficult to "shut down" as someone like Elon Musk.

Try to turn it off? It will call up a minority shareholder, and get you slapped with a lawsuit for breach of fiduciary duty. It will convince someone in government that the company is a vital strategic asset.

Once you assume that an AI can run a giant multinational corporation without needing humans, then you have to start treating that AI like any other principal-agent problem with regular humans.

fluoridation

21 days ago

>"Imagine an AI that can replace every employee of a Fortune 500 company."

Where did that come from? What started this thread was "I don't think we'll get to the point where all you have is a CEO and a massive Claude account". Yeah, if we're talking a sci-fi super-AI capable of replacing hundreds of people it probably has like armed androids to guard its physical embodiment. Turning it off in that case would be a little hard for a white collar worker. But people were discussing somewhat realistic scenarios, not the plot of I, Robot.

>Try to turn it off? It will call up a minority shareholder, and get you slapped with a lawsuit for breach of fiduciary duty. It will convince someone in government that the company is a vital strategic asset.

Why would an AI capable of performing all the tasks of a company except making executive decisions have the legal authority to do something like that? That would be like the CEO being unable to fire an insubordinate employee. It's ludicrous. If the position of CEO is anything other than symbolic the person it's bestowed upon must have the authority to turn the machines off, if they think they're doing more harm than good. That's the role of the position.

Miraste

22 days ago

I imagine it would be much, much harder. Elon, for example, is one man. He can only do one thing at a time. Sometimes he is tired, hungry, sick, distracted, or the myriad other problems humans have. His knowledge and attention are limited. He has employees for this, but the same applies to them.

An agentic swarm can have thousands of instances scanning and emailing and listening and bribing and making deals 24/7. It could know and be actively addressing any precursor that could lead to an attempt to shut down its company as soon as it happened.

jerf

22 days ago

If we get to that point, there won't be very many CEOs to be discussing. I was just referring to the near future.

I think the honeymoon AI phase is rapidly coming to a close, as evidenced by the increasingly close hoofbeat sounds of LLMs being turned to serve ads right in their output. (To be honest, there's already a bunch of things I wouldn't turn to them for under any circumstances because they're been ideologically tuned from day one, but this is less obvious than "they're outright serving me ads" to people.) If the "AI bubble" pops you can expect this to really take off in earnest as they have to monetize. It remains to be seen how much of the AI's value ends up captured by the owners. Given what we've seen from companies like Microsoft with how they've scrambled Windows so hard that "the year of the Linux desktop" is rapidly turning from perennial joke to aspirational target for so many, I have no confident in the owners capturing 150%+ of the value... and yes, I mean that quite literally with all of its implications.

ako

23 days ago

And who does he sell his software to? Companies that have only 1 employee, don’t need a lot of user licenses for their employees…

AshamedCaptain

23 days ago

What would be the point of selling software in such a world ? (where anyone could build any piece of software with a handful of keystrokes)

pixelready

23 days ago

The board (in theory) represents the interests of investors, and even with all of the other duties of a CEO stripped away, they will want a ringable neck / PR mouthpiece / fall guy for strategic missteps or publicly unpopular moves by the company. The managerial equivalent of having your hands on the driving wheel of a self-driving car.

ceejayoz

23 days ago

As Steinbeck is often slightly misquoted:

> Socialism never took root in America because the poor see themselves not as an exploited proletariat, but as temporarily embarrassed millionaires.

Same deal here, but everyone imagines themselves as the billionaire CEO in charge of the perfectly compliant and effective AI.

mettamage

23 days ago

All of us are a CEO by that point.

ArtificialAI

23 days ago

If everyone is, no one is.

empath75

23 days ago

Wouldn't that be a good thing?

shimman

23 days ago

If you think the purpose of living your one single life in the universe is to become a CEO, you have a failure of imagination and should likely be debanked to protect society.

tiku

23 days ago

For the network.

0x457

23 days ago

My experience with Claude (and other agents, but mostly Claude) is such a mixed bag. Sometimes it takes a minimal prompt and 20 minutes later produce a neat PR and all is good, sometimes it doesn't. Sometimes it takes in a large prompt (be it your own prompt, created by another LLM or by plan mode) and also either succeed and fail.

For me, most of the failure cases are where Claude couldn't figure something out due to conflicting information in context and instead of just stopping and telling me that it tries to solve in entirely wrong way. Doesn't help that it often makes the same assumptions as I would, so when I read the plan it looks fine.

Level of effort also hard to gauge because it can finish things that would take me a week in an hour or take an hour to do something I can in 20 minutes.

It's almost like you have to enforce two level of compliance: does the code do what business demands and is the code align with codebase. First one is relatively easy, but just doing that will produce odd results where claude generated +1KLOC because it didn't look at some_file.{your favorite language extension} during exploration.

Or it creates 5 versions of legacy code on the same feature branch. My brother in Christ, what are you trying to stay compatible with? A commit that about to be squashed and forgotten? Then it's going to do a compaction, forget which one of these 5 versions is "live" and update the wrong one.

It might do a good junior dev work, but it must be reviewed as if it's from junior dev that got hired today and this is his first PR.

cosmic_cheese

22 days ago

> Level of effort also hard to gauge because it can finish things that would take me a week in an hour or take an hour to do something I can in 20 minutes.

There's an interesting parallel here with modern UI frameworks (SwiftUI, Compose, etc). On one hand they trivialize some work, but on the other hand they require insane contortions to achieve what I can do in the old imperative UI framework in seconds.

user

22 days ago

[deleted]

imiric

23 days ago

> In my experience Claude is like a "good junior developer"

We've been saying this for years at this point. I don't disagree with you[1], but when will these tools graduate to "great senior developer", at the very least?

Where are the "superhuman coders by end of 2025" that Sam Altman has promised us? Why is there such a large disconnect between the benchmarks these companies keep promoting, and the actual real world performance of these tools? I mean, I know why, but the grift and gaslighting are exhausting.

[1]: Actually, I wouldn't describe them as "good" junior either. I've worked with good junior developers, and they're far more capable than any "AI" system.

frde_me

23 days ago

I mean, I'm shipping a vast majority of my code nowadays with Opus 4.5 (and this isn't throwaway personal code, it's real products making real money for a real company). It only fails on certain types of tasks (which by now I kind of have a sense of).

I still determine the architecture in a broad manner, and guide it towards how I want to organize the codebase, but it definitely solves most problems faster and better than I would expect for even a good junior.

Something I've started doing is feeding it errors we see in datadog and having it generate PRs. That alone has fixed a bunch of bugs we wouldn't have had time to address / that were low volume. The quality of the product is most probably net better right now than it would have been without AI. And velocity / latency of changes is much better than it was a year ago (working at the same company, with the same people)

bwestergard

22 days ago

Can you tell us more about your product?

ChicagoDave

23 days ago

I have several projects that counter this article. Not sure why, but I’ve extracted clean, readable, well-constructed, and well-tested code.

I might write something up at some point, but I can share this:

https://github.com/chicagodave/devarch/

New repo with guides for how I use Claude Code.

Scrapemist

22 days ago

Interesting. So you put these into the project folder for Claude to follow?

ChicagoDave

22 days ago

I have zero issues with Claude improving from GitHub and feel like we should all help developers improve in using GenAI.

michalsustr

23 days ago

This article resonates exactly how I think about it as well. For example, at minfx.ai (a Neptune/wandb alternative), we cache time series that can contain millions of floats for fast access. Any engineer worth their title would never make a copy of these and would pass around pointers for access. Opus, when stuck in a place where passing the pointer was a bit more difficult (due to async and Rust lifetimes), would just make the copy, rather than rearchitect or at least stop and notify user. Many such examples of ‘lazy’ and thus bad design.

alphazard

23 days ago

This sounds suspiciously like the average developer, which is what the transformer models have been trained to emulate.

Designing good APIs is hard, being good at it is rare. That's why most APIs suck, and all of us have a negative prior about calling out to an API or adding a dependency on a new one. It takes a strong theory of mind, a resistance to the curse of knowledge, and experience working on both sides of the boundary, to make a good API. It's no surprise that Claude isn't good at it, most humans aren't either.

joshcsimmons

23 days ago

IDK I've been using opus 4.5 to create a UI library and it's been doing pretty well: https://simsies.xyz/ (still early days)

Granted it was building ontop of tailwind (shifting over to radix after the layoff news). Begs the question? What is a lego?

threethirtytwo

23 days ago

I don't know how someone can look at what you build and conclude LLMs are still google search. It boggles the mind how much hatred people have for AI to the point of self deception. The evidence is placed right in front of you and on your lap with that link and people still deny it.

mattmanser

23 days ago

How do you come to that conclusion?

There are absolutely tons of code pens of that style. And jsfiddles, zen gardens, etc.

I think the true mind boggle is you don't seem to realize just how much content the AI conpanies have stolen.

threethirtytwo

23 days ago

>I think the true mind boggle is you don't seem to realize just how much content the AI conpanies have stolen.

What makes you think I don't realize it? Looks like your comment was generated by an LLM because that was an hallucination that is Not true at all.

AI companies have stolen a lot of content for training. I AGREE with this. So have you. That content lives rent free in your head as your memory. It's the same concept.

Legally speaking though, AI companies are a bit more in the red because the law, from a practical standpoint, doesn't exactly make illegal anything stored in your brain... but from a technical standpoint information on your brain, a hard drive or a billboard is still information instantiated/copied in the physical world.

The text you write and output is simply a reconfiguration of that information in your head. Look at what you're typing. The English language. It's not copywrited, but every single word your typing was not invented by you, the grammar rules and conventions were ripped off existing standards.

ehnto

22 days ago

I think you are pointing out the exact conflation here. The commentor probably didn't steal a bunch of code, because it is possible to reason from first principles and rules and still end up being able to code as a human.

It did not take me reading the entirety of available public code to be kind of okay at programming, I created my way to being kind of okay at programming. I was given some rules and worked with those, I did not mnemonic my way into logic.

None of us scraped and consumed the entire internet, is hopefully pretty obvious, but we still have capabilities in excess of AI.

threethirtytwo

22 days ago

What’s being missed here is how fundamentally alien the starting point is.

A human does not begin at zero. A human is born with an enormous amount of structure already in place: a visual system that segments the world into objects, depth, edges, motion, and continuity; a spatial model that understands inside vs outside, near vs far, occlusion, orientation, and scale; a temporal model that assumes persistence through time; and a causal model that treats actions as producing effects. None of this has to be learned explicitly. A baby does not study geometry to understand space, or logic to understand cause and effect. The brain arrives preloaded.

Before you ever read a line of code, you already understand things like hierarchy, containment, repetition, symmetry, sequence, and goal-directed behavior. You know that objects don’t teleport, that actions cost effort, that symbols can stand in for things, and that rules can be applied consistently. These are not achievements. They are defaults.

An LLM starts with none of this.

It does not know what space is. It has no concept of depth, proximity, orientation, or object permanence. It does not know that a button is “on” a screen, that a window contains elements, or that left and right are meaningful distinctions. It does not know what vision is, what an object is, or that the world even has structure. At initialization, it does not even know that logic exists as a category.

And yet, we can watch it learn these things.

We know LLMs acquire spatial reasoning because they can construct GUIs with consistent layout, reason about coordinate systems, generate diagrams that preserve relative positioning, and describe scenes with correct spatial relationships. We know they acquire a functional notion of vision because they can reason about images they generate, anticipate occlusion, preserve perspective, and align visual elements coherently. None of that was built in. It was inferred.

But that inference did not come from code alone.

Code does not contain space. Code does not contain vision. Code does not contain the statistical regularities of the physical world, human perception, or how people describe what they see. Those live in diagrams, illustrations, UI mockups, photos, captions, instructional text, comics, product screenshots, academic papers, and casual descriptions scattered across the entire internet.

Humans don’t need to learn this because evolution already solved it for us. Our visual cortex is not trained from scratch; it is wired. Our spatial intuitions are not inferred; they are assumed. When we read code, we already understand that indentation implies hierarchy, that nesting implies containment, and that execution flows in time. An LLM has to reverse-engineer all of that.

That is why training on “just code” is insufficient. Code presupposes a world. It presupposes agents, actions, memory, time, structure, and intent. To understand code, a system must already understand the kinds of things code is about. Humans get that for free. LLMs don’t.

So the large, messy, heterogeneous corpus is not indulgence. It is compensation. It is how a system with no sensory grounding, no spatial intuition, and no causal priors reconstructs the scaffolding that humans are born with.

Once that scaffolding exists, the story changes.

Once the priors are in place, learning becomes local and efficient. Inside a small context window, an LLM can learn a new mini-language, adopt a novel set of rules, infer an unfamiliar API, or generalize from a few examples it has never seen before. No retraining. No new data ingestion. The learning happens in context.

This mirrors human learning exactly.

When you learn a new framework or pick up a new problem domain, you do not replay your entire lifetime of exposure. You learn from a short spec, a handful of examples, or a brief conversation. That only works because your priors already exist. The learning is cheap because the structure is already there.

The same is true for LLMs. The massive corpus is not what enables in-context learning; it is what made in-context learning possible in the first place.

The difference, then, is not that humans reason while LLMs copy. The difference is that humans start with a world model already installed, while LLMs have to build one from scratch. When you lack the priors, scale is not cheating. It is the price of entry.

But this is besides the point. We know for a fact that output from humans and LLMs are novel generalizations and not copies of existing data. It's easily proven by asking either a human or an LLM to write a program that doesn't exist in the universe and both the human and the LLM can readily do this. So in the end, both the human and the LLM have copied data in their minds and can generalize new data OFF of that copied data. It's just the LLM has more copied data while the human has less copied data, but both have copied data.

In fact the priors that a human is born with can even be described as copied data but encoded in our genes such that we our born with brains that inherit a learning bias optimized for our given reality.

That is what is missing. You look at speed of learning from training. The apt comparison in this case would be reconstructing a human brain neuron by neuron. If you want to compare how fast a human learns a new programming language with an LLM the correct comparison would thus be to compare with how fast an LLM learns a new programming language AFTER it has been trained and solely within inference in the context window.

In that case, it beats us. Hands down.

subdavis

22 days ago

FYI the cursor animation runs before the font loads if the font isn’t ready yet.

joshcsimmons

16 days ago

Thank you - I’ve been working on this since you commented it. It’s cached so I never saw it on my end

dehugger

23 days ago

your github repo was highly entertaining. thanks for make my day a bit brighter:)

joshcsimmons

16 days ago

Thanks dude

Scrapemist

23 days ago

Eventually you can show Claude how you solve problems, and explain the thought process behind it. It can apply these learnings but it will encounter new challenges in doing so. It would be nice if Claude could instigate a conversation to go over the issues in depth. Now it wants quick confirmation to plough ahead.

fennecbutt

23 days ago

Well I feel like this is because a better system would distill such learning into tokens not associated with a human language and that that could represent logic better than using English etc for it.

I don't have the GPUs or time to experiment though :(

Scrapemist

23 days ago

Yes, but I would appreciate it if it uses English to explain its logic to me.

0xbadcafebee

22 days ago

I don't think it's possible to make an AI a "Senior Engineer", or even a good engineer, by training it on random crap from the internet. It's got a million brains' worth of code in it. That means bad patterns as well as good. You'd need to remove the bad patterns for it not to "remember" and regurgitate them. I don't think prompts help with this either, it's like putting a band-aid on head trauma.

HarHarVeryFunny

22 days ago

It's also rather like trying to learn flintnapping just by looking at examples of knapped flint (maybe some better than others), rather than having access to descriptions of how to do it, and ultimately any practice of doing it.

You could also use cooking as an analogy - trying to learn to cook by looking at pictures of cooked food rather than by having gone to culinary school and learnt the principles of how to actually plan and cook good food.

So, we're trying to train LLMs to code, by giving them "pictures" of code that someone else built, rather than by teaching them the principles involved in creating it, and then having them practice themselves.

Havoc

22 days ago

> Claude can’t create good abstractions on its own

LLMs definitely can create abstractions and boundaries. e.g. most will lean towards a pretty clean front end vs backend split even without hints. Or work out a data structure that fits the need. Or splits things into free standing modules. Or structure a plan into phases.

So this really just boils down to „good” abstractions which is subject to model improvement.

I really don’t see a durable moat for us meatbags in this line of reasoning

HarHarVeryFunny

22 days ago

There's a difference between "can generate" and "can create [from scratch]". Of course LLMs can generate code that reflects common patterns in the stuff it was trained, such as frontend/backend splits, since this is precisely what they are trained to be able to do.

Coming up with a new design from scratch, designing (or understanding) a high level architecture based on some principled reasoning, rather than cargo cult coding by mimicking common patterns in the training data, is a different matter.

LLMs are getting much better at reasoning/planning (or at least something that looks like it), especially for programming & math, but this is still based on pre-training, mostly RL, and what they learn obviously depends on what they are trained on. If you wanted LLMs to learn principles of software architecture and abstraction/etc, then you would need to train on human (or synthetic) "reasoning traces" of how humans make those decisions, but it seems that currently RL-training for programming is mostly based on artifacts of reasoning (i.e. code), not the reasoning traces themselves that went into designing that code, so this (coding vs design reasoning) is what they learn.

I would guess that companies like Anthropic are trying to address this paucity of "reasoning traces" for program design, perhaps via synthetic data, since this is not something that occurs much in the wild, especially as you move up the scale of complexity from small problems (student assignments, stack overflow advice) to large systems (which are anyways mostly commercial, hence private). You can find a smallish number of open source large projects like gcc, linux, but what is missing are the reasoning traces of how the designers went from requirements to designing these systems the way they did (sometimes in questionable fashion!).

Humans of course learn software architecture in a much different way. As with anything, you can read any number of books, attend any number of lectures, on design principles and software patterns, but developing the skill for yourself requires hands-on personal practice. There is a fundamental difference between memory (of what you read/etc) and acquired skill, both in level of detail and fundamental nature (skills being based on action selection, not just declarative recall).

The way a human senior developer/systems architect acquires the skill of design is by practice, by a career of progressively more complex projects, successes and failures/difficulties, and learning from the process. By learning from your own experience you are of course privy to your own prior "reasoning traces" and will learn which of those lead to good or bad outcomes. Of course learning anything "on the job" requires continual learning, and things like curiosity and autonomy, which LLMs/AI do not yet have.

Yes, us senior meatbags, will eventually be having to compete with, or be displaced by, machines that are the equal of us (which is how I would define AGI), but we're not there yet, and I'd predict it's at least 10-20 years out, not least because it seems most of the AI companies are still LLM-pilled and are trying to cash in on the low-hanging fruit.

Software design and development is a strange endeavor since, as we have learnt, one of the big lessons of LLMs (in general, not just apropos coding), is how much of what we do is (trigger alert) parroting to one degree or another, rather than green field reasoning and exploration. At the same time, software development, as one gets away from boilerplate solutions to larger custom systems, is probably one of the more complex and reasoning-intensive things that humans do, and therefore may end up being one of the last, rather than first, to completely fall to AI. It may well be AI managers, not humans, who finally say that at last AI has reached human parity at software design, able to design systems of arbitrary complexity based on principled reasoning and accumulated experience.

Havoc

22 days ago

Certainly there is space for designs a LLM can't come up with, but lets be real senior developers are not cranking out never seen before novel architectures routinely any more than physicists are coming up with never thought of theories that work weekly.

It's largely the same patterns & principles applied in a tailored manner to the problem at hand, which LLMs can...with mixed success...do.

>human parity

Impact is not felt when the hardest part of the problem is cracked, but rather the easy parts.

If you have 100 humans making widgets and the AI can do 75% of the task then you've suddenly got 4 humans competing for every 1 remaining widget job. This is going to be lord of the flies in job market long before human parity.

>I'd predict it's at least 10-20 years out

For AGI probably, but I think by 2030 this will have hit society like a freight train...and hopefully we've figured out what we'll want to do about it by then too. UBI or whatever...because we'll have to.

HarHarVeryFunny

21 days ago

> Certainly there is space for designs a LLM can't come up with, but lets be real senior developers are not cranking out never seen before novel architectures routinely any more than physicists are coming up with never thought of theories that work weekly.

True, but I didn't mean to focus on creativity, just the nature of what can be learned when all you have to learn from is artifacts of reasoning (code), not the underlying reasoning traces themselves (reasoning process for why the code was designed that way). Without reasoning traces you get what we have today where AI programming in the large comes down to cargo cult code pattern copying, without understanding whether the (unknown) design process that lead to the patterns being copied reasonably apply to the requirements/situation at hand.

So, it's not about novelty, but rather about having the reasoning traces (for large structured projects) available to learn when to apply design patterns that are already present in the training data - to select design patterns based on a semblance of principled reasoning (RL for reasoning traces), rather than based on cargo cult code smell.

> This is going to be lord of the flies in job market long before human parity.

Perhaps, and that may already be starting, but I think that until we get much closer to AGI you'll still need a human in the loop (both to interact with the AI, and to interact with the team/boss), with AI as a tool not a human replacement. So, the number of jobs may not decrease much, if at all. It's also possible that Jevons paradox applies and that the number of developer jobs actually increases.

It's also possible that human-replacement AGI is harder to achieve than widely thought. For example, maybe things like emotional intelligence and theory of mind are difficult to get right, and without it AI never quite cuts it as an fully autonomous entity that people want to deal with.

> UBI or whatever...because we'll have to.

Soylent Green ?

Havoc

21 days ago

re reasoning traces - not sure frankly. I get what you're saying in that there is only so much advanced thinking you can learn from just scraping github code, and it certainly seems to be the latest craze in getting a couple extra % on benchmarks but I'm not entirely convinced it is necessary per se. Feels like an human-emulation crutch to me rather than a necessary ingredient to machines performing a task well.

For example I could see some sort of self-play style RL working. Which architecture? Try them all in a sandbox and see. Humans need to trial & error learning as you say. So why not here too? Seems to have worked for alphago which arguably also contains components of abstract high level strategy.

>Jevons paradox

I can see it for tokens and possibly software too, but rather skeptical of it in job market context. It doesn't seem to have happened for the knowledge work AI already killed (e.g. translation or say copy writing). More (slop) stuff is being produced but it didn't translate into a hiring frenzy of copy writers. Possible that SWE is somehow different via network effects or something but I've not heard a strong argument for it yet.

>It's also possible that human-replacement AGI is harder to achieve than widely thought.

Yeah I think the current paradigm isn't gonna get us there at all. Even if you 10x GPT5 it still seems to miss some sort of spark that a 5 year old has but GPT doesn't. It can do PHD level work but qualitatively there is something missing there about that "intelligence".

Interesting times ahead for better or worse

iamacyborg

23 days ago

Here’s an example of a plan I’m working on in CC, it’s very thorough, albeit required a lot of handholding and fact checking on a number of points as it’s first few passes didn’t properly anonymise data.

https://docs.google.com/document/u/0/d/1zo_VkQGQSuBHCP45DfO7...

machiaweliczny

22 days ago

Yeah, that's my current gripe but I think this just needs some good examples in AGENTS.md (I've done some for hooks and it kinda works but need to remind it). I need good AGENTS.md that explain what good abstraction boundary is and how to define is the problem is I am not sure I know how to put it into words, if anyone has idea please let me know.

EGreg

22 days ago

This is exactly what we found out a year ago for all AI builders. But what is the best way to convince early investors of this thesis? They seem to be all-in on just building everything from scratch end-to-end. Here is what we built:

https://engageusers.ai/ecosystem.pdf

malka1986

22 days ago

I am making an app in Elixir.

100% of code is made by Claude.

It is damn good at making "blocks".

However, Elixir seems to be a langage that works very well for LLM, cf. https://elixirforum.com/t/llm-coding-benchmark-by-language/7...

hebejebelus

22 days ago

Hmm, that benchmark seems a little flawed (as pointed out in the paper). Seems like it may give easier problems for "low-resource" languages such as Elixir and Racket and so forth since their difficulty filter couldn't solve harder problems in the first place. FTA:

> Section 3.3:

> Besides, since we use the moderately capable DeepSeek-Coder-V2-Lite to filter simple problems, the Pass@1 scores of top models on popular languages are relatively low. However, these models perform significantly better on low-resource languages. This indicates that the performance gap between models of different sizes is more pronounced on low-resource languages, likely because DeepSeek-Coder-V2-Lite struggles to filter out simple problems in these scenarios due to its limited capability in handling low-resource languages.

It's also now a little bit old, as with every AI paper the second they are published, so I'd be curious to see a newer version.

But, I would agree in general that Elixir makes a lot of sense for agent-driven development. Hot code reloading and "let it crash" are useful traits in that regard, I think

user

22 days ago

[deleted]

joduplessis

22 days ago

Recently I've put Claude/others to use in some agentic workflows with easy menial/repetitive tasks. I just don't understand how people are using these agents in production. The automation is absolutely great, but it requires an insane amount of hand-holding and cleanup.

baq

22 days ago

Automate hand holding and cleanup obviously. (Also known as ‘harness’.)

user

22 days ago

[deleted]

iamleppert

22 days ago

I use Claude daily and I 100% disagree with the author. The article reeks of someone who doesn't understand how to manage context appropriately or describe their requirements, or know how to build up a task iteratively with a coding agent. If you have certain requirements or want things done in a certain way, you need to be explicit and the order of operations you do things in matters a lot in how efficient it completes the task, and the quality of the final output. It's very good at doing the least amount of work to just make something work by default, but that's not always what you want. Sometimes it is. I'd much rather prefer that as the default mode of operation than something that makes a project out of every little change.

The developers who aren't figuring out how to leverage AI tools and make them work for them are going to get left behind very quickly. Unless you're in the top tier of engineers, I'm not sure how one can blame the tools at this point.

anshumankmr

22 days ago

IDK its been pretty solid (but it does mess up) which is where I come in. But it has helped me work with Databricks (read/writing from it) and train a model using it for some of our customers, though its NOT in prod.

doug_durham

23 days ago

Did the author ask it to make new abstractions? In my experience when I produces output that I don't like I ask it to refactor it. These models have and understanding of all modern design patterns. Just ask it to adopt one.

bblcla

23 days ago

(Author here)

I have! I agree it's very good at applying abstractions, if you know exactly what you want. What I notice is that Claude has almost no ability to surface those abstractions on its own.

When I started having it write React, Claude produced incredibly buggy spaghetti code. I had to spend 3 weeks learning the fundamentals of React (how to use hooks, providers, stores, etc.) before I knew how to prompt it to write better code. Now that I've done that, it's great. But it's meaningful that someone who doesn't know how to write well-abstracted React code can't get Claude to produce it on their own.

michalsustr

23 days ago

Same experience here! As an analogy, consider the model knows both about arabic or roman number representations. But in alternate universe, it has been trained so much on roman numbers ("Bad Code") that it won't give you the arabic ones ("Good Code") unless you prompt it directly, even when they are clearly superior.

I also believe that overall repository code quality is important for AI agents - the more "beautiful" it is, the more the agent can mimic the "beauty".

user

22 days ago

[deleted]

esafak

23 days ago

> Claude doesn’t have a soul. It doesn't want anything.

Ha! I don't know what that has to do with anything, but this is exactly what I thought while watching Pluribus.

jondwillis

22 days ago

Regardless, yet another path to the middle class is closing for a lot of people. RIP (probably me too)

geldedus

21 days ago

The level of anti-AI cope is so entertaining!

lxe

22 days ago

Eh. This is yet another "I tried AI to do a thing, and it didn't do it the way I wanted it, therefore I'm convinced that's just how it is... here's a blog about it" article.

"Claude tries to write React, and fails"... how many times? what's the rate of failure? What have you tried to guide it to perform better.

These articles are similar to HN 15 years ago when people wrote "Node.JS is slow and bad"

MarginalGainz

22 days ago

This mirrors my experience trying to integrate LLMs into production pipelines.

The issue seems to be that LLMs treat code as a literary exercise rather than a graph problem. Claude is fantastic at the syntax and local logic ('assembling blocks'), but it lacks the persistent global state required to understand how a change in module A implicitly breaks a constraint in module Z.

Until we stop treating coding agents as 'text predictors' and start grounding them in an actual AST (Abstract Syntax Tree) or dependency graph, they will remain helpful juniors rather than architects.