hackernews client

Usually I describe the problem, explore a bit with LLM iteratively. Then I switch to creating a plan when I have enough insight (and the LLM has it in context/same session as exploration), specifying all the things I'm trying to accomplish.

Then I just iterate with LLM - I let it start writing stuff in YOLO mode and check on what it's doing in the code steering it in the direction I want.

Usually the code LLM generates will work but is kind of garbage - but I can easily steer it towards better implementations.

11 days ago

I totally agree. I loved coding because of its closed feedback loop. Since last November, I also delegated it mostly to agents. Now I concentrate more on the design part, which is not the same. However, you move with the times and hope something else will become exciting. I do not know a more worthwhile and satisfying way than computing to spend my work hours.

OakNinja

11 days ago

To me, LLMs free up time so that I can spend it on the fun parts of coding. Less boilerplate, more focus on the interesting problems. This is no different from using high level languages. The problem domain is less around memory management and garbage collection and closer to the problem you're actually trying to solve.

dawnerd

11 days ago

apsurd

11 days ago

it can be easily dismissed because "anyone can use the tool that costs $20" makes no meaningful sense.

this was always true; in fact, $20 is more than the nothing that notepad++ costs

it's a flippant statement. Go down the line of any tool; its cost has basically nothing to do with the skill difference to operate it. See basically everything. There's levels.

IshKebab

11 days ago

I have no idea what you're trying to say. If anyone really can vibe code then programming salaries are pretty much guaranteed to come down. The critical question is whether it really is true that anyone can do it, or if it still requires rare skill.

11 days ago

Someone competent using them is a requirement today, and for a while that will make the marginal utility of skilled workers greater than that of unskilled ones. The justification is that they are much more productive than they were before.

yieldcrv

11 days ago

11 days ago

11 days ago

> Before opus 4.5, I was doing a lot of hand holding and was coding a lot myself, but I have not written code since that day more or less.

I still must hand hold it every day, as it always does things wrong. Especially after it got seriously nerfed in March.

Note: experiences vary a lot depending on the programming language used, and projects. And the experience of the person coding.

jackzhuo

9 days ago

Same experience here. I now think AI writes much better code than me. So I shifted my focus to finding requirements, analyzing possibilities, and making good plans.

bluegatty

11 days ago

Paradox - you can get multiple inflection points even as systems start to have diminishing marginal returns in core capability. I think this is due to 'threshold crossing', where something 'becomes good enough for a specific purpose' and that just unlocks capabilities.

'Nail Guns' used to be heavy, required heavy power cords, they were extremely expensive. When they got lighter, cheaper, battery pack ... at some point, they blend seamlessly into the roofers process, and multiply dramatically the work that can be done. Marginal improvements beyond that may not yield the same 'unlocks' because the threshold has been crossed.

asdff

11 days ago

Nitpick but commercial roofers prefer pneumatic over battery.

smackeyacky

11 days ago

This is a great analogy. Jan/Feb this year was when the models crossed from useful to essential.

magicalhippo

11 days ago

I've "vibed" some non-trivial stuff lately using a combination of Codex with 5.5 and Claude Code with Opus 4.7.

Key has been to spend a fair amount of time on initial overall design document, which is split into tangible and limited phases. I go back and forth between them on this document until we're all happy.

For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered. This becomes input to next phase.

I do check the documents, and what they're doing. I also check the tests, some more thorough. And some spot checks on the code to see if I like the structure.

I have mainly used Claude for coding and Codex for design and code review after phases. I ask both to check test coverage after phases.

Managed to implement some tools and libraries without writing a single line of code this way, which have been very beneficial to us.

Since it's so async I can work on other stuff while they plod along.

I think it's not universal though. But stuff that can be tested easily and which you have a firm grasp of what you want to achieve, but not necessarily exactly how, that I've been impressed with.

WesolyKubeczek

11 days ago

> Key has been to spend a fair amount of time on initial overall design document, which is split into tangible and limited phases.

> For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered.

> I do check the documents, and what they're doing. I also check the tests, some more thorough.

Sounds like programming, but with extra steps.

magicalhippo

11 days ago

It's software development, but with much less actual programming (in my case none).

When I said I check the documents, the initial design document was the only one I really took a hard look at. The intermediary ones I just skimmed, looking for red flags or something I had forgotten to tell them. Those documents served as a basis for their work, and as a record of what was done.

Overall I spent perhaps a few hours on each project, over the course of a few days. I'd check in every half hour or whenever I had time, tell Claude "Great, let's do the next deliverable", or GPT "We're done with phase 4, please do a detailed code review, reference the design document and documentation of previous phases". Then I'd leave them cooking.

dawnerd

11 days ago

Also the least fun part of development. Maybe I’m the weird one but I like to just jump right in, planning every last detail before writing code is boring.

magicalhippo

11 days ago

For me, the fun in programming is sometimes to actually write code, solving a problem in a specific way or try some new approach. Other times the fun is to create something that works, and the code is more a means to an end.

Snippets, code generators, and copy-paste give me samples that I can trust, although I may need to edit them. But an LLM doesn’t. And I’m doubly doubtful when it’s something I’m not familiar with.

mrcsharp

11 days ago

> planning every last detail before writing code is boring

Not only that but you can't really plan everything. It is impossible. Without LLMs, with every line of code you are making a decision or discovering something new that must be dealt with or realizing how the current thing might impact something else and so on.

There is no way for a programmer to consider all of these little things ahead of time and if an attempt is made, it will take as long as actually writing that code.

magicalhippo

11 days ago

> Without LLMs, with every line of code you are making a decision or discovering something new that must be dealt with or realizing how the current thing might impact something else and so on.

Part of this is true, part of it the agents catch at least a non-trivial portion of. If you prompt it to do a review, especially with a specific angle like ensuring sustained write performance, or how it will work when the future extensions are implemented, they do often catch a lot of issues.

I agree you lose a fair bit of the sense of "it feels like I'm doing something wrong", or "this doesn't seem optimal" etc. I think the skill in using these tools is to determine when you need that control and where it doesn't really matter.

11 days ago

None of it is non-trivial tho. You might think so, but it’s not.

magicalhippo

11 days ago

It wasn't trivial in that I used a lot of my programming and domain knowledge, both when iterating on the design document and skimming implementation plans.

I didn't use it often, but when it was needed it was needed.

ryanjshaw

11 days ago

I find it gets you past the starting line but when you dig into the code it’s a mess of duplicated code, muddled responsibilities, poor architecture, 10k line files that eat your tokens, etc.

I’m building something using LLMs to scrape websites/socials for unstructured event data from combined text/images and the only way I’ve managed to get 100% consistent results for a reasonable cost is to break the task down into very small pieces that reduce the scope of mistakes significantly.

At present, for reasonably complex tasks, Codex/Claude will happily code you into an expensive corner.

ben_w

11 days ago

Indeed. To add to this, the obvious solution (ask the AI to break down the tasks to whatever METR says they'd be capable of 80% of the time) is of limited utility, as the AI are only so-so at estimating task complexity.

(Even when they're getting the planning part right, I do also recommend checking the LLM-generated unit tests, because in my experience some of those are "regex the source code" not "execute functions and check outputs").
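To make that distinction concrete, here is a minimal sketch of the two test styles. The `add` implementations and helper names are hypothetical, made up for illustration rather than taken from any real LLM output. The point is only that a test which pattern-matches the source text can pass on buggy code, while a test that runs the function does not.

```python
import re

# Two hypothetical implementations, kept as source text so that both
# test styles can look at the same thing.
GOOD_SOURCE = "def add(a, b):\n    return a + b\n"
BUGGY_SOURCE = "def add(a, b):\n    return a + b - 1\n"


def regex_style_test(source: str) -> bool:
    """Weak: passes if the source text merely looks right."""
    return re.search(r"return\s+a\s*\+\s*b", source) is not None


def behaviour_style_test(source: str) -> bool:
    """Stronger: load the function, call it, and compare real outputs."""
    namespace: dict = {}
    exec(source, namespace)  # define add() from the source text
    add = namespace["add"]
    return add(2, 3) == 5 and add(-1, 1) == 0


if __name__ == "__main__":
    for label, src in (("good", GOOD_SOURCE), ("buggy", BUGGY_SOURCE)):
        print(label, regex_style_test(src), behaviour_style_test(src))
```

The regex-style check accepts both versions, because the buggy one still contains the text it is looking for. Only the behaviour-style check tells them apart.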

minimaxir

11 days ago

I divide the work to fit within that 100k and use subagents for the tasks.

danielbln

11 days ago

In my experience it's more like 400-500k tokens.

ReptileMan

11 days ago

Anecdata of 1 but it is real. At the end of last year they passed some invisible threshold and became useful. I don't think it is models themselves, but mostly the much more powerful harnesses and I guess their tool calling abilities.

What changed, I think, was the context harvesting capability of the models. For most programmers, debugging and figuring out how something works were the time consuming part - the fix was usually trivial. And now models can do in seconds what took a developer an hour or more.

If right now we create a smart grep that just takes everything for a piece of code and outlaw llm-s we will not regress to the previous level. The developers needed this context as much as llm-s to do their job.

iLoveOncall

11 days ago

It is all marketing. The easiest way to tell is that a year ago the same people said the inflection point was X or Y model.

When people claim LLMs just don't work for them, the first question is whether they're using the latest model or not, and if not, dismissing the poster.

The thing is that that same question was being asked a year ago, and even a year before that, but with the models that lead to a dismissal today.

Just make the experiment yourself, wait 6 months, say LLMs just aren't working for the software engineering that you do, and people will dismiss you if you say that you use Opus 4.5 and not the latest model Claude MegaMind 8.8 pro max gigathinking. Despite this model being touted as the inflection point in this article.

harshitaneja

11 days ago

I think it's because both sides are talking about different things. If you go in expecting it to be good enough to make developers obsolete today (a reasonable impression to get from the way a lot of people hype it), you would be disappointed, and after the first couple of tries every few months you would probably not try it much with the next generations. Reasonable if it's considered a dichotomy.

But a lot of people excited about new generations (including me, now) are not seeing it as a dichotomy but rather a spectrum where models are getting better, and indeed once a year or even every 6 months at times there comes a sudden growth which feels like an inflection point from what came before. Practically, it's a tool like any other: you evaluate it based on whether it's worth the effort and cost for the benefit you get from it, and if it is and has a good DX you use it. If the calculation doesn't work for you, it doesn't. For me, it has gone from a novelty, to good for some kind of quick manual search, to I guess it can debug some kind of errors at times in very specific conditions, to hey I think I am getting a bit addicted to the autocomplete in the IDE provided by them even if I don't use them for anything intelligent but it's becoming indispensable now but only this part, to it's good for areas I lack expertise in, to agentic sucks I will stick with discussing algorithms and architecture with it on greenfield projects, to holy shit it can do agentic decently well now but I am skeptical of giving it access in more than limited cases, to now I am getting close to letting it run free on my device in the not so distant future I guess. Some of these were big jumps, and at each point I was skeptical of growth. Every time I thought the growth would now slow down, from the days of 2k context windows to millions now. From basic chat completion to working on complex adaptive systems, game theoretic modelling, heuristics and constraint modelling and other things I throw at it. I am still needed in the loop; it can be so smart at times and then will do something so stupid, but the frequency of stupidity is rapidly decreasing. I am still needed, I don't think it could accomplish alone all that it has done for me. But I do at times at night remain awake reflecting on my self worth for the potential day when I don't add that value. When I have a harder time keeping up.

Also, had someone told me even in 2019 that in 2026 we could have NLP models do what they do today, I would have dismissed it all as sci-fi, and here I am waking up in awe of the world we live in and how quickly we adapt.

iLoveOncall

11 days ago

You're completely twisting what I said. I've never talked about people claiming it's not making developers obsolete. We are obviously extremely far from that. I'm talking about people who say it doesn't work to build basic features in their projects correctly.

Just take a look at this comment on a different topic, which lists all the pre-requisite for those AI models to work well, from the perspective of someone who has bought into the hype: https://news.ycombinator.com/item?id=48157235

If this is everything needed for an LLM to generate acceptable code, what is even the point of them?

harshitaneja

11 days ago

Maybe we come from different cultures and context is harder to grasp just in text so maybe for those reasons your response feels ruder than I hope it was intended to be.

I am sorry for not being clear in my response but I didn't intend to twist your words. I am not sure where I did so. My response was intended to be a more general remark on the kind of discourse on this topic I see and that I think both sides are right from the context they are looking in with and also why I think both sides come out of this discussion exhausted of the other. Not discounting presence of bad actors but generally I think there are most engaging in good faith like you are probably.

Coming specifically to respond to your last response, I don't think one needs all of these prerequisites to get value out of LLMs. In fact LLMs have helped me untangle some very messy balls of mud on projects where we previously deemed it not worth the effort and basically carried some codebases as legacy. Now we can write enough tests to feel confident and do a port against those tests, all in a span of a few days, which we found impressive.

Now having said all this, I think I understand your perspective a bit better on your original comment.

While it's a very versatile hammer, if it doesn't work for your use case that's all great. I just think that a bit more patience though with honing it maybe could help you find areas where it could work for you. If not, cheers!

vikramkr

10 days ago

That's a list of like 6 things. And each of those is a less complicated question than the seven thousand questions people throw at you when you complain about something not working right on a Linux distro or about speeding up build times for a new tool or configuring webpack or like pretty much any software tool. What lint rules are you using, are you using poetry or uv, are you running on Mac, Windows, Linux or WSL, how are your security groups configured in AWS - some tools are more plug and play, but it's quite the stretch to say that asking "how is your code organized, do you have your agents.md config file set up, do you have tests, and how large is the codebase" is some sort of unmanageable list of questions for a software engineer to think through when figuring out wtf is going on with some new tooling they're using

vikramkr

10 days ago

My take is there was one big inflection point around opus 4.5 when they got the agentic stuff working and now whether or not it works depends on whether your use case/area of software engineering is profitable enough for the companies to have spent a bunch of money generating synthetic data to RL on, or if it's similar enough to areas that they've done that for. With similar enough being a very loose constraint given how much overlap there is in a lot of coding fundamentals. Tbh if the models aren't working for you now I don't think they're gonna be working for you in 6 months

vikramkr

10 days ago

It's very real but probably very domain specific. It got really good at a lot of traditional web dev stuff, bash, sql, and writing one off scripts to accomplish random tasks (hence all the agent stuff taking off). And they got good at staying on task. That may not translate to game dev because from what I understand a lot of these gains are basically around post training methods driven by synthetic data generation etc (with potential caveats on how synthetic that data actually is lol). I wouldn't be surprised if the areas of code the llms are good at now are straight up just product decisions of where to allocate budget for generating those synthetic data sets, and game dev stuff might not be at the top of the list because the customer base for that might not be as big

sofixa

11 days ago

Counterpoint, I'm also vibecoding a game, and even before doing the "proper" setup (a good AGENTS.md, skills people have published for my chosen game engine, Godot), mechanically, the game was pretty spot on. It looked boring, so I used Claude Design to create a few mockups to choose from, chose the one I liked the most, and told Claude Code to redo the game UI with it.

Would have taken 10-100x as long for me to build it without AI and the AI version is probably better.

But yeah, I have enough knowledge to know what prompts are needed and figure out those “oh, I think it’s running slow or failing because of xyz” moments, and further prompt to improve it based on what I think it should do instead.

And I know where to make slight changes without burning my allotments.

LAC-Tech

11 days ago

"flash" or "fast" AI models are worse than useless at coding for me. they make my codebase much worse. It's a maintenance burden.

Gemini Pro on the other hand can be quite a pleasant experience.

righthand

11 days ago

I mean this blog post and many from this author are pure evangelism and marketing. Can you find anything critical or any dissent from this author about LLMs?

romaniv

11 days ago

> there’s zero chance any AI lab would train a model for such a ridiculous task.

A lot of people here stated that this is a ridiculous metric, but no one seems to remember that it was introduced in the initial GPT report ("Sparks of Artificial General Intelligence: Early experiments with GPT-4" [1]) by Microsoft about 3 years ago. Shortly after that it was parroted by a network of booster accounts and became a thing every clueless AI hype peddler does to "test" models.

100% marketing, 0% science.

[1] https://arxiv.org/pdf/2303.12712

godelski

11 days ago

For those curious, Simon's first public usage of it is Oct 25th, 2024[0]. While I'm not aware of any specific "pelican riding a bicycle" prompts being tested in a paper[1], the GPT paper did several SVG and tikz tests and the actual image is rather arbitrary. You wouldn't want to optimize for a singular image but also if you're doing halfway decent training a pelican riding a bicycle shouldn't be too hard to draw, and well... you can see several good examples if you look through different pages on [0].

[0] https://simonwillison.net/tags/pelican-riding-a-bicycle/?pag...

[1] I'm sure there is because of Simon's fame

joe_the_user

11 days ago

My own informal test when generative AI came out has been "a picture of an old man riding a bicycle over a river". I just ran it for chatgpt with the standard model I have (5.5). It shows the old man on an old bicycle with the bicycle on a slack line and the slack line extending over the river with a medieval village in the background.

The point is that the prompt has a subtle ambiguity - "how is the old man going over the river?". My sense is that most humans would quickly imagine a conventional bridge with a road on it leading over a river and with the river background being in an area developed enough to allow bridge going over it.

So the implication I draw is these things can find/generate stuff that roughly satisfies the conditions (and are getting better at this) but they still fail to add the assumptions that people would draw.

So my conclusion is that LLMs are getting better and better at what they do, but there are going to be places where they fail to satisfy common human assumptions.

_carbyau_

10 days ago

> but they still fail to add the assumptions that people would draw.

I have mixed feelings about this. I agree with the default assumptions you have as to "what people would draw", however what do you want from this cognitive automation?

Do you want, "what most people would do" or do you want "something creative, an outlier, that still satisfies conditions" ?

portmanteur

10 days ago

I would want to know the LLM has a reliable and realistic World Model underneath all of the next token prediction.

Whether I am building hardened engineering systems, or discussing cooking methods, or discussing sensitive health concerns, or navigating complex psychological and interpersonal issues, the model will inevitably have to make some assumptions about context I haven’t provided. I want to know that those assumptions are grounded in reality.

11 days ago

> The coding agents got really good

Ever since November 2025, the so-called "inflection point", I've been wondering for whom coding agents became "really good".

All I observe is that they got better at tool calls and at answering questions about big codebases, especially if the question has a vague pattern to search for, and they're super useful for that! For generating production code, even with a lot of steering and babysitting?

Absolutely not. Not quite there, not even close in my experience.

But we should stop talking in 1s and 0s, especially with marketing hype trains. There exists a gradient of capabilities that agents have, which really depends on the intricacies of the codebase you're working on. I think everyone has yet to discover how to better apply these tools in their day to day work.

But that totally collides with the current narrative, which flattens our work into something that is always the same and can be automated easily in each case. It's not!

That's why the debate is so polarized imo, there isn't a shared experience

kstenerud

11 days ago

The polarization comes from the very disparate coding experiences and output quality that different people find when using these tools.

For example, I've had the opposite experience of yours, generating very high quality work using Claude (such as https://github.com/kstenerud/yoloai). Just in dealing with all the bugs and idiosyncrasies in the technologies I'm using, the agent has been a godsend in discovering and cataloguing them so that the implementation phase doesn't keep tripping over them: https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe...

And the agents keep getting better all the time. Even in the past month I've noticed a considerable jump in its ability to anticipate issues and correctly infer implications as we build out research, design, architecture and planning docs. By the time it comes to coding, it's mostly a mechanical process that can be passed off to sonnet with a negligible defect rate.

Philip-J-Fry

11 days ago

I don't want to offend (it's AI coded anyway :)) but that does not scream "high quality" to me. The headline gif on that repo just paints a terrible picture. It can't draw a box correctly, there's random underscores all over the screen. The UI itself is just incredibly incoherent. I don't even know what I'm looking at.

Like, no it doesn't seem like very high quality work... It just seems like a vibe coded tool.

Edit: yes it's wrapping Claude. It's BREAKING the TUI. Not sure what people aren't getting here...

walthamstow

11 days ago

Take it up with Anthropic. It's actually their billion-dollar TUI product you're commenting on.

The problem with being such a naysayer is that you're entirely disconnected from what's going on. You haven't tried an agent like Claude Code and experienced it for yourself, so you don't recognise what it looks like when it's in front of you.

SlinkyOnStairs

11 days ago

There are two possibilities here:

1) This tool breaks the Claude TUI. Exactly as described by the comment.

2) The Claude TUI itself is broken. The comment is wrong, but assuming the "billion dollar TUI product" is capable of basic rendering and it's the wrapper that broke it, that is an entirely reasonable assumption

The fun here is that both of these pieces of software were made extensively using AI. No matter which of our options is the case here, the point stands. An AI-built product was shown, it looks obviously ass.

kstenerud

11 days ago

The issue is likely that the tmux session being generated is for some reason not propagating all term caps. Most likely it's an interop issue between tmux and docker and the image running under docker - possibly even something with the terminal client that the pipeline doesn't like somewhere.

Claude Code correctly reduces its display to 7-bit ASCII in response (still functional, although less pretty). Once I get around to fixing this, it will probably result in another section in https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe...

Edit: Looks like it's the terminal. That's a rabbit hole for another day.

You'll notice the incredible amount of vitriol resulting from a purely cosmetic bug (which, it turns out, results from a missing TERM env in the base image - Claude is very conservative when it can't determine utf-8 support with 100% certainty).
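For anyone wondering what a conservative check like that might look like, here is a rough sketch. This is not Claude Code's actual detection logic, which I have not reproduced here. It is just one plausible approach: treat a missing or `dumb` TERM as "assume ASCII", and otherwise look at the usual locale variables in POSIX precedence order (LC_ALL, then LC_CTYPE, then LANG). The function names are made up for illustration.

```python
import os
from typing import Mapping


def likely_supports_utf8(env: Mapping[str, str]) -> bool:
    """Conservative guess: only say yes when the environment clearly says so."""
    term = env.get("TERM", "")
    if not term or term == "dumb":
        # No usable terminal type (e.g. TERM missing from a base image):
        # assume the safest case.
        return False
    # The first non-empty locale variable wins, in POSIX precedence order.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if value:
            lowered = value.lower()
            return "utf-8" in lowered or "utf8" in lowered
    return False


def border_chars(env: Mapping[str, str]) -> str:
    """Pick box-drawing characters, or a 7-bit ASCII fallback."""
    return "┌┐└┘─│" if likely_supports_utf8(env) else "++++-|"


if __name__ == "__main__":
    print(border_chars(os.environ))
```

With a check like this, an image that sets LANG but forgets TERM still ends up on the ASCII path, which matches the symptom above.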

godelski

10 days ago

  > The Claude TUI itself is broken.

I mean this is also true. You forgot the third option, that 1 and 2 are true (and 4th, that neither are).

Seriously, the Claude TUI fucking sucks. I don't know how anyone thinks otherwise. It breaks constantly if you enter your editor (<C-g>), or resizing windows/panes, or making another pane full screen, scrolling, or any number of things. It is objectively a bad piece of software.

And honestly, are we surprised? Anthropic says themselves that a lot of code is written by Claude. They've been saying that for years. If you look at agents now and think "man, agents a few years ago sucked" then this shouldn't be surprising at all! I mean FFS the thing spits out text and they designed it like a fucking game engine. It is silly

Philip-J-Fry

11 days ago

I have tried Claude code. It doesn't look like that!

I don't know what the project is. All I see is a TUI that looks completely broken.

Go and use Claude Code right now. Does it look like that? Random underscores all over the page. No it doesn't.

walthamstow

11 days ago

It can look like that in certain conditions. The question is why are you so eager to give critique on unrelated work, appearing in a demo screencap, to someone who didn't produce it?

Philip-J-Fry

11 days ago

I don't know what you're talking about.

11 days ago

Do they also hold their hammer wrong when their TUI flickers for months?

embedding-shape

11 days ago

That's just poor engineering, product building and testing, same can happen with/without LLMs, no doubt.

knollimar

11 days ago

If the company making hammers can't hold it right, it suggests something about the hammers, no?

10 days ago

> I wouldn't trust a toolmaker who doesn't know how to use the tools decently.

I agree but would extend that qualification:

I wouldn't trust a toolmaker who doesn't know how to use the tools decently for exactly the same field of expertise.

godelski

10 days ago

  > No, it just means Microsoft is bad at products

FYI, that's what people are saying...

11 days ago

Hmm, ok, I think the penis in case is a bit distracting, can you de-analogize this to their real terms and tell me what this is supposed to mean and be related to developing with LLMs?

malfist

11 days ago

Just because you _can_ do something with a tool, doesn't mean it's the right tool for the job. Just because someone has contorted their entire process to adapt to a misshapen tool, and gotten good results, doesn't mean that's the right thing to do.

It is reasonable to both use the right tool for the right job, and demand better tools than you currently have. Success with the wrong tool in the wrong job doesn't mean it's the right tool for the right job.

embedding-shape

11 days ago

> Just because you _can_ do something with a tool, doesn't mean it's the right tool for the job. Just because someone has contorted their entire process to adapt to a misshapen tool, and gotten good results, doesn't mean that's the right thing to do.

Ok, I agree with this, don't use the wrong tool for the wrong job.

> It is reasonable to both use the right tool for the right job, and demand better tools than you currently have. Success with the wrong tool in the wrong job doesn't mean it's the right tool for the right job.

Yes, I agree with this too.

I'm still not sure how this relates to LLMs and in particular this specific context. I claimed that the output of your agents depends on the developer driving them. You're saying "not every tool is right for every job", I agree with this too, but is that against/for what I said?

Could you just clearly write out exactly what you're arguing for here, no analogies or metaphors, just plain and simple, because I still feel like we're having two different conversations.

gcr

11 days ago

They’re talking past each other. For some, “high quality” is a comment about implementation elegance. For others, “high quality” is about duct-taping crude implementations together to fashion a kickass user experience. To most, quality probably involves some convex combination of these.

my-next-account

11 days ago

I have used those tools, I don't think they're THAT good tbh :P

godelski

10 days ago

I use claude every single day at work. I've burned hundreds of dollars a week in tokens. But I still think you're being too defensive while attacking Philip.

I'm sorry, but you need to look yourself in the mirror. You didn't like what they said so you jumped to the assumption that they must not have used CC (or any other agent). That if they had, they would have the same experience as you did/do. But this whole thread is exactly that conversation, that those experiences aren't shared. That this assumption is baseless. And you know what? That's okay. We're not robots. We're human. Each of us has our own unique world we live in. It's okay that people don't have the same experience as you. It's okay that their favorite color, food, activity, or whatever isn't the same as yours. I'm glad that we live in that kind of world. That's what makes things like culture. I don't want to live in a hive mind, and I don't think anyone else does either.

vdelpuerto

11 days ago

That is the same fight the 2D animators were having with 3D animation 30 years ago. The resolution is likely to be the same: the tool wins but the fundamentals stay, and the line between competent and incompetent practitioners moves but does not disappear.

godelski

10 days ago

  > I don't want to offend (it's AI coded anyway :)) but that does not scream "high quality" to me.

Honestly, I think this is where the big divide is. People have massively different opinions on what "quality" is. Which is okay, but it feels like everyone is working under some assumption that quality is this very clear objective measure that we all agree on. Clearly we don't. We didn't before AI and well... if you can't tell that we don't with AI... you need to take a step back.

FWIW, I agree with Philip here. I don't think this screams "high quality" to me. I'm also not trying to take a shit on your project. Nothing screams "terrible" to me, but yeah, it does look a bit sloppy. There's no polish to it. It looks like someone that grades on "it works" and that's fine. But it also isn't everyone's cup of tea. Where the sloppiness comes in is like what Philip said. First thing I saw was the gif and well... I think Claude Code is sloppy. But this is also a great example of how and where LLMs visibly fail. Creating a box in text is pretty simple. There's tons of tools to do it. And the LLM 100% knows about characters like ⌜⌝⌞⌟⎜, it just doesn't use them and doesn't care. The code itself also looks very LLM generated.

It's fine and I don't think you have any reason to be ashamed of it, but I also wouldn't go around boasting that it is an example of high quality work either. And FWIW, I can't think of a single heavily LLM-assisted codebase where I don't have similar feelings. I've seen stuff with more polish, but yeah, they feel off.

  > TUI

This is a space I feel weird in. I love the terminal. I love that there's a lot of new TUIs. But it also feels very weird because it is extremely clear that a lot of these new TUIs were written by people (or machines) that don't really have a lot of experience in the terminal itself. There's a real shared language by people like me who live in the cli. There's a reason people like me can pick up a new tool and guess certain flags and certain ways to use them. It's because of a shared design language that we know of and we end up writing that way because we know it reduces the cognitive load on our peers. But the LLMs? They don't have that shared experience.

I think this is true for a lot of stuff, not just TUIs or bash tools. Things just smell... off...

kstenerud

11 days ago

You do realize that you're complaining about the Claude Code TUI, right?

That's not what this product is; merely a tool it uses.

11 days ago

> I did not much more than a cursory glance too, but found "./sandbox/create.go", a ~1300 lines long file with so much duplication even within just itself that I stopped counting.

Really? What duplication did you actually find? I count a few small ones in buildMounts and ReadPrompt, maybe 20 lines or so, but hardly anything worthy of such an epithet.

Admittedly, the parsing & escaping code and some utility functions could be moved outside to shrink the file, but otherwise I'm having trouble finding issues with the code.

embedding-shape

11 days ago

The duplication I'm seeing isn't just "same text repeated" but structural duplication. Doing a quick 5 minute look again just to give you some pointers; runtime.MountSpec construction in buildMounts, Workdir vs aux-dir mount-mode handling, repeated one-off mount append blocks, overlay detection and so on, the list goes on. Just those should account for 200+ lines.

Look for slight variations of the same thing but with different paths, variables, or modes and I think you'd be able to spot the rest as well.

kstenerud

11 days ago

You consider adding in-place constructed items to an array to be code duplication?

freedomben

11 days ago

I've noticed that the bar for "quality" when people judge AI is often significantly higher than what they'd hold a human to. I'm not saying GP et al are doing this (I haven't looked myself), but it is a widespread pattern I've noticed both professionally and personally. I don't know why it is.

16bitvoid

11 days ago

The bar isn't any higher. There's just no grace given. No one is judging a hobby project made by a human on quality, and the person who the hobby project belongs to will rarely say that their code is high quality. And in a professional setting, I think people are fine with "good enough" but they're not going to claim anything is high-quality.

But people are so quick to label their vibe-coded codebase as high quality and no grace is going to be given to a machine.

What comments are you seeing that are calling code from humans high-quality?

whateveracct

11 days ago

Grace shouldn't be given though. The code from vibe coding should pass the review bar as-is. If you need to iterate, you've defeated the purpose.

Because the end result is people committing bad code. For some random hobby project, sure who cares. But people are using this at work. The codebase is rotting in a new innovative way.

Either the bar has to be set at "actually good code comes out of vibe coding" or you have to accept that codebases are going to steadily become less usable by human coders who use their fingers to type in emacs.

Suddenly every dev needs an agent to even work with the slop. Seems like an outcome Anthropic would love though....

breuleux

11 days ago

People who use AI set the bar themselves when they claim they generate "very high quality work using Claude". Humans more rarely make such claims about the code they write themselves, but when they do, I expect they face similar scrutiny.

AI code is competent, but it's not great or high quality unless you have a good enough eye for quality to steer it with an iron hand. But if you do, you know the quality comes from proper guidance, so you still wouldn't say AI code is great. If you do say exactly that, it comes across as having low standards (which is fine if you own it) and people are going to jump on that just to bring you down a peg.

ThrowawayR2

11 days ago

> "I've noticed that the bar for 'quality' when people judge AI is often significantly higher than what they'd hold a human to."

Because that is literally the hype being fed to us by the marketers at the AI companies and HN users promoting AI.

- AI promoters: "AI is doing Ph.D level work! LLMs are not just a token predictor, it is actually thinking and reasoning! It will replace all developers, including _you_, so get on board the AI hype train now!"

- AI promoters when confronted with blatant mistakes and reasoning errors from cutting edge models: "Why are you holding LLMs up to higher standards than humans? That's not fair or reasonable."

kenjackson

11 days ago

I have seen it too. The answer is easy - they don’t like AI. I've seen similar things with some people that don’t like women in tech or certain minorities - they suddenly critique at an extremely high level. I also haven’t looked at this particular case, but it wouldn’t surprise me to be the same thing here.

embedding-shape

11 days ago

> I also haven’t looked at this particular case, but it wouldn’t surprise me to be the same thing here.

Be surprised then, because I, the one who left the critique, have probably programmed exclusively with agents for the last year or so, so it's unlikely that I think the code is bad because I "don't like AI". I don't love it either, but I wouldn't call myself an AI-hater by any measure; it would be weird to write articles like this if so: https://emsh.cat/en/one-human-one-agent-one-browser/

kenjackson

11 days ago

Again, I wasn't reacting specifically to you (as noted, I wouldn't be surprised if so, but I also wouldn't be surprised if not). I was making a more general statement.

TurkTurkleton

11 days ago

Dude, are you for real? We've had the supposed inevitability of AI rammed down our throats since the minute LaMDA convinced Blake Lemoine it was sentient, we've watched CEOs hype up AI as if it were production-ready while it was still barely beta quality, LLM-driven chatbots have been stapled to the side of every product no matter how little sense it makes since OpenAI published an API, and we've been told to prepare for the inevitable "agentic future" even as Claude 3.5 had to have its hand held more than a wet-behind-the-ears freshman summer intern. We're told that this technology is going to eat the entire world economy and render human labor obsolete, starting with our jobs, but if it's genuinely supposed to do that, I think it's more than reasonable to expect it to write superhumanly perfect code, not just code that's incrementally better than the last model release but still bad; extraordinary claims require extraordinary evidence, after all. To liken AI skepticism to the obstacles faced by women and minorities in tech is a category error that trivializes actual human struggles against human prejudices.

everforward

11 days ago

I looked through and there's a bunch of stuff that's poor coding practice.

E.g.

https://github.com/kstenerud/yoloai/blob/main/internal/fileu... <- that recursively creates directories, but will only change permissions on the innermost dir (user may be unable to cd into intermediary directories)

https://github.com/kstenerud/yoloai/blob/main/internal/mcpsr... <- all the json.Marshal calls in this file just suppress errors, so if anything un-marshallable ends up in there the app will return empty strings with no errors logged

https://github.com/kstenerud/yoloai/blob/main/runtime/regist... <- `Register` embeds a copy of the code from `IsAvailable` because of the locking; that could be replaced with a private `isAvailable` that has no locking that both use (after doing their own locking)

https://github.com/kstenerud/yoloai/blob/main/runtime/exec.g... <- these functions are identical except for the strings.Trim, one should just call the other and then trim the output
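The locking suggestion for `Register` and `IsAvailable` can be sketched roughly as follows. This is not the repository's code: the `Registry` type, its fields, and the "already registered" behaviour are assumptions made up for illustration. Only the shape matters, which is an unexported helper that does no locking, called by exported methods that each take the lock they need.

```go
package main

import (
	"fmt"
	"sync"
)

// Registry is a hypothetical stand-in for the runtime registry.
type Registry struct {
	mu       sync.RWMutex
	backends map[string]bool
}

func NewRegistry() *Registry {
	return &Registry{backends: make(map[string]bool)}
}

// isAvailable holds the shared lookup logic and does no locking itself.
// Callers must already hold r.mu, for reading or for writing.
func (r *Registry) isAvailable(name string) bool {
	return r.backends[name]
}

// IsAvailable takes the read lock, then delegates to the helper.
func (r *Registry) IsAvailable(name string) bool {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.isAvailable(name)
}

// Register takes the write lock and reuses the same helper, so the
// lookup logic exists in one place only.
func (r *Registry) Register(name string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.isAvailable(name) {
		return fmt.Errorf("backend %q is already registered", name)
	}
	r.backends[name] = true
	return nil
}

func main() {
	r := NewRegistry()
	fmt.Println(r.Register("docker"))    // first registration succeeds
	fmt.Println(r.IsAvailable("docker")) // now visible to readers
	fmt.Println(r.Register("docker"))    // second registration is rejected
}
```

Calling the exported `IsAvailable` from inside `Register` would try to take the read lock while the write lock is held, which deadlocks with `sync.RWMutex`; that is presumably why the logic got copied in the first place.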

Just out of curiosity, I enabled some other linters and it looks bad. Excluding test files, there are 110 functions with a cyclomatic complexity over 10 and 7 that are _over 50_. The worst is at 86, which is mind-boggling.

Could probably find more, but you get the drift. I'm sure it runs, but stylistically this is more along the lines of what I would expect an intern to do.

This is also sort of nit-picky, but like half the stuff in https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe... isn't idiosyncratic, it's just the way those things work and a lot of them aren't even tricky. The one linked is particularly blatant; that's not limited to os.Stat, that's literally just how permissions work. Denying permission on inodes is a property of the folder, not the file.

11 days ago

Feel free to open a bug report if it bothers you. Or a PR.

Or feel free to avoid the tool entirely if this UI issue shakes your faith in its overall quality down to its very foundations.

This is hardly a hill to die on.

sjagauanbdvva

11 days ago

You’re missing the point.

You claimed high quality and provided a repo.

Did you not expect someone to actually look and critique it?

Whether the visual bugs are a deal breaker or not isn’t the point.

The point is that it’s not high quality code. It may work, but it’s not code I would ship at my job, and therefore it’s not high enough quality for anyone serious.

kstenerud

11 days ago

Hey that's fine. You're free to make whatever judgment you wish.

But I still stand by the quality of my code, including here. You and I don't need to agree.

What decades of managing codebases (public and private, huge and small) has taught me is that there will always be an endless list of bugs and feature ideas and nice-to-haves and technical debt pressures in any given project. You'll never get to them all, so you prioritize (as I have done here). Functional bugs usually trump visual ones unless they're actually interfering with work.

Will I fix this bug? Probably, now that I'm aware of it. But there are more important matters to attend to first.

Edit: Turns out the bug comes from a mismatch with the terminal I'm using. With other terminals it looks fine. Term caps are surprisingly complicated, especially when you have multiple layers!

gilrain

11 days ago

> But I still stand by the quality of my code, including here. You and I don't need to agree.

You aren’t having a disagreement with a person. You’re having a disagreement with reality.

kstenerud

11 days ago

> You aren’t having a disagreement with a person. You’re having a disagreement with reality.

How so? Are you going to instruct us all on how a termcaps mismatch bug is an indicator of poor code quality, rather than an unfortunate bug emerging from within the chaos of the many layers of disparate technologies that must somehow be stitched together (along with their idiosyncrasies) in order to make a project like this work?

sjagauanbdvva

11 days ago

Because you won’t listen to a word anyone says lol.

You had a visual bug right at the top of the repo’s README. Then insisted you hadn’t noticed it before.

What’s important is not that specific visual bug, it’s what that bug says about the rest of the code.

How can we believe that this code is high quality if we see a glaring issue 5 seconds into opening the github?

We didn’t seek out your repo and start lobbing critiques at it. YOU POSTED IT as an example of high quality generated code. I’m telling you I am unimpressed

kstenerud

11 days ago

> you won’t listen to a word anyone says

Really? So the discussion leading to the theory that there's likely a problem with termcaps disparity between layers didn't happen?

> What’s important is not that specific visual bug, it’s what that bug says about the rest of the code.

Really? So you can tell from a single cosmetic bug which doesn't affect its ability to perform its task, that the rest of the codebase is deficient? That's a pretty damn impressive skill!

Haters gonna hate, I guess ¯\_(ツ)_/¯

The otherwise timid pack always circles after they sense a single drop of blood, no matter how small and insignificant.

eudamoniac

11 days ago

Dude, you have a glaring visual bug that is immediately obvious, as the first thing shown in the repo, and also would be seen every time you tested the tool, but you didn't notice it at all. That does not bode well for you noticing other aspects of quality in the tool. Maybe that's the only quality issue, but we all very seriously doubt it.

user

11 days ago

[deleted]

hiw2d

11 days ago

tbh you've embarrassed yourself here.

kstenerud

11 days ago

THAT must be why the stars are going up!

Thanks for explaining it for me.

godelski

10 days ago

Sorry dude, 84 stars isn't that much. It's a good number, be proud, but I wouldn't go boasting about it like you're some hotshot.

https://www.star-history.com/?repos=kstenerud%2Fyoloai&type=...

gilrain

11 days ago

And as we know, GitHub Stars are the same as truth. Very persuasive.

andai

11 days ago

I think you can fix that by setting an environment variable (regarding the terminal?) but it was a while since I checked. (I was running Claude as a subprocess and had similar issues.)

Also this reminds me of a principle I learned from a mentor. "People are visual buyers. If it looks good, people will think the code is good."

Unfortunately it doesn't matter whose fault the janky TUI is, people will see that and associate it with your software.

kstenerud

11 days ago

It's more along the lines of: Anyone with an axe to grind will find something to grind it on.

Early stage products will have some rough edges. We've seen that in Docker, Kubernetes, AWS, Azure, LXC, KVM, etc. And people griped and raged about the sheer incompetence of the maintainers and utter lack of quality, but they still used those tools even before the rough edges were polished away and folks finally settled down.

The less one pays for something, the more entitled one feels to whinge and heap on abuse.

I've been down this road so much now that it's no biggie if a few Karens want to blow off steam at my expense. I'm not above exposing their silliness though ;-)

wasabi991011

11 days ago

> Early stage products will have some rough edges. We've seen that in Docker, Kubernetes, AWS, Azure, LXC, KVM, etc.

Is your product really the same complexity as these?

kstenerud

11 days ago

11 days ago

I haven't done any CSS/HTML/JS level work with Claude yet. I've mainly been using it for systems level stuff.

LLMs have traditionally had problems with visual rendering (the good ol' pelican on the bicycle test). I wonder if this is more of the same?

timr

11 days ago

In this case, the visual display was fine -- I was instructing it to fix bad code from a previous round that happened to deliver the right results.

Like I said, this is just an example that happens to be CSS. I see this stuff daily, if not hourly.

freedomben

11 days ago

Great example.

Just IME, the quality of the prompt often significantly affects whether it does bad stuff like your example. It's not easy by any stretch and I'm still getting there, but I'm up to a couple dozen or so "Agent Instructions" in my CLAUDE.md files for various projects that have to say things like: "when doing TDD, don't write tests to verify bug fixes in tests" because the agent is really good at following things literally. I am sure it will continue to improve, but until then every project needs some bandaid things like that.

kstenerud

11 days ago

That's interesting. As I said I haven't tried using LLMs at this level, although I'm about to embark on some this week.

What I've found helps (at least at the other layers) is to have principles documents and standards documents for the AI to reference when it's modifying code. Principles documents describe the why, and standards documents describe the how.

So for example a few parts from my initial CSS-standards.md (still needs a lot of revision):

    ## Utility-first discipline

    **Raw utilities everywhere by default. Never `@apply` for "components."** `@apply` exists only for
    true low-level primitives that can't live in a template (e.g., `prose` overrides, embedded
    third-party widget shells).

    Wathan's stated position: extract only on "worrisome duplication." The Tailwind team explicitly
    describes `@apply` as a tool you reach for after first reaching for templates. **Premature CSS
    abstraction is the failure mode.**

    ## Spacing

    Use only the default scale (`0, 0.5, 1, 1.5, 2, 3, 4, 6, 8, 12, 16, 24…`). **Never `p-[13px]`.** If
    you need a value, change the scale in `@theme`:

    ```css
    @theme {
      --spacing: 0.25rem;
    }
    ```

    v4 uses a single `--spacing` multiplier; everything derives from it.

    ## Anti-patterns (banned)

    - **`!` important prefix** (`!bg-red-500`). Fix specificity properly.
    - **Arbitrary values for colour** (`bg-[#1da1f2]`). Define in `@theme`.
    - **Arbitrary pixel offsets** as default (`top-[3px]`). Use the spacing scale. Tolerated only as
      rare one-offs.
    - **Nested custom CSS more than one level deep.**
    - **`@apply` for any class that wraps fewer than ~5 utilities** or appears in fewer than ~3
      templates.
    - **Dynamic class string interpolation** (`text-${level}-500`) — purger can't see these.
    - **Custom breakpoints in v1.**
    - **Inline `<style>` blocks.** All CSS goes through `assets/css/app.css`.

11 days ago

> There is no way I could have reverse engineered this myself from compiled C++ code and/or packet captures! The format isn't self-describing and is incredibly dense (similar to NetFlow). In a hex viewer it looks like line noise!

I think you could have. However, I don't think you would have; there is a big difference. It is a lot of work to do that, and people who try normally give up. However, if your boss told you to, you could have. Note that I suspect from your story this is more like: give this to a dozen people and in 2 years you get results, at a cost of several million dollars.

jiggawatts

10 days ago

Of course, there is a philosophical difference between "impossible" and "intractable", but to a business with a budget and a schedule there isn't.

paulluuk

11 days ago

This is a pretty wild take. What percentage of human engineers are creating novel solutions for hard problems, you think? I work in R&D and even my work is 90% doing things that other people already solved. If you are really doing cutting edge SOTA work that has never been done by another human in some form or another, then kudos to you and I want your job.

hollowturtle

11 days ago

> What percentage of human engineers are creating novel solutions for hard problems, you think?

IMO Every engineer should try spending his time in a company that tries to solve new problems.

Otherwise we will be stuck, as we are now, with big tech paying you mountains of money for doing nothing, incentivizing you to embark on useless activities so that other managers can have a career, to fear layoffs, and, when they happen, to complain that "I've been looking for a new job for a year" while expecting the same compensation and environment. Web development jobs are particularly affected by that.

In the game industry, for example, if you don't do something interesting your game won't sell a copy.

Let me stress this again: if LLMs get you 97% there, maybe you should try another idea.

pell

11 days ago

> IMO Every engineer should try spending his time in a company that tries to solve new problems.

Yet typically 95% of software developers mainly work on CRUD-type apps. Coding agents are not perfect there either but they’re really a lot more reliable than they were a few months ago.

infecto

11 days ago

Please, you don’t need to stress anything. I think you are conflating ideas.

Unique game loop ideas make a good game; that has very little to do with the engineering. This is true for most software engineering products. Most engineering work is just reinventing or reimplementing existing ideas; what you describe rarely exists. It may exist in the sense that the people learning the new ideas think they’re novel, but very little is truly unique.

kstenerud

11 days ago

The comment was directed at:

> For generating production code even with a lot of steering and baby sitting? Absolutely not, not quite there not even close in my experience.

11 days ago

Absolutely! I find its test generation, properly steered, to be top notch. In many ways it's like having a second head, because it'll spontaneously come up with test paths that I'd normally only get to after a month or so in one of my "aha! What about XYZ?" shower thoughts.

You'll also notice that Linus doesn't poo-poo AI at all. His only gripe is with people using it wrong, such as flooding security lists with drive-by security reports after pointing their agent to the code and saying "find me some VULNS!!1!1!!"

hollowturtle

11 days ago

> The code I get from LLM's is usually much better than what I get from my peers

Then you should seriously question who you're working for imo.

> It also isn't lazy.

It is indeed lazy in my experience, as in being overly zealous when creating useless test cases and ignoring the important ones. I don't want it to test a sum; I want a test that can "guarantee" me that a further change doesn't break existing code. And producing this level of quality in tests is HARD, and requires a lot of steering with agents. This culture of test code coverage is just wrong: the best codebase I worked with had code coverage only on the net percent of code that matters, and the rest was covered by static type checking and integration tests.

Starting with a prompt, or in plan mode, is not how I trained as an engineer. I cannot foresee what something should be or look like until I explore it myself with code I can relate to, that I'm connected with and that I fully understand. For example, my muscle memory suggests a specific data structure only after I see some code patterns emerging. It's hard to explain; hopefully it makes sense.

If I ask the agent to do that initial exploring, even with a tremendous amount of instructions, guidelines etc., it usually starts with a path I wouldn't have started with. What I tried in such cases is to stop it, correct it and generate again, only to end up with more prompt words than lines of code. This is true for every visual task I'm working on (I program non-web UIs). Let alone doing it via spec files: if it's something I don't care about, yeah sure, maybe a little tool for entering/editing data, but alas it always defaults to slop web apps, and I get it, I mean most of the training set is web apps.

travisgriggs

11 days ago

> quality code

Probably where the mismatch is in this discussion. The measure of what is quality code is all over the place. For some, some form of "good enough" is quality. And for others, metrics like terseness, readability, vacuous amounts of comments, cleverness, various fuzzy measures of "idiomatic", etc, make "quality code" much more of a moving target.

dasil003

11 days ago

I think this depends a lot on the task, the existing codebase, and the taste of the operator.

In general I tend to agree with you: if you're talking about a codebase you are deeply familiar with, the value-add from having agents write the code probably ranges from very small to negative in most cases.

On the other hand if you're trying to make changes in systems you are not familiar with, LLMs are a huge speed boost to folks with enough experience to sniff out what would be a bad path essentially via socratic method to the agent.

Obviously there are no silver bullets and no substitute for judgment. I will say though, I'll trade off ugly local code for good data models and interfaces any day of the week, and there is definitely an archetype of engineer that is very precious about code without good judgment on where it matters and where it doesn't.

netcan

11 days ago

Coding goodness is just "unevenly distributed."

Irl, (a) different people's ways of working with ai are a million little islands and (b) bottlenecks vary enormously by coder and codebase/task.

Also... I think our era has an intrinsic bias that change=progress, productivity, etc.

Take the "networked computing revolution" of 1990-2000. These computers did land on every desk and every pocket. They are administration powerhouses. Excellent for all manner of administration tasks.

But... what this netted out to is "change." We send a lot more emails than we did letters. We communicate a ton. Secretaries went extinct. But "administration" grew.

A university faculty typically has more admins. Companies hire more accountants, HR, project managers, etc.

Maybe administration was never really a bottleneck.

Code has a lot of this. Everyone has a road map, wishlist, etc. It appears as though "code capacity" is the bottleneck. But maybe most of those companies can't really generate much more value from more software.

Anecdotally, it seems that many mid-tier shops are migrating/ modernizing their stack, and suchlike.

I haven't heard of many belting out features, and increasing prices or sales.

Most bottlenecks are upstream of another bottleneck. Few are a "dam."

ncruces

11 days ago

I don't know that there was an inflection point. I know that, over the past year, they definitely became useful to me as more than auto complete.

My most recent pet project is a transpiler from Wasm to Go, and I find it incredibly impressive that recent models (I've used Sonnet, Opus and Gemini, far more successfully than GPT) are able to just pick up the project and work at all these levels:

- Go code that implements the transpiler (parsing Wasm, building an AST)

- Go code that gets generated by serializing the AST to a .go file

- Go code that manipulates the AST (to optimize it), and its effect on the generated code

- Go code that's grafted to the generated code (to implement more advanced opcodes) and how to interact with it from the AST

- C code that gets compiled to Wasm, then translated to Go, then called by Go

- Go code that gets called by this C code to implement a C stdlib

- WAT and WAST files that are used to implement the Wasm spec tests

I find this impressive because I have to think hard about all these levels, and I feel many programmers would have a problem with this.

And it's very often way easier for me to just write: "I want to generate this code, build me the AST that does it", than to go "count parentheses" in the Go code (I do have some LISP experience; it's still easier).

Feel free to scrutinize/criticize the code. Not vibe coded, but plenty of GenAI help.

https://github.com/ncruces/wasm2go

epolanski

11 days ago

> But we should stop talking about 1s and 0s

I agree, but you contradicted yourself just one line above.

> For generating production code even with a lot of steering and baby sitting? Absolutely not

Moreover this is further in contradiction with several facts:

1. the majority of this industry has always been composed of mediocre/bad developers, often unable to write a fizz buzz

2. the majority of work in this industry is implementing mundane CRUDs to move and transform trivial data across the organization's stakeholders and/or customer or third parties

3. there's lots of stellar and respected engineers leveraging the tools on a regular basis even on problems that are far from trivial and outputting quality code much faster than they would've done otherwise. Mitchell Hashimoto has blogged about it in his work on Ghostty, Sanfilippo has blogged about it in his work on Redis and so did plenty of others. I know several open source stellar developers who benefitted greatly from these tools, yet you think it cannot improve the quality and output of the most mundane tasks out there?

hollowturtle

11 days ago

>> But we should stop talking about 1s and 0s

> I agree, but you contradicted yourself just one line above.

>> > For generating production code even with a lot of steering and baby sitting? Absolutely not

With this last sentence I obviously meant in my experience; it's not that hard. I don't buy your facts: they are highly biased towards web development. It's a common mistake here on HN to think that's the totality of the industry; luckily it's not.

epolanski

11 days ago

I've quoted you two tools (Ghostty and Redis) whose development now regularly uses AI assistance to deliver production code. I quoted those because their authors shared their experiences, the strengths and the limits of the tooling.

There's many more, from Flask to Docker, from Ruby to FastAPI or Tanstack. LLVM has integrated AI-generated PRs, so did Swift and Mojo. Sasha Levin has pushed into Linux Nvidia-related kernel changes that were authored by LLMs in 6.15. You can be certain there's an order of magnitude more where people don't admit or tag their PRs as AI generated or co-generated.

In fact I am quite confident that projects and developers that are not leveraging the tools are increasingly rare. There's really no reason in 2026 to write a non-trivial PR and not ask an AI tool for a cheap review.

The industry is changing, and I don't really like the trends I'm seeing, but to state that LLMs cannot and are not writing production code, very often quality code (especially when used, set up and overseen properly), is plain denial.

Your anecdotal experience isn't relevant, especially when applied to the largest parts of the industry, composed of mediocre developers working on terrible codebases.

hollowturtle

11 days ago

You cited mostly web tech, which proves my point ;) That antirez uses agents extensively to contribute to Redis doesn't mean it's becoming an industry trend. I'd say quite the contrary: it isn't in the gaming industry, for example, where novel ideas matter. And btw antirez and Linus, for example, put a lot of effort into steering agents into doing the right thing for them, which is totally different from "these tools just became good".

epolanski

11 days ago

Half the projects I listed are systems programming related.

In general you do seem to be unaware of the trend.

And I want to stress this: I'm not stoked for the trend or changes, but I'm not blind either.

hollowturtle

11 days ago

> Half the projects I listed are systems programming related.

No, they're not, and those that are remain overwhelmingly under the control of engineers who continuously steer the agents in the right direction. First of all, this isn't something you can do for novel ideas, especially in gaming; second, the code they produce is indeed very bad, otherwise it wouldn't require that much effort from high-end professionals to bend the LLMs to their will.

11 days ago

> The code it generated hat 0 compiletime errors

And no spelling errors either!

Also,

> Really? What duplication did you actually find? I count a few small ones in buildMounts and ReadPrompt, maybe 20 lines or so, but hardly anything worthy of such an epithet

>> embedding-shape 1 hour ago | root | parent | next [–]

>>The duplication I'm seeing isn't just "same text repeated" but structural duplication. Doing a quick 5 minute look again just to give you some pointers; runtime.MountSpec construction in buildMounts, Workdir vs aux-dir mount-mode handling, repeated one-off mount append blocks, overlay detection and so on, the list goes on. Just those should account for 200+ lines.

If you don't see any errors or problems, is it because there aren't any problems to see, or because they take a trained eye to spot?

Glohrischi

9 days ago

I'm not a native English speaker, and when I mentioned that I might use an LLM for fixing spelling, people argued about the use of LLMs. So spelling errors, yes or no?

I do not understand the quote you reference at all, tbh?

keybored

11 days ago

I don’t see how “fun projects” and “take our jobs” fit together in any voluntary sentence.

Glohrischi

11 days ago

Firstly, I wrote examples but also "etc.", so it's more than just that. It is also refactoring, CI/CD pipelines and so on.

2 years ago when I prompted something, it had compile time errors left and right. Took me 3-10 iterations to even get it running.

Now it's one-shotting a lot, including websites, refactorings, etc.

The question is what is missing? How far are we from it handling huge code bases vs. smaller ones? How far are we from it comprehending the whole architecture, so that it doesn't try to put a service in the wrong place just because the context is too small?

Mythos is 10 Trillion, that might be already pushing it.

95% might not be enough for someone, in the sense of: "Yeah, I can't do the 95% and I can't do the 5% either, so either the AI can do 100% or I still need Kevin with his knowledge, even if it's just for the last 5%."

keybored

11 days ago

What I’m saying is that I won’t do, as a hobby and for fun, something that helps strength train my chronic unemployment. That’s a me-issue.

hiw2d

11 days ago

"We could already be at 95% at 'ai will take your coding job' without knowing because these 5% are so relevant."

This is nonsense. I'm not a SWE but a CEO; if that were true I'd be firing without a hitch. And yet this is not the activity we see. Why is that? Perhaps merely writing code is not the entire job.

Glohrischi

11 days ago

I wrote coding job. And it's true for coding jobs.

Your Product Manager is not a coding job. Your Product Owner is not a coding job.

vibe-kanban exists; you could already do a proper experiment, letting your PO maintain a vibe-kanban board with proper requirements, and see how an agent progresses.

But 5% is often enough to be what breaks it. It doesn't help much when your PM, PO, CEO or CTO have no clue about coding harnesses, coding agents, coding platforms, LLMs etc.

hiw2d

11 days ago

I don't have PMs or POs in my firm, fella.

I'm hyper efficient. You clearly are not and are full of it.

If you're only doing 5%, you should only get paid for that. lol. Are you happy to take a salary drop?

Glohrischi

11 days ago

What's wrong with you? Why the change of tone?

voncheese

11 days ago

+1 to all of this. The challenge can be staying focused and thinking when the AI assistant is (1) moving very fast and (2) often times doing multiple things at the same time.

It's not the first time I witness this kind of discrepancy and probably not the last, I just learned to adapt to it.

rconti

11 days ago

I'm moderately horrified every time claude runs the same broken, YOLO SWAG git commands from stackoverflow, gets errors, tries a few more things, then finally figures out how to commit and push correctly.

JodieBenitez

10 days ago

> Absolutely not, not quite there not even close in my experience.

Well... I don't know what you expect but so far I'd like all my colleagues to write code at the level of what I get from codex.

sroussey

11 days ago

Long term, it can be better to slowly refactor parts of your code base into the way the model expects it to be. Sometimes fighting the gradient of code’s uniqueness vs expectation is not worth it.

Razengan

11 days ago

> It's since November 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".

You can dig up my past comments semi-arguing with simonw where I said AI just isn't good enough yet, but lately I've been using Codex mostly just to review existing Godot/GDScript code: https://github.com/InvadingOctopus/comedot

and now I'd say that in this day and age one would have to be dumb to not use AI in SOME way :)

It's helped me catch a lot of bugs that would have taken me a long time to even notice on my own. I guess it helps that the project is modular enough where most files can be considered standalone, with just 1-2 dependencies and well-commented already, so the AI can look at each file on its own one at a time. You can see the AGENTS.md I use on that repo.

Most of my productivity in the last 3 or so months has been thanks to AI, though none of the code there is AI generated. I even bought a MacBook Neo just to use as an "AI thin client" while on travel, even though I already had a beefy MacBook Pro M2 Max that I just keep at home/hotel as a desktop now. Codex's recent remote control features have made it more useful for the moments when I get a cool idea while out at a cafe or on a walk.

I don't just copy-paste the AI's output, because it's often inefficient anyway (like creating redundant variables/functions), but I find its findings useful for manually cleaning up my shit. Maybe their training data is not that good with GDScript yet which is a bit of a jank language anyway.

So my core code is wholly made by meat, but I do have fun now and then telling Codex to make experimental games using only the library of modular components I have written so far, to test my framework and also the AI's abilities. This kind of work seems like a surprisingly good match for AI: It just has to put existing blocks together, that already have well-defined interfaces/contracts etc.

I've been on the $20 ChatGPT plan for about a year now, and only started using Codex since like maybe 4 months ago, almost always on the latest model with "Extended Thinking" or "Extra High", because I want my shared code to be as correct as possible because everything else I do depends on it, and I only hit limits like 2 times in the last 3 months.

Claude on the other hand, terrible: https://i.imgur.com/jYawPDY.png

Grok is OK for general stuff, never tried it for coding.

Gemini's UI/UX and lack of privacy and the AI itself is so terrible I tried it just maybe 2 times ever...and it refused to work on Google's own Flights website and reverse image search! (it told me to do it myself)

Deepseek refused to talk about Taiwan or Tiananmen Square so I'm not sure if I can trust it for anything else lol

maccard

11 days ago

> I've been on the $20 ChatGPT plan for about a year now, and only started using Codex since like maybe 4 months ago, almost always on the latest model with "Extended Thinking" or "Extra High", because I want my shared code to be as correct as possible because everything else I do depends on it, and I only hit limits like 2 times in the last 3 months.

I've recently tried codex, and I have it set to plan mode with 5.5 and I'm hitting the limits on a single task on a "medium" sized codebase.

Razengan

11 days ago

Like I said most of my prompts cover 1-3 files at most, rarely more

hollowturtle

11 days ago

Thanks for sharing your experience! I totally agree that if you "own your code", as in you're invested in it, coding it and documenting it, these tools can be really valuable for review, bug fixing and maintenance, it pushes you to do better, maybe one piece at a time like you said with a good modularized codebase. I think more devs should share experiences like that, we should overthrow marketing and people narratives that "don't code anymore since X"

jaccola

11 days ago

I set up a hook that reviews every commit (asynchronously), highlights potential bugs, and writes a report to a dir.

Then I have a script that summarises those reports, which I usually run before pushing or at end of day.

Works quite well for both improving my code and the code ai wrote.
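
A rough sketch of what the summarising half of such a setup could look like. Everything here is an assumption for illustration: that the hook writes one Markdown report per commit, that findings are "- " bullet lines, and the function names themselves. The commenter's actual script is not shown.

```python
from pathlib import Path


def load_reports(report_dir):
    """Read every *.md report in report_dir into {filename: text} (hypothetical layout)."""
    return {
        p.name: p.read_text(encoding="utf-8")
        for p in sorted(Path(report_dir).glob("*.md"))
    }


def summarise(reports):
    """Collect the bullet lines ("- ...") from each report, grouped by report name."""
    lines = []
    for name, text in reports.items():
        findings = [
            ln.strip()[2:] for ln in text.splitlines() if ln.strip().startswith("- ")
        ]
        lines.append(f"{name}: {len(findings)} finding(s)")
        lines.extend(f"  * {finding}" for finding in findings)
    return "\n".join(lines)
```

Running `print(summarise(load_reports("reports/")))` before a push would then give a one-screen digest of what the reviewer flagged, where `reports/` stands in for whatever directory the hook writes to.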

psadauskas

11 days ago

I first started noticing they were actually useful around Dec 2025, through about February. I got pretty good at using them, and was amazed at their utility, especially Claude and Codex. Then sometime in March, they got really frustratingly dumb. Things that they used to get right in one shot suddenly took several tries, and I had to watch them like a hawk because they constantly made stupid mistakes, not following instructions that previously worked. I had one try to fix a failing test like this:

    assert_eq x, true if x == true

Both Claude and Codex, both with the latest versions and the original versions that had been working.
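
For anyone wondering why that "fix" is a problem: the assertion is guarded by the very condition it checks, so it can never fail. A minimal Python sketch of the same pattern, with hypothetical function names:

```python
def vacuous_check(x):
    """Mirrors the quoted line: only assert when the condition already holds."""
    if x == True:  # the guard makes the assertion below unreachable on failure
        assert x == True
    return "passed"


def real_check(x):
    """An unguarded assertion, which is what the test should have kept."""
    assert x is True, f"expected True, got {x!r}"
    return "passed"
```

`vacuous_check(False)` returns "passed", while `real_check(False)` raises `AssertionError`. The test goes green without the underlying bug being touched.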

Now I just use deepseek. It isn't any dumber, and it costs way less.

prettyblocks

11 days ago

I'm curious. What have you actually tried? Are you just prompting the LLM with one-off tasks? For good results, you need to take the time to read the documentation for the harness you are using and configure your environment. This tuning can take dozens of hours to nail down. Then there's the actual approach for working on your projects. Many people that have good results with agentic coding actually spend the bulk of their time in plan mode where they go back and forth with the LLM designing a granular playbook for the task at hand before they ever have it write any code.

hollowturtle

11 days ago

I'm curious. What makes you think that me sharing an example(which one of the many?) of what I actually tried would somehow add something to the conversation? What's the usefulness of just an anecdotal example?

As I said, we have plenty of different envs, codebases, requirements. Things are complex.

You're posing it like I tried just one time. It's been hundreds of hours of tries and I just found out what works best for me, like everyone should do. My original post above isn't that hard to understand.

Let me stress this again:

> That's why the debate is so polarized imo, there isn't a shared experience

prettyblocks

11 days ago

In my experience most people with the type of critique I'm seeing from you have only tried it one time or have not taken the time to invest in an environment/process that will work for agentic coding.

My question is not so much about sharing a cherry picked example, but the question was more like "have you tried in earnest to make it work". That's the part that wasn't clear from your original post. But you say you have, and you weren't impressed. Fair enough. I'm not trying to convince you otherwise, but I encourage people to give the tools a fair chance before throwing up their hands and deciding it's meh.

Having said all that, you're right there isn't a shared experience.

th0ma5

11 days ago

[dead]

newaccount670

11 days ago

hollowturtle

11 days ago

Try it then, prove your point. Good luck :)

nijave

11 days ago

Have had fairly good luck with Claude Code Opus 4.7 on xhigh effort.

I think it more reliably does IaC with established patterns especially when it can do a dry run.

Python is pretty decent but usually you need good prompting and a little bit of steering to prevent slop. The slop usually works tho

Codex w/ gpt-5.5 seems faster but maybe just a bit below Opus 4.7 quality.

I gave Opus access to a repl (pyrasite-ng) in a running Python process and it managed to find an 8 year old "memory leak"--a module level cache with no eviction. It did that using GC module and exploring the heap. I was pretty happy with that outcome. It would have been quite challenging for me to find myself without at least a few weeks of deep diving into memory leak hunting docs/resources.

treme

11 days ago

you are experiencing reverse Dunning–Kruger effect.

For someone that just dabbled in coding prior, it went from AI building 80%, and struggling through to finish the 20% when trying to build an app/website.

now it's like 97% and struggling with last 3%. Yes it'll look rough around the edges when evaluated by a senior dev, but being able to build MVP level things to completion with ease helps you stay engaged and motivated to continue and learn.

hollowturtle

11 days ago

Please do not cite Dunning–Kruger effect at random.

Who needs to generate a dumb demo of a 97% done CRUD app? We had code generators for those. Every time I read claims like that and ask for a further explanation, I discover it's people who were not productive before who are generating the so-called "MVP level things to completion with ease".

If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!

LLMs can effectively validate your business idea

AussieWog93

11 days ago

I don't really see your point. Most problems that people have aren't really super-novel, but just extremely bespoke.

To give a specific example, 12 months ago I had a client pay me to make a Chrome plugin that changed the rows in his Shopify Products page to display Quantity and SKU.

These days you'd just one-shot it in Claude.

hollowturtle

11 days ago

First of all, it just underlines how shitty the web has become; second, if that's your work I'd chase a career path where Claude can't one-shot this kind of dumb stuff

AussieWog93

11 days ago

It's not my work. I'm not even a full time dev any more.

But the client's problem was solved, and they're happy.

This is a genuinely useful thing. You don't need to shit all over it.

Glohrischi

11 days ago

That's a surprisingly arrogant take.

CRUD applications and converting business requirements into code are what software developers do 99% of the time, day in, day out.

kamaal

11 days ago

If you break down a complicated coding problem into smaller parts, it could be any problem.

You will see it's basically a very reusable part that's already been done countless times elsewhere.

People who think they do something so special and novel that it just can't be done by a non-human struggle with breaking down a problem into smaller parts.

Even if you do have such novel problems, it's not like every single day, every single bit of work you do is like that.

emilsoman

11 days ago

Curious, what's the career path you'd chase? Can you give examples of some work that you think Claude will never be able to one-shot?

AussieWog93

11 days ago

Oddly enough switched from software to selling retro games online.

Made ridiculous bank during 2019-2023, lost money 2024-2025 (I wasn't doing proper accounting at that stage, so it took a while to really internalise that the market wasn't insane anymore), looks like we'll make a decent-ish profit in 2025-2026 after pivoting the business model. Some regrets but it's possible staying in software could have been just as turbulent.

Funnily enough we're finally at the stage where I can launch my SaaS side-hustle which I've been sitting on for the past year and a half, so I could end up back in software again soon.

I would never say never, since I don't know what Claude would look like in 5 years' time, but there's plenty it can't do at the moment.

To give a concrete example, I don't let it make sweeping changes to the main "business logic" of my SaaS. Not because it's necessarily wrong but because I can't easily verify it. But I'll let it rip on peripheral stuff, or co-work with it.

y0eswddl

11 days ago

I'm beginning to get the sense that Sturgeon's Law is at play here and the non-crap 10% of us are arguing with the 90% for whom LLM's shitty output is actually better than what they could do on their own.

I've been lucky enough to work at places with majority intelligent engineers with similar tastes on quality to my own... but it seems to be that's not the norm or the case everywhere.

and it's the 90% that's most vocal. Sturgeon and D-K seem to go hand-in-hand.

mwigdahl

11 days ago

“Am I out of touch?”

“No, it’s the children who are wrong.”

jaccola

11 days ago

The obvious pushback to all of the slop is: coding was never hard. Learning resources were abundant and free.

If these people had a burning desire to build things prior to LLMs and couldn’t put in the effort to learn to build them (which is also fun!) then why would they ever put the effort into anything to understand it and make it good??

layer8

11 days ago

> coding was never hard. Learning resources were abundant and free.

Just a nitpick regarding “never”: Learning resources weren’t abundant and free 25 years ago, that’s a more recent development.

skydhash

11 days ago

Maybe in some parts of the world (including mine). But we didn’t have a lot of computers either. But 25 years ago, there were a lot of textbooks, and computer book publishers like O’Reilly were already active. I had the C programming language book (not 25 years ago, but the book is older than that) and you could learn a lot with that one book and codeblocks. Same thing with “The Go Programming Language”, “Learning Perl”, and “Programming Clojure”. You only need one book to get very decent.

layer8

11 days ago

True, but books weren’t free. I spent quite some money buying those.

viking123

11 days ago

And when an agent is capable of gluing together a web app with some CRUD backend and a UI with very rounded corners, one that solves nothing for end users, we call it capable. These are not hard problems

squidbeak

11 days ago

You insist that AI needs to be able to tackle hard problems, but can't say what qualifies as a hard problem. Can you see the problem with that? If you don't know what a hard problem looks like, how do you know the models can't tackle them?

skydhash

11 days ago

It’s not that it’s unable to tackle hard problems, really. It’s that you have to give it the solution, and the patterns to follow, and then monitor it because it will go down weird paths.

11 days ago

> It's since november 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".

The answer is "for lots of people, but not you".

You're doing a vague impression of being fair and even-handed, arguing for non-polarization, but underlying everything you're saying is an obvious attitude of polarizing superiority: That _your_ personal experience with AI is the real truth. That _your_ codebase is more intricate and more challenging than what other people are doing. That everyone else is being led by a "marketing hype train".

hollowturtle

11 days ago

When I say

> Absolutely not, not quite there not even close in my experience.

I obviously mean in my experience, not the real truth.

> That everyone else is being led by a "marketing hype

That, on the other hand, is obvious, and I later say it's not 0s or 1s: every job has its intricacies

new_account_100

11 days ago

[dead]

zarzavat

11 days ago

Somewhere right now some human artist is being tasked with drawing illustrations of pelicans riding bicycles to be used as training data at a big AI lab.

minimaxir

11 days ago

energy123

11 days ago

The quality of the Gemini pelican was such a step change in one iteration, while the other benchmarks remained quite flat, that I think you are right. Although whether they targeted Pelicans in particular or just svg, I can't say.

LZ_Khan

11 days ago

I'm curious how the 6 months have looked from a non-programmer's perspective. What kind of co-working tools and similar optimizations have people from other fields experienced?

opto

11 days ago

I am an instructor who helps deliver an apprenticeship. My new boss has been in our industry for about 20 years and is one of the most respected people in our company. They've just joined us to teach and are off doing a two week course. On the first day she was told to let AI write all of her lesson plans, and then feed the lesson plans to AI to make her slides...

Hopefully she rejects all this out of hand, but if she doesn't it'll mean that none of our trainees get the benefit of her experience, who she is as a person, and what she has to pass onto them.

We have 6 monthly reviews as instructors where we are told the same thing. "How could you use AI for your teaching?"

They don't even feel the need to justify why this would be desirable, or is needed at all. It's just pure bandwagonning. Unbelievably, most of my coworkers are extremely positive about AI, although none of them have told me they use it for anything besides preparing their lessons for them — they just use it instead of having to think, or spend time preparing...the only important thing they do at work.

It makes no sense to me.

tkgally

11 days ago

I’m teaching a class at a university in Japan (on AI-related issues, as it happens). I’ve been teaching for more than 40 years, but at 106 registered students this is by far the largest class I have ever taught. AI tools are very helpful for class management, such as keeping track of attendance and homework submissions.

I appreciate I am skeptical, but it is hard not to be when the world spends all day telling you a piece of technology is going to fundamentally change the world, and in real life you only see people use it to blag CVs, personal reports, and lesson planning.

bradley13

11 days ago

11 days ago

Claude in Office was a tipping point for nontechnical folks around me. Everyone’s slide decks are immaculate now. Finance isn’t needing nearly as much BI help. It’s pretty impressive.

grey-area

11 days ago

I find it really troubling finance are relying on LLMs (word generators!) for financial analysis - I mean I guess it means there will never be any annoying gaps in the data.

aidos

11 days ago

Depends on how it’s done.

I use it a lot now for knocking up grafana charts etc. It’s not so much that the LLM is feeding the numbers through. You can still use real tools to analyse and summarise the numbers, it’s just much quicker at driving them.

As ever with data analysis, two things will continue to be true. Real insights come from spotting something that looks off and digging into it deeper. Secondly, it’s really easy to connect data in a misleading way.

I’ve had a Claude analysis handed to me this morning including a summary list of actions we’re going to take next which falls into this very trap.

The insights you’ll get from your data will only be as deep as the curiosity of the person at the helm.

grey-area

11 days ago

Sure it depends how it is done but for most uses I'd say they are not appropriate - building tools with them is ok if you double check (though how many people will when the answers seem good enough at first?).

I'd find it really troubling if financial analysts are using them without knowing the deep limitations of the tooling (which the companies selling them will not highlight for you). They don't actually count or reason so they are liable to just make up figures based on their training dataset, not the data you give them.

Using them for actual financial analysis and generating reports based on data will lead to hallucinated figures which conform to what was asked for, not what the data says and silently fills in gaps in the data. It's extremely dangerous and not something they are good at at all.

aidos

11 days ago

Don’t get me wrong, I very much agree with the danger. As I highlighted - I saw it this morning when someone used Claude to draw the wrong conclusions.

I’m saying there is a way in which they can be used where there isn’t scope for numerical hallucinations at all. They can write sql queries, for example, without ever being allowed to even see numbers.

What invariably does and will happen though is they’ll inner join instead of left join and some data will get missed. Or there will be some missing context (users in this set already have a certain class of property by virtue of some selection bias and that will be mistreated as some signal etc).
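
A tiny illustration of that inner-vs-left-join trap, using an in-memory SQLite database. The tables and data are invented for the example; the point is only that the inner join silently drops the user with no orders, while the left join keeps them with a zero total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'ana'), (2, 'ben'), (3, 'cy');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
    """
)

QUERY = """
    SELECT u.name, COALESCE(SUM(o.amount), 0) AS total
    FROM users u
    {join} orders o ON o.user_id = u.id
    GROUP BY u.id, u.name
    ORDER BY u.id
"""

# Inner join: only users with at least one matching order survive.
inner_rows = conn.execute(QUERY.format(join="JOIN")).fetchall()

# Left join: every user is kept; users without orders get a total of 0.
left_rows = conn.execute(QUERY.format(join="LEFT JOIN")).fetchall()

# Cheap sanity check: does the report cover every user in the base table?
user_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
missing_from_inner = user_count - len(inner_rows)
```

Here `missing_from_inner` comes out as 1, because 'cy' has no orders. Comparing the row count of a report against the base table is the sort of check that catches this regardless of who, or what, wrote the query.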

Gigachad

11 days ago

Can I get Claude to view the slide decks for me so I don't waste my time?

RobinL

11 days ago

Interesting. I don't have to use PowerPoint much, but I hate it when I do. I don't want the llm to write the words but I do want it to make things look nice. So does this work well now?

angled

11 days ago

My pipeline for this is vscode + prompts + markdown templates + GitHub copilot -> markdown docs -> pandoc to produce .docx -> copilot in word for “nice” formatting -> copilot in ppt for nice decks. LLMs all the way down.

I find it’s easier to version control and diff the .md artefacts, those remain my authoritative source.

6 days ago

Claude for Powerpoint will generate legitimately beautiful decks for you. The chat app will create them as artifacts also.

jillesvangurp

11 days ago

With a little bit of work, it works very well. You can generate powerpoint directly with Codex or Claude Cowork. There is also Canva support for these tools and it has its own AI integration. Another useful tool in this space is the Gemini integration in Google slides.

If you are a bit technical, reveal.js is actually really nice for this. I one-shotted a PDF export for it that uses a headless browser. I've used that a few times now.

What works well for me is to take an existing presentation and then some raw input and generate a new presentation in the same style as the old one from the raw input. After that, I can go in and tweak individual slides.

Another thing I did recently was take somebody's existing pitch deck and fix it with a one line prompt: "this deck is a bit meh, pimp it!" that worked unreasonably well. I like using shitty prompts like that. Codex often manages to do the right thing if you don't overthink your prompts.

Classic deck of somebody that used way too much text and only bullets. It did a great job on that presenting the content in a more simple and better structured way. Pulling out key facts and highlighting those, simplifying text, etc. Doing that manually would have taken hours.

angled

11 days ago

In business: using coworking tools to review and propose filing of emails; manage my files and folders; on a daily basis scour the intranet for interesting and relevant content.

Personal: my wife tutors in her native language to non-native primary and high school kids. They are all using these tools now to generate fresh content for practice based on school lesson plans. The kids are improving much more quickly now than they were just a few months ago.

beng-nl

11 days ago

11 days ago

I’m not him, but I’ve started using them to do the analysis (SQL, Python etc.) and then output the report as Quarto HTML which can be hosted on GitHub Pages. It works well for this analysis style work.

Once I was going to send some figures to leadership so I checked the queries myself and not only had it done it correctly, but it had also included a lot of sanity checks with other places in the database which as a human I doubt I’d have had the time or inclination to do.

Even for modelling work it can be good to check your ETL queries, or write one itself and then check it etc.

alexwwang

9 days ago

Yes dude. You understand fully what I mean.

alexwwang

9 days ago

All the documents that were typed with a keyboard before, now can be created by code agents with properly designed and implemented prompts and skills.

I generate my blog with this method and you can refer to: https://blog.chuanxilu.net/en/

I am responsible for all the contents but the process of those essays and reports are first generated by prompts that embody my ideas, thoughts and facts I encountered.

Quothling

11 days ago

I think Claude Cowork through the Microsoft thing which was copilot but is now named M365 (or something?) is likely creating every PowerPoint presentation within our organisation at this point.

We have whatever AI is in teams transcribe every meeting, and it's scarily good at it. It's also extremely good at summarizing or finding things from previous meetings when tasked. One disadvantage of this is that I can see how stupid I sound in writing. I'll go "yeah, hmm, yeah, that's, yeah", but it really is pretty good.

I assume we're going to see a massive increase in AI with this Cowork inside the Microsoft client. We actually have a better tool available through a librechat where you can create and configure your own agents with the same filesystem access to your one drive, and a lot more tools and models than just Claude. Almost nobody has been capable of figuring out how to use it though, so they've been using the regular office365 copilot and it sucks so bad that a lot of people stopped believing in AI.

It's ironic that Microsoft fumbling the ball on AI, but being very good at enterprise customers (especially non-IT) means that they'll likely be the company which is going to sell us AI tools that people will actually use. I have no idea why it's so hard for people to pick up the Librechat tool we're given access to through our equity fund. It's quite literally a copy of ChatGPT where you can point-and-click configure an agent, but we're seeing that even employees who use a lot of ChatGPT privately don't use this tool professionally. Meanwhile everyone has been capable of using the Microsoft thing (that I personally think is less user friendly since you will need to add your configuration files to every prompt).

piokoch

11 days ago

"I have no idea why it's so hard for people to pick up the Librechat tool we're given access to through our equity fund"

That's because M365 is integrated with the whole Office/Exchange environment, especially in terms of security policies, etc. MS also guarantees that the data are private; this is very important for many companies, both from the IP protection perspective and because of the liability of exposing user or customer data (think of GDPR regulations in Europe).

I don't know who is behind LibreChat, probably some good and friendly folks, but when it comes to privacy/security Microsoft has much more to lose, and if shit happens it is easier to sue them than some random VC-financed company from the USA.

Havoc

11 days ago

At work the tools handed to most are still essentially chatbots. Getting access to coding tools is an uphill battle because there isn’t really a good way to manage risk yet. Hard enough to keep a coding agent in check locally and ensure it doesn’t rm -rf anything. Scale that to thousands of people with limited skill and it doesn’t really work. So currently they just don’t.

That’s in a finance shop. I’d imagine it’s different in programming shops where handing people Claude code is a bit more plausible

TrackerFF

11 days ago

Purely anecdotal, but in my team of 20 data analysts, we've seen a bunch of them become quite productive in producing tools and apps. These are analysts with mostly domain knowledge, and not so much programming knowledge - meaning that they knew the basics to write scripts, and wrangle data programmatically, but not enough to actually engage in software engineering.

Some of these are now contributors.

I also have a friend (beware, N=1 study) with zero prior programming knowledge that has released his first app.

verdverm

My mother is going on 5 years with multiple myeloma, a cancer that would have offed her in 5 months if it weren’t for advances in maintenance chemotherapy.

Medicine has done amazing things in my lifetime.

viking123

11 days ago

Nothing ever happens.

See you in 10-30 years when people are still dying of the same shit as today like oesophageal cancer and glioblastoma.

Maybe in the next century but by that time you and me both will be under the ground, and no, Amodei's doubling of human lifespan simply won't happen.

sigmoid10

11 days ago

[dead]

okamiueru

11 days ago

AlphaFold is not an LLM. As such, it isn't a fitting example for "good news" related to LLMs.

11 days ago

That is a half-truth.

willis936

11 days ago

Metal Gear Solid 2 was quaint and funny until 2025.

redsocksfan45

11 days ago

[dead]

originalvichy

11 days ago

[flagged]

TeMPOraL

11 days ago

> - Memory market cornering (...)

Wait, what? What is that?

> - Fast penetration of IP exfiltrating tools in companies world-wide.

That goes on the benefit side, I believe.

> - Autonomous agents killing Open Source by siphoning the attention economy

Anything attention economy disappearing is a "good riddance" to me.

john_strinlai

11 days ago

>Wait, what? What is that?

i believe they are just saying that RAM prices went crazy

shepherdjerred

11 days ago

> and there’s zero chance any AI lab would train a model for such a ridiculous task.

I'm not sure that's true anymore considering how popular Simon's blog is

_puk

11 days ago

> So maybe the AI labs have been paying attention after all!

swed420

11 days ago

According to OP:

> Why this test? Because pelicans are hard to draw, bicycles are hard to draw, pelicans can’t ride bicycles... and there’s zero chance any AI lab would train a model for such a ridiculous task.

At this juncture I'm left wondering why competing AI labs wouldn't train for this now well known "test".

nijave

11 days ago

Given their proclivity to scrape the entire contents of the internet, it's only a matter of time, intentional or otherwise.

I've heard the same has happened with common benchmarks (they've ingested solutions into training data).

sfdlkj3jk342a

11 days ago

I'm surprised by Grok as well:

https://grok.com/imagine/post/8d1eab88-737f-4d46-ba92-9b6502...

Interesting that it does better at making the pelican pedal in the video generation than in image generation.

IdiotSavage

11 days ago

Graphically perfect, but content-wise nonsense. The pelican's center of gravity is clearly behind the wheel. It needs to be above or very slightly ahead of the wheel.

horsawlarway

11 days ago

11 days ago

Counter-point, real incriminating videos will be easy to dismiss.

"No that's not me, that's AI"

11 days ago

What's the problem? If I enjoy some show, material or text, if it brings me value or a brief moment of happiness, I couldn't care less if it was made by an AI or a human.

This racism against AI-generated stuff has to stop. If not, we'll have a butlerian jihad on our hands that will set back prosperity, development and science for decades, perhaps centuries.

People mention the artists... ohh, boohoo... either do it in your free time, improve your performance and selling skills or move to another job.

It's not my job to slave away only so that artists can day dream and produce stuff that no one cares about.

whilenot-dev

11 days ago

I think we need to start separating such concepts like entertainment from the ones of enjoyment, fascination, function, interest, satisfaction, beauty and the sublime a bit more. Art theory literally has books on these things, as they all fall under the topic of aesthetics. Do you really enjoy a frozen pizza from the oven at home in the same way as a freshly made pizza from an authentic pizza oven?

I always care about the processes involved, especially if any human work is involved, from all its accuracies to its errors. For me, interesting things happen while we balance our understandings with a certain amount of holism and a certain amount of reductionism. Putting it on either side of the scale, like your holistic statements, is just pure ideology, and that doesn't hold any merit in reality and is honestly just bland, repetitive and boring.

Retric

11 days ago

> Do you really enjoy a frozen pizza from the oven at home in the same way as a freshly made pizza from an authentic pizza oven?

11 days ago

That’s really not how this is going to play out.

When advertising agencies for example see that their copywriter can go from idea to concept with a video generator instead of engaging an animator, they’ll simply cut the middleman who used to create that animation for them and use the tool instead, even if the content isn’t as good (though the quality of this one is really pretty good, there are obvious problems). They’ll happily accept mediocrity to save money.

People will still create adverts but quality and creativity will go down and a lot of jobs are going to be suddenly displaced.

flakeoil

11 days ago

Does "creative" mean that you are creative at coming up with ideas or does it mean that you are artistic and can create stuff?

I suppose it is more the latter, and it's the artistic people who create stuff who will suffer. The ones who come up with ideas but previously couldn't create because they lacked skill might win thanks to AI.

Coming up with ideas is easy, creating and putting in the effort is hard (until we had AI).

Probably the value of created stuff will go down rapidly because there will be so much of it.

AussieWog93

11 days ago

I wouldn't be that concerned that animation is going anywhere. Both outputs look really off, especially around the feet.

wongarsu

11 days ago

In a serious creative tool you would also want a lot more creative input. At a minimum the ability to steer the animation with skeletons that feed into a control net, or something like that. And the ability to control the look and feel and create much more consistent characters. Both things that exist in good tooling, but both things that create work that will keep animators employed. But it will dramatically reduce the number of animators needed to reach a given level of "good enough".

And looking at the trajectory of the animation industry, I don't think increases in productivity will be used to raise the quality of the animation if the alternative is to just pay fewer animators

grey-area

11 days ago

Yes sure if you look closely it’s slop, but a huge number of companies and advertisers just don’t care (and they feel the same about their social media content, blogs and yes code) - they will attempt to cut corners where they can to the detriment of true artists.

spacebanana7

11 days ago

There's an interesting economic contest here as well - is it more sustainable for a malware group to spend $500 in tokens looking for an issue in my app? or for me to spend $500 scanning for issues on every deployment?

Systemically this usually favours the offence, as they could scan my app once every 6 months whereas I'd need to do it on weekly releases.
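A rough sketch of that arithmetic, assuming the same flat $500 per scan for both sides, one attacker scan every 6 months, and one defender scan per weekly release. All names here are hypothetical and the figures are just the ones from the comment, not measured costs:

```python
# Back-of-the-envelope comparison of yearly scanning spend.
# Assumption: both sides pay the same flat price per scan.
COST_PER_SCAN = 500       # dollars, figure taken from the comment
WEEKS_PER_YEAR = 52


def annual_cost(scans_per_year: int, cost_per_scan: int = COST_PER_SCAN) -> int:
    """Total yearly spend for a given number of scans."""
    return scans_per_year * cost_per_scan


attacker_cost = annual_cost(2)                # one scan every 6 months
defender_cost = annual_cost(WEEKS_PER_YEAR)   # one scan per weekly release
cost_ratio = defender_cost / attacker_cost    # how much more the defender pays

print(f"attacker: ${attacker_cost}/yr, defender: ${defender_cost}/yr, ratio: {cost_ratio:.0f}x")
```

Under these assumptions the defender spends far more per year for the same coverage of a single app; the picture changes if the defender only scans the diff of each release, or if the attacker has to scan many targets.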

jxmesth

11 days ago

I'm a security person and would love to hear other people's input here as I don't have that much experience with this

thierrydamiba

11 days ago

Can you be more specific?

tetha

11 days ago

Three deterministic Linux LPEs in a week, an LPE in BSD in execve (of all things...), nginx vulnerabilities, one or two new gnarly supply chain attacks. Linus noting that the linux-security mailing list is getting flooded with duplicated, AI-driven reports of varying quality. There are pretty crazy keycloak vulnerabilities getting discovered.

We're most likely entering a year or two of rapid vulnerability discovery and patching, as well as reducing and minimizing system footprints just to survive the onslaught of strange vulnerabilities from e.g. ancient and widely unused kernel modules.

simonw

11 days ago

The Claude Mythos / Project Glasswing thing is real: https://www.anthropic.com/glasswing

I met a few people at PyCon this week who have been part of Glasswing (they're just starting to be allowed to talk about it) and it really does drive down the cost of finding vulnerabilities.

I've been collecting notes on that here: https://simonwillison.net/tags/ai-security-research/

ben8bit

11 days ago

The tooling has become so good though - the eco-system around the LLM. The models have become really good, yes - but it's definitely slowed in my opinion. The tooling is what really has become great - "harness" is probably the best word. When folk like Elon/Schmidt/Thiel/etc. talk about singularities and industrial revolutions - it sounds extremely out of touch - or actually protective of the massive capex they've potentially sunk.

EDIT: Schmidt's booed commencement speech was probably one of the most out-of-touch speeches (outside of a tech interview) I've heard.

sharperguy

11 days ago

Much of the recent improvement in models is in being trained specifically to make use of the tools the harnesses give them.

chrisss395

11 days ago

How much of what is being generated by LLMs is actually value add? My perception is there are lots of great experiments, but little real value.

+ Developers are more productive, but are you all leaving work at 3p and enjoying a new found sense of work-life balance?

+ Companies are investing heavily in AI, yet I'm paying more for the same thing. Jamie Dimon still pays me 0% on my checking despite spending billions on AI.

It may be that simply adopting AI isn't enough. Could new startups that are born-in-AI buck this trend? I wonder what Clayton Christensen would say if he were still around.

ivandotcodes

11 days ago

Reading through the thread, a lot of the inflection point debate seems to come down to people talking past each other about what got better. My read is that the models themselves didn't really jump in capability around November, but the harnesses around them got considerably more reliable, and the RLVR work earlier in 2025 had been training the models specifically to behave well inside those harnesses, so when the two met you got a compounding effect that felt like a step change even though neither piece was that dramatic on its own.

11 days ago

I only used Claude for the first time in April, previously only ChatGPT and Gemini. And I struggle to see what the hype is all about - yes it seems a tiny bit smarter than the pack, but on the $20 subscription it runs out of tokens in 5-20 minutes, and then you need to wait 3-4h.

ChatGPT 5.5 seems capable, although a bit stingy with “thinking” compared to earlier models, and I never run into session limits.

_puk

11 days ago

11 days ago

About Pelicans on bicycles:

> there’s zero chance any AI lab would train a model for such a ridiculous task

Well, I think this guy's tests have got enough visibility that I wouldn't be surprised if some AI models are trained on it specifically...

shantnutiwari

11 days ago

yeah, simon's blogs have been on the front page multiple times now, I wouldn't be surprised if all of them added a special case for it

rTX5CMRXIfFG

11 days ago

Am I crazy, or are these differences between the best models so marginal that you’d get roughly the same performance if you use the same high-quality harness (ie preloaded instructions from md files, including custom skills)?

bluegatty

11 days ago

You will immediately notice the difference if you use it at the threshold.

It's like most people just watching a 'starting nba player' (not superstar, but just starting player) vs one that sits on the bench.

If you were to just watch them play, work out, shoot - you'd never notice the difference.

Put them head to head and it's 98-54 and you start to see the patterns.

It's pretty interesting actually, someone tell me what the 'science' for this is, I'm sure there is some kind of information theory at work here.

Software has innumerable kinds of problems at varying level of complexity and so it provides the perfect testbed for seeing how far models can go in practice.

Should add: you're very right to hint that the harness, tooling, and models tuned to both the harness and the kinds of things people do on the harness, as well as some other things, do make an enormous difference.

By and large, SOTA Codex/Claude Code are substantially better - at least for now. That may change.

dnnddidiej

11 days ago

Head to head is interesting. I had not tried 2 agents on the same task simultaneously with 2 models.

Sparkyte

11 days ago

No you're not wrong. Many people will see what you see. Enthusiasts will see it as monumental squeezing out that last drop of performance. In my opinion I think it is okay for enthusiasts to feel that way. I'm just satisfied with getting a tool as an aid.

Personal opinion: we need to focus more on efficiency instead of how large or complex a model can get, as that model creeps into more resource requirements. If the goal is to cost a billion dollars to operate then we've really lost the idea of what models are supposed to be achieving.

raincole

11 days ago

By definition the differences between "best models" are small. It's tautology. If a model is significantly dumber than the others then it's not one of the best models.

minimaxir

11 days ago

To an extent. I've had GPT 5.5 solve problems that Opus 4.7 struggled with, using an identical AGENTS.md/CLAUDE.md and no skills.

grey-area

11 days ago

Haven’t noticed much significant progress in LLMs myself in 6 months (significant as in new or vastly improved capabilities or understanding, not new releases, there are plenty of those).

I feel like if anything people started to realise the significant limitations of LLMs when you try to use them as ‘agents’ which was the big direction LLM companies tried to push recently.

Best use of LLMs so far IMO is finding vulnerabilities (with human help) and pattern matching in other domains. For generating code and prose they are still mediocre and somewhat unreliable and for use as personal assistant agents I wouldn’t trust them.

So what’s happening with openclaw, the biggest experiment in agentic, vibe coded by the agents themselves? The thing that was so hot a few months ago.

https://github.com/openclaw/openclaw/pulse?period=daily

279 commits to main from 77 authors in the last 24 hours.

Why is there so much churn and how could you trust it with your data? This is changes in ONE day!

If these are useful changes, surely it’d be superhuman by now given months of this pace.

What are people using this for?

11 days ago

Starting from zero today, how would someone quickly get up to speed with the latest and greatest AI tooling on an extremely limited budget?

Is the only choice to pay for the "max" plans?

Or just read so much about it that you bs your way through an interview and then use the company's resources?

Simon, I'm curious too how much you invest each month researching all the latest and greatest AI tech?

x86cherry

11 days ago

11 days ago

The honest summary that doesn't show up in the six-month roundup: the unevenness. Boilerplate, tests, scaffolding, glue code: dramatically faster, sometimes 5-10x. Architecture, data modeling, careful security work, judgment calls about what to build: same as before, sometimes slower because tab-completion sneaks in plausible-but-wrong defaults you then have to undo.

The thing headline numbers ("AI made me 3x faster") hide is which 30% of the work the AI sped up and which 70% didn't move. For a solo dev the survivable bet got smaller, and that's the real change, not raw productivity. AI made certain projects worth attempting at all that wouldn't have been viable six months earlier.
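A minimal sketch of why the split matters, using Amdahl's law and the illustrative 30%/70% split and 5-10x figures from above (the function name is hypothetical, and the percentages are not measurements):

```python
def overall_speedup(accelerated_fraction: float, local_speedup: float) -> float:
    """Amdahl's law: whole-job speedup when only part of the work gets faster.

    accelerated_fraction: share of total time spent on work the AI speeds up (0..1).
    local_speedup: how many times faster that share becomes (> 0).
    """
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / local_speedup)


# Illustrative only: 30% of the work (boilerplate, tests, glue) gets faster,
# the other 70% (architecture, judgment calls) does not move.
for local in (5, 10, 1_000_000):
    print(f"{local}x on 30% of the work -> {overall_speedup(0.3, local):.2f}x overall")
```

With that split, even an arbitrarily large speedup on the accelerated 30% caps the overall gain at 1/0.7, roughly 1.43x, which is a long way from a headline 3x.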

koonsolo

11 days ago

Have you seen the automated tests that QA members deliver? My experience is that they are horrible, and it's not so hard to beat that low quality bar with an LLM.

I have a theory: if they were good at writing automated tests, they would have been developers instead of QA engineers.

Not saying that there aren't any high quality QA engineers, I worked with some. But LLMs raised the bar in a way that most QA engineers can't reach.

simonw

11 days ago

Yeah, I don't think the role of QA is to write automated tests - developers should be doing most of that work.

The best QA people I've worked with didn't write much code at all. You'd give them a new system and they'd find all of the bugs, testing obscure edge-cases that you'd never thought of.

Mashimo

11 days ago

Huh, never thought about QA writing unit tests.

In my limited experience they write test cases, test each story, do regression test, verify bugs from customers. All by hand.

At my current job I don't want to miss them.

koonsolo

11 days ago

They test everything manually and don't have any automated end-to-end tests? That basically proves my point ;).

11 days ago

There is an entire category of software engineers who exist entirely to knock out features on microservices or do easily automatable QA work whose jobs will disappear.

vanuatu

11 days ago

I think there will be larger markets, more companies, more jobs than before due to AI, but also a very painful transition period

AI reduces the cost of producing software (and other intellectual tasks), which greatly improves the viability for more and more ambitious projects. As far as we know the amount of problems software (and humanity) can solve is unbounded

It feels like the market has shifted in SWE yet again to heavily prioritize a new set of skills, of which those in the top quartile are desired more than ever

asdff

11 days ago

The problems in any domain are infinite. But, alas, money is not.

trojans1290

11 days ago

What are these skills?

stuxnet79

11 days ago

Sure give it a go, perhaps it will work better now with frontier models, I haven't tried it in a while (this was a year ago, things have improved since then). I'm not sure what tests for having amazing graphics, gameplay, input, UI, sounds, etc would look like, but it would be interesting to see the results!

vessenes

11 days ago

okay hold my beer. both claude and codex running now.

EDIT: both agents took about 20 minutes. I used that exact prompt in a clean directory for each, and then said "deploy to netlify" - so a total of two prompts.

Codex: https://astounding-bavarois-27b5a2.netlify.app

Claude: http://strong-hotteok-91dfb0.netlify.app

Netlify is having trouble claiming the Claude project, so if you need a password it's "My-Drop-Site"

FYI, Claude rated itself 7.7/10 for fun, and Codex 98/100 during the fun test loop. As you'll see if you poke at them, Claude needs a physics bug fix round. But I think these both did about what I would have expected.

grey-area

11 days ago

Nice, very retro (looking at the codex one)!

Claude one doesn't really work (collision detection was the problem I had before too), but fairly close.

Yes when I tried previously I had a few gameplay issues in frogger and I couldn't manage to one-shot this sort of thing at the time (a year ago), so last year definitely saw some good progress at this sort of thing. The asteroids game I was very happy with though, had a very cool retro feel and was wireframe only. Wasn't so keen on the code produced as it had a patchwork feel to it.

vessenes

11 days ago

To your point, I didn't even look at the code.. :) Okay, I looked at the codex code. it's super reasonable -- separation of concerns, operating on a state model, it's not over designed. I did not hate it. I also noted that codex put in a CRT simulator loop which is a nice touch.

I think a year ago this would have taken a lot of back and forth and arguing; to me that's kind of the point of Simon's article -- a lot more just 'works' now.

grey-area

11 days ago

Sorry I meant the code a year ago - it took a bit more hand-holding at that point and it was a mishmash of different things, but I feel it’s just slightly easier now - still similar. Haven’t looked into this one just had a quick play. Thanks for trying it out!

I think his article is for the last 6 months - my feeling is progress with LLMs has stalled recently and generated code still has problems with accuracy and coherence and subtle bugs, but everyone has a different experience.

vessenes

11 days ago

I agree with that. Right now, you choose:

Subtle bugs in understanding the spec but strong arch and coding (codex)

Subtle bugs in implementation but good understanding of the spec (claude).

LarsDu88

11 days ago

Frogger is kind of too well known such that there is ample training data for building that specific game.

The game I was thinking of is relatively obscure -> Panel de Pon

grey-area

10 days ago

Yes I was surprised at the time that it failed so badly at Frogger, I think from memory it was collision detection it just couldn't get right, plus the positioning of various game elements as it has quite a lot going on (the examples above still have some problems with these things). I thought there would be open source examples out there in js/html but perhaps not so much for frogger.

bluegatty

11 days ago

'Producing Images' or even 'Some Code that is Valid and Compiles' is in some ways one of the most misleading ways we assess quality of the AI.

It is getting very good at producing code that compiles - at the algorithmic level.

This is definitely noteworthy - and the AI is crossing a critical 'productivity threshold'.

But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.

Everyone has tried to get AI to 'Draw The Thing They Want' and you notice immediately how it's almost impossible to 'adjust the image' along the vector you want - because ... and this is key:

-> the AI doesn't really understand what a Duck is, its components, or fully how it made the duck <-

It just knows how to 'incant' the duck.

This becomes very clear when you try to get the AI to write proper documentation - it fails so miserably, even with direct guidance.

This is really strong evidence of how poorly the AI is generalizing, and that it is not 'understanding' rather it's 'synthesizing' from patterns.

We already kind of knew that - but we have not yet built an intuition for that until now.

Only now can we see 'how amazing the pattern synthesis' is - it's almost magic, and yet how it falls off a cliff otherwise

This has deep implications for the 'road ahead' and the kinds of things we're going to be able to do with AI.

In short: the AI is 'Wizard Level Code Helper, Researcher, and Worker' - but it very clearly lacks capabilities even one level of abstraction above the code itself.

LLMs were first trained by 'text' and now ... they are 'trained by our compilers'. Basically g++, javac, tsc are the 'Verifiable Human Rewards' in the post-training and reinforcement learning - and the AI is getting extremely good at producing 'code that compiles', but that's definitely an indirection from 'code that does what we want'.
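A toy illustration of that indirection, in Python rather than g++/javac/tsc. This is only a sketch of the distinction; the two reward functions and the `add` candidate are hypothetical and say nothing about how any lab actually builds its reward signals:

```python
def compiles(source: str) -> bool:
    """Reward signal 1: does the candidate parse and compile at all?"""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def does_what_we_want(source: str, cases) -> bool:
    """Reward signal 2: does the candidate's `add` return the expected results?"""
    namespace = {}
    exec(source, namespace)  # run the candidate's definitions in a scratch namespace
    add = namespace["add"]
    return all(add(a, b) == expected for (a, b), expected in cases)


# A candidate that is perfectly valid code but subtracts instead of adding.
candidate = "def add(a, b):\n    return a - b\n"
cases = [((2, 3), 5), ((0, 0), 0)]

compile_reward = compiles(candidate)                 # True: it compiles
behaviour_reward = does_what_we_want(candidate, cases)  # False: wrong behaviour
print(compile_reward, behaviour_reward)
```

The first check is cheap and fully verifiable, the second needs someone to have written down what "what we want" means, which is the gap being described.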

It's astonishing that it took us all this time to internalize and start to discover what I think will be in hindsight a very obvious 'threshold' of its capabilities.

We are constantly 'amazed' at the work that it can do, and therefore over-project its capabilities.

I have no doubt that even with these limitations - the AI will unlock a lot more as it gets better - and - that it will 'creep up' the layers of abstraction of its understanding.

But I strongly believe that the AI is going to get much 'wider' (pattern matching dominance) before it gets 'higher' (intrinsic understanding) - and - that this may be a fundamental limitation.

This may be 'the LeCun' insight - when he talks about the limitations of LLMs in detail - I believe this is that insight writ large.

Even the term AI - or certainly 'AGI' may be a misleading metaphor - were we to have always called it 'Stochastic Algorithms' or something along those lines, it's possible that our intuition would be framed a bit better.

The most interesting thing is how it is definitely amazing, world changing, novel and powerful in some ways - and obviously useless in others at the same time. That's the 'threshold' we need to better understand.

nl

11 days ago

> But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.

That might be the case, but Simon's case "Generate an SVG of a pelican riding a bicycle" is very different.

The model actually has to understand what parts of a pelican and bicycle come together in something like an anatomically plausible way. That's a higher level of abstraction than something like passing the same prompt to Stable Diffusion etc

(The new Nano Banana/GPT Image 2.0 models are different though - they have significant world knowledge baked in)

bluegatty

11 days ago

"That's a higher level of abstraction"

No, it's not because it's seen 'anatomy' for Pelicans, Animals - even how it's represented in Animals.

If you try to get the AI to actually decompose it and start to 'draw pelicans' in very obscure ways, it will immediately fail.

Try to get the AI to draw the pelican from a very odd angle - like underneath, to the right, one wing extended, one wing not ... 0% chance.

Precisely because it does not understand those things.

FYI it's a slightly unfair case because it does not have 'world model' yet, which will actually solve that problem, but even then not through very much abstracting.

We're a long way away - but in the meantime, there's lots to unpack.

nl

11 days ago

> Try to get the AI to draw the pelican from a very odd angle - like underneath, to the right, one wing extended, one wing not ... 0% chance.

Proof by existence?

https://gist.github.com/nlothian/50241d34a654fcf0caa280d4475...

Looks pretty good to me. ChatGPT in "Thinking" model.

Edit: I've added the Opus version on the same link.

bluegatty

11 days ago

? That's evidence that it does not work.

Neither of those are from 'under' they both look either front or top?

Imagine yourself under the duck's feet, looking up at an oblique angle - wings as I suggested. The AI won't do that, it has no reference for dimensionality.

nl

10 days ago

What on earth do you mean?

I live near an area with lots of pelicans. If you look up at one flying overhead this is what they look like.

Here is a photo for comparison: https://commons.wikimedia.org/wiki/File:American_white_pelic...

bluegatty

10 days ago

Sure, something like that. Not like the examples you posted.

nl

10 days ago

I'm very confused. The SVGs show the beak, wing, tail, feet and body as though viewed from directly underneath.

They look similar to the photo, but meet the instructions better ("from underneath").

What are you expecting exactly?

bluegatty

8 days ago

They don't look anything like the photo.

They're not 'oblique' - they're 'squared' views and none of the anatomy looks appropriately adjusted.

The model has no ability to 'rotate a figure in 3d space' and conceptualize how all of the elements work together.

It's 'pattern matching'.

This is the 'great intuition' for how LLMs work - it's not perfect because a lot of 'synthetic reasoning' can be done obviously.

And they probably never will, LLMs are not the right thing for this kind of task.

Think about how they can investigate massive code-bases and find arcane bugs - but can't draw a duck from arbitrary oblique angles etc.

That said, with enough examples they probably could.

squeaky-clean

11 days ago

Those are just awful compared to the side view of a pelican on a bike.

nl

11 days ago

Have you seen a pelican from underneath? There's not much to show!

IanCal

11 days ago

Are we a long way away?

https://chatgpt.com/share/e/6a0bf28b-e198-8012-9a88-c777d965...

tardedmeme

11 days ago

It's always been much easier to copy an existing product than to make a new one nobody's thought of before.

11 days ago

Yes, with good RLVR at scale you can greatly improve performance especially on benchmarks

The hope was that good RLVR on relatively contrived datasets (like benchmarks) would be generalized to good software taste, which has somewhat succeeded but also the models fail in horrible ways still

And the hope beyond that is that good skills in fundamental problem solving tasks (coding, math) would generalize to tasks beyond math and code, which did happen but less so

rdedev

11 days ago

I would say that most improvements are in easily verifiable things like code or math. At least that's where all the amazing results seem to be coming from.

For other domains I am not sure, but I've heard from people like Cal Newport that the rate of improvement outside of code and math is not as impressive.

4b11b4

11 days ago

RL we're gonna find out will get abandoned cuz we don't even know what is getting "aligned", just my naive gut feeling don't take it seriously

bschwindHN

11 days ago

https://imgur.com/a/UlGcBou

pamcake

10 days ago

> I put together these annotated slides from my five minute lightning talk at PyCon US 2026

Is there a video or audio of this talk?

qiine

11 days ago

>pelicans can’t ride bicycles... and there’s zero chance any AI lab would train a model for such a ridiculous task.

humm

subarctic

11 days ago

Is there a video of this talk?

tayo42

11 days ago

The claw thing really came and went fast lol

yieldcrv

11 days ago

I just started a new job and the person I report to was just excited to tell me about it, here in Mid May

"and then you have to get a mac mini, and then, and then"

smile and nod, it pays weekly

viking123

11 days ago

I mean yeah? It was a marketing campaign to boost the model providers and give Steinberger a cozy job at OpenAI. Hook, line and sinker.

Wake me up when we have an agent with constant learning and changing weights that I can have personally, not some LLM that can always fall prone to jailbreak and context injection attacks.

You think most of this stuff here is organic? Oh boy..

Razengan

11 days ago

AI is like Sauron's Ring: it only amplifies the user's innate abilities.

It can either help you conquer the world if you were already doing that anyway or it can make you spend your life in a cave before throwing you into a fucking volcano.

bob1029

11 days ago

It definitely seems like the point of no return has been passed.

The size of the codebase doesn't matter anymore. In fact, I am finding that the larger the codebase the better the performance. Starting from scratch with vague ambition is not the same as solving a specific stack trace over a mountain of decade-old code. The latter performs better and is also more exciting for the business. It would seem more callers = more constraints to verify against.

For the last 3 months I've felt like I've been dropping GPS guided bombs from orbit. No one can tell the difference between AI authored and my hand written code, other than via the implication of the radically increased daily work volume. There's definitely AI in there, but it's like a homogeneous cybernetic blend of my work and the computer's. I own all of it, can explain all of it immediately, but I only wrote maybe 10% of it by hand.

The development team should be mostly "solved" by now with regard to the AI transformation. If you are still at Home Depot picking out your proverbial hammer, it's time to start heading for the self checkout. The rest of the business is where the real money and headlines will be made at this point. AI writing code is ancient news now. Custom harnesses that business people can use to automate workflows will print a lot more money. Bringing some bacon to the rest of the business may also help to preserve your career path in these uncertain times.

Remember what Jobs said about the customer. A lot of times, people don’t know what they want until you show it to them. Most people wouldn't have believed the iPhone was even remotely possible until the moment it was publicly revealed and made available for purchase. I am finding the same effect in the business with AI. What it can actually do when well engineered and applied to the domain will usually outperform the expectations of its users by a wide margin. All these fears about alignment, hallucinations, cost, ethics, the environment, my ego/career, etc., seem to melt away like some kind of luxurious chocolate once the performance becomes clear to the executive staff. I was able to convince the board with an unsolicited, 5 minute demo I didn't even personally deliver. I've never seen these people sign contracts so quickly.

bradley13

11 days ago

11 days ago

HN has a mechanism that causes popular blogs to stay popular.

It's a winner-takes-all karma prize for being first to post the article.

This causes a rush of people to post.

HN has a mechanism by which duplicate submissions count as upvotes toward the first submission.

This is a positive feedback for the desire to be first, which increases duplicate submissions and in turn the karma reward.
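
A rough sketch of the loop described above, as a toy model only. It assumes the claim in this comment (a duplicate submission counts as an upvote on the first one) and that a new post starts at one point; `tally_submissions` and the sample data are made up for illustration and are not HN's actual code.

```python
from collections import defaultdict


def tally_submissions(submissions):
    """Toy model: the first submitter of a URL owns the post, and every
    later duplicate of that URL is counted as one upvote on that post."""
    first_poster = {}          # url -> user who submitted it first
    points = defaultdict(int)  # url -> points on the first submission
    for user, url in submissions:
        if url not in first_poster:
            first_poster[url] = user
            points[url] = 1    # assumption: a new post starts at one point
        else:
            points[url] += 1   # duplicate counted as an upvote
    return first_poster, dict(points)


# Hypothetical submissions, in arrival order.
submissions = [
    ("alice", "https://example.com/post"),
    ("bob", "https://example.com/post"),
    ("carol", "https://example.com/post"),
    ("dave", "https://example.com/other"),
]
first_poster, points = tally_submissions(submissions)
```

In this model only the first submitter gets credit, while every latecomer who tries still adds to that first post's score, which is the feedback described above.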

11 days ago

Years ago I used to read his blog on Django and found it quite interesting despite being neither a Django nor even a Python user - this must have been at least 10 years ago and perhaps more.

When he resurfaced in my feeds as an AI commentator it took me quite a long while to join the dots that he was the same person!

hansmayer

11 days ago

TL;DR:

"Coding agents got really good - here, a bunch of non-relevant slop-pictures of pelicans riding bikes as a key benchmark AND a couple of hardly relevant edge-case demo-projects of mine to prove it right!"

Come on man, where is the AI writing all the code in 6 months? We're close to June, and Amodei's latest statement from January does not look like it will be fulfilled in the next few weeks, does it now?

nothinkjustai

11 days ago

[flagged]

tomhow

11 days ago

jrowen

11 days ago

There's something fitting about the mystical nature of LLMs and scrolling through a bunch of goofy pelicans on bicycles representing report cards for the bleeding edge of technology.

How are these even graded? Qwen3.6-35B-A3B gets high marks for a pelican with a gaping hole in its bill?

edit: Just noticed its feet are disconnected from its legs as well (but right on the pedals!). Pardon my French but that's Chinese af.