miguelgrinberg
3 hours ago
> One thing I’ve noticed is that different people get wildly different results with LLMs, so I suspect there’s some element of how you’re talking to them that affects the results.
It's always easier to blame the prompt and convince yourself that you have some sort of talent in how you talk to LLMs that others don't.
In my experience the differences are mostly in how the code produced by the LLM is reviewed. Developers who have experience reviewing code are more likely to find problems immediately and complain they aren't getting great results without a lot of hand holding. And those who rarely or never reviewed code from other developers are invariably going to miss stuff and rate the output they get higher.
zackify
an hour ago
This definitely is the case. I was talking to someone complaining about how LLMs don't work well.
They said it couldn't fix an issue it made.
I asked if they gave it any way to validate what it did.
They did not. Some people really are saying "fix this" instead of saying "x fn is doing y when someone makes a request to it. Please attempt to fix x, validate it by accessing the endpoint afterwards, and write tests."
It's shocking that some people don't give it any real instruction or any way to check itself.
In addition, I get great results doing voice-to-text with very specific workflows: asking it to add a new feature, describing which functions I want changed, then reviewing as I go rather than waiting until the end.
petcat
an hour ago
If you tell a human junior developer just "fix this" then they will spend a week on a wild-goose chase with nothing to show for it.
At least the LLM will only take 5 minutes to tell you they don't know what to do.
speakingmoistly
17 minutes ago
To be fair, that happening feels more like poor management and mentorship than "juniors are scatterbrained".
Over time, you build up the right reflexes to avoid a one-week goose chase with them. Heck, since we're working with people, you don't just say "fix this"; you earmark time to make sure everyone is aligned on what needs to be done and what the plan is.
ruszki
30 minutes ago
Do they? I’ve never got a response that something was impossible, or stupid. LLMs are happy to verify that a no-op does nothing if they don’t know how to fix something. They’d rather make something useless than really tackle a problem, if they can make tests green that way, or claim that something “works”.
And I’ve never asked Claude Code for something that’s really impossible, or even really difficult.
sobjornstad
28 minutes ago
There are subtler versions of this too. I've been working on a TUI app for a couple of weeks, and having great success getting it to interactively test by sending tmux commands, but every once in a while it would just deliver code that didn't work. I finally realized it was because the capture tools I gave it didn't capture the cursor location, so it would, understandably, get confused about where it was and what was selected.
I promptly went and fixed this before doing any more work, because I know if I was put in that situation I would refuse to do any more work until I could actually use the app properly. In general, if you wouldn't be able to solve a problem with the tools you give an LLM, it will probably do a bad job too.
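As a minimal sketch of what fixing that capture gap can look like (the helper name and marker convention below are my own illustration, not the commenter's actual tooling): tmux can report the cursor position via `display-message -p '#{cursor_x},#{cursor_y}'` alongside `capture-pane -p`, and a small post-processing step can stamp that position into the snapshot the agent reads.

```python
def annotate_cursor(pane_text: str, cursor_x: int, cursor_y: int,
                    marker: str = "▮") -> str:
    """Insert a visible cursor marker into a captured tmux pane snapshot.

    cursor_y is the row (0-based) and cursor_x the column, matching
    tmux's #{cursor_y} / #{cursor_x} format variables. Rows outside
    the snapshot are left untouched.
    """
    lines = pane_text.split("\n")
    if 0 <= cursor_y < len(lines):
        line = lines[cursor_y]
        # Pad short lines so the marker lands at the right column
        line = line.ljust(cursor_x)
        lines[cursor_y] = line[:cursor_x] + marker + line[cursor_x:]
    return "\n".join(lines)
```

With a snapshot like `"item one\nitem two"` and cursor at column 5, row 1, the agent would see the marker sitting in front of "two" and could tell which item is selected.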
raw_anon_1111
2 hours ago
I have 30 years of experience delivering code and 10 years of leading architecture. My argument is that the only thing that matters is whether the entire implementation - code + architecture (your database, networking, your runtime that determines scaling, etc.) - meets the functional and non-functional requirements. Functional = does it meet the business requirements and UX; non-functional = scalability, security, performance, concurrency, etc.
I only carefully review the parts of the implementation that I know “work on my machine but will break once I put it in a real-world scenario”. Even before AI I wasn’t one of the people who got into geek wars about which GoF pattern you should have used.
With the exception of concurrency, where it’s hard to have automated tests, I care more about the unit tests, or honestly the integration tests, and testing for scalability than about the code itself. Your login isn’t slow because you chose a for loop instead of a while loop. I have my agents run the appropriate tests after code changes.
I didn’t look at a line of code for my vibe-coded admin UI authenticated with AWS Cognito that will be used by, at most, a dozen people, and whoever maintains it will probably also use a coding agent. I did review the functionality and UX.
Code before AI was always the grind between my architectural vision and implementation
awakeasleep
an hour ago
Explain how fragility of implementation, like spaghetti code or high coupling and low cohesion, fits into your worldview?
petcat
an hour ago
As human developers, I think we're struggling with "letting go" of the code. The code we write (or agents write) is really just an intermediate representation (IR) of the solution.
For instance, GCC will inline functions, unroll loops, and apply myriad other optimizations that we never look at (and actually want!). When we review the ASM that GCC generates, we are not concerned with the "spaghetti", the "high coupling", or the "low cohesion". We care that it works and is correct for what it is supposed to do.
Source code in a higher-level language is not really different anymore. Agents write the code, maybe we guide them on patterns and correct them when they are obviously wrong, but the code is just the work-item artifact that comes out of extensive specification, discussion, proposal review, and more review of the reviews.
A well-guided, iterative process and problem/solution description should be able to generate an equivalent implementation whether a human is writing the code or an agent.
sarchertech
an hour ago
A compiler uses rigorous modeling and testing to ensure that generated code is semantically equivalent. It can do this because it is translating from one formal language to another.
Translating a natural-language prompt, on the other hand, requires the LLM to make thousands of small decisions that will be different each time you regenerate the artifact. Even ignoring non-determinism, prompt instability means that any small change to the spec will result in a vastly different program.
A natural language spec and test suite cannot be complete enough to encode all of these differences without being at least as complex as the code.
Therefore each time you regenerate large sections of code without review, you will see scores of observable behavior differences that will surface to the user as churn, jank, and broken workflows.
Your tests will not encode every user workflow, not even close. Ask yourself if you have ever worked on a non-trivial piece of software where you could randomly regenerate 10% of the implementation, while keeping to the spec, without seeing a flurry of bug reports.
This may change if LLMs improve such that they are able to reason about code changes to the degree a human can. As of today they cannot do this, and require tests and human code review to prevent them from spinning out. But I suspect at that point they’ll be doing our job, as well as the CEOs’, and we’ll have bigger problems.
petcat
21 minutes ago
> A compiler uses rigorous modeling and testing to ensure that generated code is semantically equivalent.
Here are the reported miscompilation bugs in GCC so far in 2026. The ones labeled "wrong-code".
https://gcc.gnu.org/bugzilla/buglist.cgi?chfield=%5BBug%20cr...
I count 121 of them.
sarchertech
16 minutes ago
If you can’t understand the difference between a bug that will rarely cause a compiler encountering an edge case to generate a wrong instruction, and an LLM that will generate 2 completely different programs with zero overlap because you added a single word to your prompt, then I don’t know what to tell you.
raw_anon_1111
31 minutes ago
As if humans are deterministic when you delegate tasks to them. I would hope that your test cases cover the requirements. If not, your implementation is just as brittle when other developers come on board, or even when you come back to a project after six months.
sarchertech
19 minutes ago
1. Agents aren’t humans. A human can write a working 100k LOC application with zero tests (not saying they should but they could and have). An agent cannot do this.
Agents require tests to keep them from spinning out and your tests do not cover all of the behaviors you care about.
2. If you doubt that your tests miss some of your requirements, consider that 99.9% of the production bugs you’ve ever had passed your test suite completely.
throwaw12
28 minutes ago
Valid points. But a crucial part of not "letting go" of the code is that we are responsible for that code at the moment.
If, in the future, LLM providers take ownership of on-call for the code they have produced, I would write an "AUTO-REVIEW-ACCEPTER" bot to accept everything and deploy it to production.
If a company requires me to own something, then I should know what that thing is, understand its ins and outs in detail, and be able to adjust quickly when things go wrong.
mikeocool
21 minutes ago
When requirements change, a compiler has the benefit of not having to go back and edit the binary it produced.
Maybe we should treat LLM-generated code similarly: just regenerate everything fresh from the spec any time there’s a change. Though personally I haven’t had much success with that yet.
krilcebre
26 minutes ago
You are comparing compilers to a completely non-deterministic code-generation tool that often does not take observable behavior into account at all and will happily screw up part of your system without you noticing, because you misworded a single prompt.
No amount of unit/integration tests cover every single use case in sufficiently complex software, so you cannot rely on that alone.
raw_anon_1111
an hour ago
You did see the part about my unit, integration and scalability testing? The testing harness is what prevents the fragility.
It doesn’t matter to AI whether the code is spaghetti code or not. What you said was only important when humans were maintaining the code.
No human should ever be forced to look at the code behind my vibe-coded internal admin portal: straight Python, no frameworks, server-side rendered, producing HTML and JS for the front end, all hosted in a single Lambda along with much of the backend API.
I haven’t done web development since 2002 with Classic ASP besides some copy and paste feature work once in a blue moon.
In my repos, post-AI, my Claude/agent files have summaries of the initial statement of work, the transcripts from the requirements sessions, my well-labeled design diagrams, my design-review session transcripts where I explained the design to the client and answered questions, and a link to the Google NotebookLM project with all of the artifacts. I have separate md files for different implementation components.
The NotebookLM project can be used for any future maintainers to ask questions about the project based on all of the artifacts.
sarchertech
31 minutes ago
> It doesn’t matter to AI whether the code is spaghetti code or not. What you said was only important when humans were maintaining the code.
In my experience using AI to work on existing systems, the AI definitely performs much better on code that humans would consider readable.
You can’t really sit here talking about architecting greenfield systems with AI, using methodology that didn’t exist 6 months ago, while confidently proclaiming “trust me, they’ll be maintainable”.
Well you can, and most consultants do tend to do that, but it’s not worth much.
datsci_est_2015
an hour ago
Also developer UX, common antipatterns, etc
This “the only thing that matters about code is whether it meets requirements” is such a tired take, and I can’t imagine anyone seriously spouting it has had to maintain real software.
raw_anon_1111
44 minutes ago
The developer UX is the markdown files, if no developer ever looks at the code.
Whether you are tired of it or not, absolutely no one in your value chain, neither the customers who give your company money nor your management chain, cares about your code beyond whether it meets the functional and non-functional requirements. They never did.
And, of course, whether it was done on time and on budget.
vova_hn2
an hour ago
I personally haven't made up my mind either way yet, but I imagine that a vibe-coding advocate could tell you that maintaining code only makes sense when the code is expensive to produce.
If the code is cheap to produce, you don't maintain it, you just throw it away and regenerate.
sarchertech
6 minutes ago
If you have users, this only works if you have managed to encode nearly every user observable behavior into your test suite.
I’ve never seen this done even with LLMs. Not even close. And even if you did it, the test suite is almost definitely more complex than the code and will suffer from all the same maintainability problems.
mikkupikku
3 hours ago
It's not skill at talking to an LLM, it's the user's skill and experience with the problem they're asking the LLM to solve. They work better for problems the prompter knows well, and poorly for problems the prompter doesn't really understand.
Try it yourself. Ask Claude for something you don't really understand. Then learn that thing, get a fresh instance of Claude, and try again; this time it will work much better, because your knowledge and experience will be naturally embedded in the prompt you write.
Roxxik
3 hours ago
It's not only about you understanding the how, but also about you understanding the goal.
I often use AI successfully, but in a few cases it went badly. That was when I didn't even know the end goal and regularly switched the fundamental assumptions that the LLM tried to build on.
One case was a simulation where I wanted to see some specific property in the convergence behavior, but I had no idea how it would get there in the dynamics of the simulation or how it should behave when perturbed.
So the LLM tried many fundamentally different approaches, and when I had something that specifically did not work, it immediately switched approaches.
Next time I get to work on this (toy) problem I will let it implement some of them, fully parametrize them, and have a go at it myself. There is a concrete goal, and I can play around myself to see if my specific convergence criterion is even possible.
FeepingCreature
an hour ago
LLMs massively reduce the cost of "let's just try this". I think trying to migrate your entire repo is usually a fool's errand. Figure out a way to break the load-bearing part of the problem out into a sub-project, solve it there, iterate as much as you like. Claude can give you a test gui in one or two minutes, as often as you like. When you have it reliably working there, make Claude write up a detailed spec and bring that back to the main project.
mikkupikku
2 hours ago
Yup, same sort of experience. If I'm fishing for something based on vibes that I can't really visualize or explain, it's going to be a slog. That said, telling the LLM the nature of my dilemma up front, warning it that I'll be waffling, seems to help a little.
__alexs
2 hours ago
I review most of the code I get LLMs to write and actually I think the main challenge is finding the right chunk size for each task you ask it to do.
As I use it more, I gain more intuition about the kinds of problems it can handle on its own, versus those I need to break down into smaller pieces before setting it loose.
Without research and planning, agents are mostly very expensive and slow to get things done, if they can at all. But with the right initial breakdown and specification of the work, they are incredibly fast.
therealpygon
2 hours ago
I think entirely disregarding the fundamental operation of LLMs with such dismissiveness is ungrounded. You are literally saying it isn’t a skill issue while pointing out a different skill issue.
It is absolutely, unequivocally, patently false to say that the input doesn’t affect the output, and if the input has an impact, then it IS a skill.
make_it_sure
2 hours ago
You are overestimating the skill of code review. Some people have very specific ways of writing code and solving problems which are not aligned with what LLMs write, but that doesn't mean the LLM is wrong.
I know senior developers who are very radical about some nonsense patterns they think are much better than others. If they see code that doesn't follow them, they say it's trash.
Even so, you can guide the LLM to write the code the way you like.
And you are wrong: it's a lot about how people write the prompt.
datsci_est_2015
19 minutes ago
> you are overestimating the skill of code review.
“You are overestimating the skill of [reading, comprehending, and critically assessing code of a non-guaranteed quality]” is an absurd statement if you properly expand out what “code review” means.
I don’t care if you code review the CSS file for the Bojangles online menu web page, but you better be code reviewing the firmware for my dad’s pacemaker.
This whole back and forth with LLM-generated code makes me think that the marginal utility of a lot of code the strong proponents write is <1¢. If I fuck up my code, it costs our partners $200/hr per false alert, which obliterates the profit margin of using our software in the first place.
cultofmetatron
3 hours ago
> Developers who have experience reviewing code are more likely to find problems immediately and complain they aren't getting great results without a lot of hand holding
this makes me feel better about the amount of disdain I've been feeling toward the output from these LLMs. Sometimes it pops out exactly what I need, but I can never count on it not to go off the rails and require a lot of manual editing.
or_am_i
3 hours ago
It's always easier to blame the model and convince yourself that you have some sort of talent in reviewing LLM's work that others don't.
In my experience the differences are mostly in how the code produced by the LLM is prompted and what context is given to the agent. Developers who have experience delegating their work are more likely to prevent downstream problems from happening immediately and complain their colleagues cannot prompt as efficiently without a lot of hand holding. And those who rarely or never delegated their work are invariably going to miss crucial context details and rate the output they get lower.
loloquwowndueo
3 hours ago
Never takes long for the “you’re holding it wrong” crowd to pop in.
darkerside
3 hours ago
That's a terrible reason for a mass consumer tool to fail, and a perfectly reasonable one for a professional power tool to fail
hellosimon
2 hours ago
Partly true, but I think there's a real skill in catching subtle logic errors in generated code too, not just in prompting well. Both matter.
ozgrakkurt
2 hours ago
Unfortunately, it is impossible to ascertain what is what from what we read online. Everyone is different and uses the tools in a different way. People also use different tools and do different things with them. And each person's judgment can be wildly different, as you are saying here.
We can't trust the measurements that companies post either because truth isn't their first goal.
Just use it or don't, depending on how it works out, imo. I personally find it marginally on the positive side for coding.
baxtr
2 hours ago
I thought I'd try to debunk your argument with a food example. I am not sure I succeeded though. Judge for yourself:
It's always easier to blame the ingredients and convince yourself that you have some sort of talent in how you cook that others don't.
In my experience the differences are mostly in how the dishes produced in the kitchen are tasted. Chefs who have experience tasting dishes critically are more likely to find problems immediately and complain they aren't getting great results without a lot of careful adjustments. And those who rarely or never tasted food from other cooks are invariably going to miss stuff and rate the dishes they get higher.
staticassertion
an hour ago
> It's always easier to blame the prompt and convince yourself that you have some sort of talent in how you talk to LLMs that other's don't.
Well, it's easily the simplest explanation, right?
kasey_junk
3 hours ago
I think that code review experience is a big driver of success with the llms, but my take away is somewhat different. If you’ve spent a lot of time reviewing other people’s code you realize the failures you see with llms are common failures full stop. Humans make them too.
I also think reviewable code, that is, code specifically delivered in a manner that makes code review straightforward, was always valuable; but now that generation costs have dropped, its relative value is much higher. So structuring your approach (including plans and prompts) to produce easily reviewed code is a more valuable skill than before.
JasonADrury
2 hours ago
In my experience the differences are mostly between the chair and the keyboard.
I asked Codex to scrape a bunch of restaurant guides I like, and make me an iPhone app which shows those restaurants on a map color coded based on if they're open, closed or closing/opening soon.
I'd never built an iOS app before, but it took me less than 10 minutes of screen time to get this pushed onto my phone.
The app works, does exactly what I want it to do and meaningfully improves my life on a daily basis.
The "AI can't build anything useful" crowd consists entirely of fools and liars.
ttanveer
3 hours ago
That seems to make sense. Any suggestions for improving this skill of reviewing code?
I think a number of us more junior programmers especially lack in this regard, and don't see a clear way of improving beyond just using LLMs more and learning with time.
Dannymetconan
3 hours ago
It's "easy". You just spend a couple of years reviewing PRs and working in a professional environment getting feedback from your peers and experience the consequences of code.
There is no shortcut unfortunately.
vsl
2 hours ago
You improve this skill not by using LLMs more, but by getting more experienced as a programmer yourself. Spotting problems during review comes from experience, from having learned the lessons, knowing the codebase and the libraries used, etc.
christofosho
2 hours ago
Find another developer and pair/work together on a project. It doesn't need to be serious, but you should organize it like it is. So, a breakdown of tasks needed to accomplish the goal first. And then many pull requests into the source that can be peer reviewed.
stavros
3 hours ago
That's what I meant, though. I didn't mean "I say the right words", I meant "I don't give them a sentence and walk away".