Rperry2174
2 days ago
What's interesting to me is that GPT-5.3 and Opus 4.6 are diverging philosophically, and really in the same way that actual engineers and orgs have diverged philosophically
With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.
With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.
that feels like a reflection of a real split in how people think llm-based coding should work...
some want tight human-in-the-loop control and others want to delegate whole chunks of work and review the result
Interested to see if we eventually get models optimized for those two philosophies, and for the 3rd, 4th, 5th philosophies that will emerge in the coming years.
Maybe it will be less about benchmarks and more about different ideas of what working-with-ai means
karmasimida
2 days ago
> With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.
> With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.
Ain't the UX the exact opposite? Codex thinks much longer before it gives you back the answer.
xd1936
2 days ago
I've also had the exact opposite experience with tone. Claude Code wants to build with me, and Codex wants to go off on its own for a while before returning with opinions.
mrkstu
2 days ago
It's likely that both are steering towards the middle from their current relative extremes and converging to nearly the same place.
gervwyk
2 days ago
also my experience using these two models. they are trying to recover from oversteer, perhaps.
mercnz
a day ago
well, with the recent delays I can easily find Claude Code going off on its own for 20 minutes and have no idea what it's going to come back with. but one time it overflowed its context on a simple question and then used up the rest of my session window. in a way, a lot of AI assistants IME have this awkward thing where they complicate something in a non-visible way, think about it for a long time burning up context, and then come up with a summary based on some misconception.
zen4ttitude
5 hours ago
For complex tasks I ask ChatGPT or Grok to define context, then I take it to Claude for accurate execution. I also created a complete pipeline to use locally and enrich with skills, agents, RAG, and profiles. It is slower but very good. There is no magic: the richer the context window, the more precise and contained the execution.
esperent
a day ago
The key is a well defined task with strong guardrails. You can add these to your agents file over time or you can probably just find someone's online to copy the basics from. Any time you find it doing something you didn't expect or don't like, add guardrails to prevent that in future. Claude hooks are also useful here, along with the hookify plugin to create them for you based on the current conversation.
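To make that concrete, here's a minimal sketch of a guardrail implemented as a Claude Code PreToolUse hook, assuming the documented hook contract (tool-call details arrive as JSON on stdin; exiting with code 2 blocks the call and feeds stderr back to the model). The blocked substrings are purely illustrative:

```python
#!/usr/bin/env python3
"""Sketch of a PreToolUse guardrail hook for Claude Code.

Assumes the documented contract: the tool call arrives as JSON on
stdin, and exit code 2 blocks it (stderr is shown to the model).
Verify against the current hooks docs before relying on this.
"""
import json
import sys

# Illustrative guardrails -- grow this list whenever the agent
# does something you didn't expect or don't like.
BLOCKED_SUBSTRINGS = ["rm -rf", "git push --force", "DROP TABLE"]

event = json.load(sys.stdin)
command = event.get("tool_input", {}).get("command", "")

for bad in BLOCKED_SUBSTRINGS:
    if bad in command:
        # Exit code 2 blocks the tool call; the message is fed back to Claude.
        print(f"Guardrail: '{bad}' is not allowed here.", file=sys.stderr)
        sys.exit(2)

sys.exit(0)  # everything else passes through
```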
vorticalbox
a day ago
I have started using openspec for this. I find it works far better to have a proposal and a list of tasks; the AI stays more focused.
PeterStuer
19 hours ago
In terms of 'tone', I have been very impressed with Qwen-code-next over the last 2 days, especially as I have it running locally on a single modest 4090.
turtle4
17 hours ago
Did you set that up following a guide or anything you could share?
PeterStuer
16 hours ago
Easiest way I know is to just use LMStudio. Just download and press play :). Optional, but recommended: increase the context length to 262144 if you have the DRAM available. It will definitely get slower as your interaction grows longer, but (at least for me) it's still a tolerable speed.
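For anyone who wants to script against it: once LM Studio is serving the model, it exposes an OpenAI-compatible endpoint (http://localhost:1234/v1 by default), so a plain OpenAI client works. A minimal sketch, where the port and model identifier are assumptions; use whatever LM Studio's server tab actually shows:

```python
# Sketch of talking to a local LM Studio server via its
# OpenAI-compatible API. Port and model name are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # the local server ignores the key, but the client requires one
)

resp = client.chat.completions.create(
    model="qwen-code-next",  # hypothetical identifier; use the name LM Studio lists
    messages=[{"role": "user", "content": "Refactor this loop into a comprehension: ..."}],
)
print(resp.choices[0].message.content)
```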
mathrawka
16 hours ago
not OP, but I got it running on my 4090 (and RAM) by following this guide: https://unsloth.ai/docs/models/qwen3-coder-next
I see around 30 t/s
kamban
a day ago
Same here, CC gives me options to pick direction after the planning stage.
WilcoKruijer
2 days ago
Yes, you’re right for 4.5 and 5.2. Hence they’re focusing on improving the opposite thing and thus are actually converging.
cwyers
2 days ago
Codex now lets you tell the LLM things in the middle of its thinking without interrupting it, so you can read the thinking traces and tell it to change course if it's going off track.
fluidcruft
2 days ago
That just seems like a UI difference. I've always interrupted Claude Code, added a comment, and it's continued without much issue. Otherwise, if you just type, the message is queued for next time. There's no real reason to prefer one over the other, except it sounds like Codex can't queue messages?
int_19h
a day ago
Codex can queue messages, but the queue only gets flushed once the agent is done with whatever it was working on, whereas Claude will read messages and adjust accordingly in the middle of whatever it is doing. It sounds like OP is saying that Codex can now do this latter bit as well.
esperent
a day ago
The problem is if you're using subagents, the only way to interject is often to press escape multiple times, which kills all the running subagents. All I wanted to do was add a minor steering guideline.
This might be better with the new teams feature.
Skwrm
21 hours ago
They actually made a change a few weeks ago that made subagents more steerable
When they ask approval for a tool call, press down til the selector is on "No" and press tab, then you can add any extra instructions
cruffle_duffle
a day ago
That is so annoying too because it basically throws away all the work the subagent did.
Another thing that annoys me is the subagents never output durable findings unless you explicitly tell their parent to prompt the subagent to “write their output to a file for later reuse” (or something like that anyway)
I have no idea how but there needs to be ways to backtrack on context while somehow also maintaining the “future context”…
bt1a
2 days ago
This is most likely an inference-serving problem, in terms of capacity and latency, given that in the API Opus X has always responded quickly and the latest GPT models slowly.
ghosty141
2 days ago
I'm personally 100% convinced (assuming prices stay reasonable) that the Codex approach is here to stay.
Having a human in the loop eliminates all the problems that LLMs have, and continuously reviewing smallish chunks of code works really well in my experience.
It saves so much time having Codex do all the plumbing so you can focus on the actual "core" part of a feature.
LLMs still (and I doubt that changes) can't think and generalize. If I tell Codex to implement 3 features he won't stop and find a general solution that unifies them unless explicitly told to. This makes it kinda pointless for the "full autonomy" approach since effectively code quality and abstractions completely go down the drain over time. That's fine if it's just prototyping or "throwaway" scripts, but for bigger codebases where longevity matters it's a dealbreaker.
_zoltan_
2 days ago
I'm personally 100% convinced of the opposite, that it's a waste of time to steer them. we know now that agentic loops can converge given the proper framing and self-reflectiveness tools.
sealeck
2 days ago
Converge towards what though... I think the level of testing/verification you need to have an LLM output a non-trivial feature (e.g. Paxos/anything with concurrency, business logic that isn't just "fetch value from spreadsheet, add to another number and save to the database") is pretty high.
replygirl
2 days ago
in the new world, engineers have to actually be good at capturing and interpreting requirements
halfcat
a day ago
In this new world, why stop there? It would be even better if engineers were also medical doctors and held multiple doctorate degrees in mathematics and physics and also were rockstar sales people.
NamlchakKhandro
a day ago
sounds like the kind of hyperbole from someone who's just been forced to set up a linter for the first time
craigdalton
a day ago
As a doctor, this sounds like an engineer's job.
zeroxfe
2 days ago
> it's a waste of time to steer them
It's not a waste of time, it's a responsibility. All things need steering, even humans -- there's only so much precision that can be extrapolated from prompts, and as the tasks get bigger, small deviations can turn into very large mistakes.
There's a balance to strike between micro-management and no steering at all.
adw
a day ago
The prompt is decreasingly relevant. The verification environment you have is what actually matters.
freakynit
a day ago
I think this all comes down to information.
Most prompts we give are severely information-deficient. The reason LLMs can still produce acceptable results is because they compensate with their prior training and background knowledge.
The same applies to verification: it's fundamentally an information problem.
You see this exact dynamic when delegating work to humans. That's why good teams rely on extremely detailed specs. It's all a game of information.
bcarv
2 days ago
Does the AI agent know what your company is doing right now, what every coworker is working on, how they are doing it, and how your boss will change priorities next month without being told?
If it really knows better, then fire everyone and let the agent take charge. lol
hyldmo
2 days ago
No, but Codex wouldn’t have asked you those questions either
bcarv
2 days ago
For me, it still asks for confirmation at every decision when using plans. And when multiple unforeseen options appear, it asks again. I don’t think you’ve used Codex in a while.
IMTDb
a day ago
A significant portion of engineering time is now spent ensuring that yes, the LLM does know about all of that. This context can be surfaced through skills, MCP, connectors, RAG over your tools, etc. Companies are also starting to reshape their entire processes to ensure this information can be properly and accurately surfaced. Most are still far from completing that transformation, but progress tends to happen slowly, then all at once.
bcarv
a day ago
Shut up, bot. Nobody wants to change anything. Management is still run by the same idiots, changing their minds every day. No MCP will fix that.
generallyjosh
11 hours ago
All we can do is try our best to look at the world with clear eyes, and think about where the industry's going over the next couple years
Not how we want things to be, but how they actually are and will be
I don't think AI for programming is a passing fad
jondwillis
21 hours ago
Who hurt you?
Also what are you even proposing/advocating for here?
This meta-state-of-company context is just as capturable as anything else with the right lines of questioning and spyware and UI/UX to elicit it.
rapind
a day ago
Maybe some day, but as a claude code user it makes enough pretty serious screw ups, even with a very clearly defined plan, that I review everything it produces.
You might be able to get away without the review step for a bit, but eventually (and not long) you will be bitten.
jaggederest
a day ago
I use that to feed back into my spec development and prompting and CI harnesses, not steering in real time.
Every mistake is a chance to fix the system so that mistake is less likely or impossible.
I rarely fix anything in real time - you review, see issues, fix them in the spec, reset the branch back to zero and try again. Generally, the spec is the part I develop interactively, and then set it loose to go crazy.
This feels, initially, incredibly painful. You're no longer developing software, you're doing therapy for robots. But it delivers enormous compounding gains, and you can use your agent to do significant parts of it for you.
Terretta
a day ago
> You're no longer developing software, you're doing therapy for robots.
Or, really, hacking in "learning", building your knowhow-base.
> But it delivers enormous compounding gains, and you can use your agent to do significant parts of it for you.
Strong yes to both, so strong that it's curious Claude Code, Codex, Claude Cowork, etc., don't yet bake in an explicit knowledge evolution agent curating and evolving their markdown knowledge base:
https://github.com/anthropics/knowledge-work-plugins
Unlikely to help with benchmarks. Very likely to improve utility ratings (as rated by outcome improvements over time) from teams using the tools together.
For those following along at home:
This is the return of the "expert system", now running on a generalized "expert system machine".
rapind
a day ago
I assumed you'd build such a massive set of rules (that claude often does not obey) that you'd eat up your context very quickly. I've actually removed all plugins / MCPs because they chewed up way too much context.
jaggederest
a day ago
It's as much about what to remove as what to add. Curation is the key. Skills also give you some levers to get the kind of context-sensitive instruction you need, though I haven't delved too deeply into them. My current total instruction set is around 2,500 tokens at the moment.
vidarh
20 hours ago
Reviewing what it produces once it thinks it has met the acceptance criteria and the test suite passes is very different from wasting time babysitting every tiny change.
rapind
15 hours ago
True, and that's usually what I'm doing now, but to be honest I'm also giving all of its code at least a cursory glance.
Some of the things it occasionally does:
- Ignores conventions (even when emphasized in the CLAUDE.md)
- Decides to just not implement tests if it spins out on them too much (it tells you, but only as it happens, and that scrolls by pretty quick)
- Writes badly performing code (N+1)
- Does more than you asked (in a bad way, changing UIs or adding cruft)
- Makes generally bad assumptions
I'm not trying to be overly negative, but in my experience to date, you still need to babysit it. I'm interested though in the idea of using multiple models to have them perform independent reviews to at least flag spots that could use human intervention / review.
vidarh
37 minutes ago
Sure, but none of those things requires you to watch it work. They're all easy to pick up on when reviewing a finished change, which ideally should come after its instructions have had it run linters, run subagents that verify it has added tests, and run subagents doing a code review.
I don't want to waste my time reviewing a change the model can still significantly improve all by itself. My time costs far more than the model's.
_zoltan_
2 hours ago
then you're using it wrong, to be frank with you.
you give it tools so it can compile and run the code. then you give it more tools so it can decide between iterations if it got closer to the goal or not. let it evaluate itself. if it can't evaluate something, let it write tests and benchmark itself.
I guarantee that if the criteria is very well defined and benchmarkable, it will do the right thing in X iterations.
(I don't do UI development. I do end-to-end system performance on two very large code bases. My tests can be measured, and the measure is simply binary: better or not. It works.)
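As a rough sketch of what that loop can look like (the make targets and run_agent_iteration are hypothetical stand-ins for whatever build/bench harness you actually wire up):

```python
# Sketch of a benchmark-gated agent loop: propose a change, compile,
# measure, keep only improvements. All commands here are assumptions.
import subprocess

def build_ok() -> bool:
    return subprocess.run(["make", "build"]).returncode == 0

def benchmark() -> float:
    out = subprocess.run(["make", "bench"], capture_output=True, text=True)
    return float(out.stdout.strip())  # assumes the bench target prints one number

def run_agent_iteration(feedback: str) -> None:
    ...  # hypothetical: invoke your coding agent with the feedback as the prompt

best = benchmark()
for i in range(10):  # bounded at X iterations
    run_agent_iteration(f"Current score: {best}. Make it better.")
    if not build_ok():
        subprocess.run(["git", "checkout", "--", "."])  # broken builds are never "better"
        continue
    score = benchmark()
    if score > best:  # the binary criterion: better or not
        best = score
        subprocess.run(["git", "commit", "-am", f"iteration {i}: score {score}"])
    else:
        subprocess.run(["git", "checkout", "--", "."])  # discard the regression
```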
halfcat
a day ago
> given the proper framing
This sounds like never. Most businesses are still shuffling paper and couldn’t give you the requirements for a CRUD app if their lives depended on it.
You’re right, in theory, but it’s like saying you could predict the future if you could just model the universe in perfect detail. But it’s not possible, even in theory.
If you can fully describe what you need to the degree ambiguity is removed, you’ve already built the thing.
If you can’t fully describe the thing, like some general “make more profit” or “lower costs”, you’re in paper clip maximizer territory.
jondwillis
21 hours ago
> If you can fully describe what you need to the degree ambiguity is removed, you’ve already built the thing.
Trying to get my company to realize this right now.
Probably the most efficient way to work, would be on a video call including the product person/stakeholder, designer, and me, the one responsible for the actual code, so that we can churn through the now incredibly fast and cheap implementation step together in pure alignment.
You could probably do it async but it’s so much faster to not have to keep waiting for one another.
retinaros
14 hours ago
good luck.
_zoltan_
2 hours ago
I've been working on very complex problems with this model, and the results have surprised people over and over again.
xXSLAYERXx
21 hours ago
I've been using codex for one week and I have been the most productive I have ever been. Small prs, tight rules, I get almost exactly what I want. Things tend to go sideways when scope creeps into my request. But I just close the PR instead of fighting with the agent. In one week: 28 prs, 26 merged. Absolutely unreal.
vidarh
20 hours ago
I will personally never consider using an agent that can't be easily pushed toward working on its own for long periods (hours) at a time. It's a total waste of time for me to babysit the LLM.
sejje
a day ago
Aider was doing this a long time ago
Skidaddle
a day ago
But tokens are way cheaper than human labor
NuclearPM
2 days ago
> If I tell Codex to implement 3 features he won't stop and find a general solution that unifies them unless explicitly told to
That could easily be automated.
utilize1808
2 days ago
I think it's the opposite. Especially considering Codex started out as a web app that offered very little interactivity: you were supposed to drop a request and let it run autonomously in a containerized environment; you could then follow up on it via chat --- no interactive code editing.
Rperry2174
2 days ago
Fair, I agree that was true of early Codex and my perception too... but today there are two announcements that came out, and that's what I'm referring to.
specifically, the GPT-5.3 post explicitly leans into "interactive collaborator" language and steering mid-execution:
OpenAI post: "Much like a colleague, you can steer and interact with GPT-5.3-Codex while it’s working, without losing context."
OpenAI post: "Instead of waiting for a final output, you can interact in real time—ask questions, discuss approaches, and steer toward the solution"
Claude post: "Claude Opus 4.6 is designed for longer-running, agentic work — planning complex tasks more carefully and executing them with less back-and-forth from the user."
stingraycharles
a day ago
I think those OpenAI announcements are mainly because this hasn’t been the case for them earlier, while it has been part of Claude Code since the beginning.
I don’t think there’s something deeply philosophical in here, especially as Claude Code is pushing stronger for asking more questions recently, introduced functionality to “chat about questions” while they’re asked, etc.
user34283
a day ago
When I tried 5.2 Codex in GitHub Copilot it executed some first steps like searching for the relevant files, then it output the number "2" and stopped the response.
On further prompting it did the next step and terminated early again after printing how it would proceed.
It's most likely just a bug in GitHub Copilot, but it seems weird to me that they add models that clearly don't even work with their agentic harness.
fluidcruft
2 days ago
Frankly it seems to me that codex is playing catch-up with claude code, and claude code is just continuing to move further ahead. The thing with claude code is it will work longer... if you want it to. It's always had good oversight and (at least for me) it builds trust slowly until you are wishing it would do more at once. Codex has been getting better, but back in the day it would just do things, say it's done, and you're just sitting there wondering "wtf are you doing?". Claude code is more the opposite: you can watch as closely as you want, and often you get to a point where you have enough trust and experience with it that you know what it's going to do and don't want to bother.
mcintyre1994
2 days ago
This kind of sounds like both of them stepping into the other’s turf, to simplify a bit.
I haven’t used Codex but use Claude Code, and the way people (before today) described Codex to me was like how you’re describing Opus 4.6
So it sounds like they’re converging toward “both these approaches are useful at different times” potentially? And neither want people who prefer one way of working to be locked to the other’s model.
giancarlostoro
2 days ago
> With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.
This feels wrong. I can't comment on Codex, but Claude will prompt you and ask you before changing files. Even when I run it in dangerous mode in Zed, I can still review all the diffs and undo them, or, you know, tell it what to change. If you're worried about it making too many decisions, you can pre-prompt Claude Code (via .claude/instructions.md) and instruct it to always ask follow-up questions regarding architectural decisions.
Sometimes I go out of my way to tell Claude DO NOT ASK ME FOR FOLLOW UPS JUST DO THE THING.
Rperry2174
2 days ago
yeah I'm mostly just talking about how they're framing it: "Claude Opus 4.6 is designed for longer-running, agentic work — planning complex tasks more carefully and executing them with less back-and-forth from the user"
I guess it's also quite interesting that the way they're framing these products is the opposite of how people currently perceive them, and I guess that may be a conscious choice...
giancarlostoro
2 days ago
I get what you mean now. I like that, to be fair; sometimes I want Claude to tell me some architectural options, so I ask it so I can think about what my options are, and sometimes I rethink my problem if I like Claude's conclusion.
jhancock
2 days ago
Good breakdown.
I usually want the codex approach for code/product "shaping" iteratively with the ai.
Once things are shaped and common "scaling patterns" are well established, then for things like adding a front end (which is constantly changing: more views), letting the autonomous approach run wild can *sometimes* be useful.
I have found that codex is better at remembering when I ask it not to get carried away... whereas claude requires constant reminders.
techbro_1a
2 days ago
> With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.
This is true, but I find that Codex thinks more than Opus. That's why 5.2 Codex was more reliable than Opus 4.5
bob1029
2 days ago
I think there is another philosophy where the agent is domain specific. Not that we have to invent an entirely new universe for every product or business, but that there is a small amount of semi-customization involved to achieve an ideal agent.
I would much rather work with things like the Chat Completions API than any frameworks that compose over it. I want total control over how tool calling and error handling work. I've got concerns specific to my business/product/customer that couldn't possibly have been considered as part of these frameworks.
Whether or not a human needs to be tightly looped in could vary wildly depending on the specific part of the business you are dealing with. Having a purpose-built agent that understands where additional verification needs to occur (and not occur) can give you the best of both worlds.
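As an illustration of that hand-rolled approach, here's a minimal sketch of a tool-calling loop driven directly against the Chat Completions API, where dispatch and error handling are entirely your own code. The tool and its "business logic" are made up for the example:

```python
# Sketch of a framework-free tool-calling loop over the Chat Completions API.
# lookup_order and its dispatch logic are hypothetical.
import json
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "lookup_order",  # hypothetical domain-specific tool
        "description": "Look up an order's status by id.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the status of order 42?"}]
while True:
    resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:
        print(msg.content)  # the model is done calling tools
        break
    messages.append(msg)
    for call in msg.tool_calls:
        try:
            args = json.loads(call.function.arguments)
            result = {"order_id": args["order_id"], "status": "shipped"}  # your dispatch here
        except Exception as e:
            result = {"error": str(e)}  # your error policy, not a framework's
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
```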
dimgl
a day ago
Did you get those backwards? Codex, Gemini, etc. all wait until the requests are done to accept user feedback. Claude Code allows you to insert messages in between turns.
aurareturn
a day ago
Codex added an experimental feature to allow steering mid task.
aulin
a day ago
Admit I didn't follow the announcements, but isn't that a matter of UI? It doesn't seem like something that should be baked into the model, but rather into the tooling around it and the instructions you give them. E.g. I've been playing with GitHub Copilot CLI (which, despite its bad reputation, is absolutely amazing), and the same model completely changes its behavior with the prompt. You can have it answer a question promptly, or send it on a multi-hour multi-agent exploration writing detailed specs, with a single prompt. Or you can have it stop midway for clarification. It all depends on the instructions. This is also particularly interesting with GitHub's billing model, as each prompt counts as 1 request no matter how many tokens it burns.
F7F7F7
a day ago
It depends honestly. Both are prone to doing the exact opposite of what you asked. Especially with poor context management.
I’ve had both $200 plans and now just have Max x20 and use the $20 ChatGPT plan for an inferior Codex.
My experience (up until today) has always been that Codex acts like that one Sr Engineer that we all know. They are kind of a dick. And will disappear into a dark hole and emerge with a circle when you asked for a pentagon. Then let you know why edges are bad for you.
And yes, Anthropic is pivoting hard into everything agentic. I bet it’s not too long before Claude Code stops differentiating models. I had Opus blow 750k tokens on a single small task.
cchance
2 days ago
Just because you can inject steering doesn't mean they steered away from long-running...
There are hundreds of people posting Codex 5.2 runs that go for hours unattended and come back with full commits.
mdale
a day ago
I think it's just both companies building/marketing to the strengths of their competitor, as the general perception has been the opposite for Codex and Opus respectively.
sfmike
a day ago
It's the opposite? codex course-corrects and is self-inquisitive. opus is just wrong, and you need to keep re-feeding it that it's wrong.
hbarka
2 days ago
How can they be diverging? LLMs are built on similar foundations, aka the Transformer architecture. Do you mean the training method (RLHF) is diverging?
iranintoavan
2 days ago
I'm not OP but I suspect they are meaning the products / tooling / company direction, not necessarily the underlying LLM architecture.
dboon
a day ago
…what? It is quite literally the opposite. This isn’t a matter of taste or perception.
mi_lk
13 hours ago
It’s the opposite way
blurbleblurble
2 days ago
Funny cause the situation was totally flipped last iteration.
pyrolistical
2 days ago
Boeing vs Airbus philosophy
rippeltippel
a day ago
Grabbing popcorn...
rozumbrada
2 days ago
I've read this exact comment, with I would say completely the same words, several times on X, and I would bet my money it's LLM-generated by someone who hasn't even tried both tools. This AI slop, even on a site like this without direct monetisation implications from fake engagement, is making me sick...
drsalt
a day ago
be rich, hire an ai guy, let him deal with it
d--b
2 days ago
I am definitely using Opus as an interactive collaborator that I steer mid-execution, stay in the loop and course correct as it works.
I mean, Opus asks a lot whether it should run things, and each time you can tell it to change. And if that's not enough, you can always press esc to interrupt.