inerte
9 hours ago
Codex has always been better at following agents.md and prompts, but I would say that in the last 3 months Claude Code got worse (freestyling like we see here) while Codex got EVEN more strict.
80% of the time I ask Claude Code a question, it kinda assumes I am asking because I disagree with something it said, then acts on a supposition. I've resorted to appending things like "THIS IS JUST A QUESTION. DO NOT EDIT CODE. DO NOT RUN COMMANDS". Which is ridiculous.
Codex, on the other hand, will follow something I said pages and pages ago, and because it has a much larger context window (at least with the setup I have here at work), it's just better at following orders.
With this project I am doing, because I want to be more strict (it's a new programming language), Codex has been the perfect tool. I am mostly using Claude Code when I don't care so much about the end result, or it's a very, very small or very, very new project.
kace91
9 hours ago
>I've resorted to append things like "THIS IS JUST A QUESTION. DO NOT EDIT CODE. DO NOT RUN COMMANDS". Which is ridiculous.
Funny to read that, because for me it's not even new behavior. I have developed a tendency to add something like "(genuinely asking, do not take as a criticism)".
I'm from a more confrontational culture, so I just assumed this was just corporate American tone framing criticism softly, and me compensating for it.
ddoolin
9 hours ago
Same here. I quickly learned that if I merely ask questions about its understanding or plans, it starts looking for alternatives, because my questioning is interpreted as rejection or criticism rather than taken at face value. So I often (not always) have to caveat questions like that too. It's really been like that since before Claude Code or Codex even rolled around.
It's just strange because that's a very human behavior, and although it learns from humans, it isn't one, so it would be nice if it just acted more robotically in this sense.
muyuu
6 hours ago
Do what you would do with a person, which is to allocate time for them to produce documentation, and be specific about it.
VortexLain
8 hours ago
Prepending "Good." to clarifying questions actually helps with that surprisingly well.
cardanome
7 hours ago
Oh funny enough, I often add stuff like "genuinely asking, do not take as a criticism" when talking with humans so I do it naturally with LLMs.
People often use questions as an indirect form of telling someone to do something or criticizing something.
I definitely had people mistake my questions for attacks.
There are a lot of times when people do expect the LLM to interpret their question as a command to do something. And they would get quite angry if the LLM just answered the question.
Not that I wouldn't prefer it if LLMs took things more literally, but these models are trained for the average neurotypical user, so that quirk makes perfect sense to me.
mikepurvis
9 hours ago
I've been using chat and Copilot for many months but finally gave Claude Code a go, and I've been interested in how it does seem to have a bit more of an attitude to it. Copilot is just endlessly patient with every little nitpick and whim you have, but I feel like Claude is constantly going "okay I'm committing and pushing now... oh, oh wait, you're blocking me. What is it you want this time bro?"
nineteen999
8 hours ago
"Don't act, just a question" works for me.
d1sxeyes
8 hours ago
Try /btw
JSR_FDED
6 hours ago
This is the prompt that Claude Code adds when you use /btw
https://github.com/Piebald-AI/claude-code-system-prompts/blo...
nineteen999
8 hours ago
That's not a thing in Claude ... so no.
ashenke
7 hours ago
It actually is; I don't know for how long, but it prompted me to try this a few days ago.
andyferris
8 hours ago
It's new
closewith
8 hours ago
It is in Claude Code, specifically for this use case.
abrookewood
7 hours ago
You can just put it in PLAN mode (assuming VS Code), that works well enough - never seen it edit code when in that state.
112233
3 minutes ago
I tried using Codex, and it is great (meaning: boring) when it works. My problem is that it does not work. Let me explain:
codex> Next I can make X if you agree.
me> ok
codex> I will make X now
me> Please go on
codex> Great, I am starting to work on X now
me> sure, please do
codex> working on X, will report on completion
me> yo good? please do X!
... and so on. Sometimes one round, sometimes four, plus it stops after every few lines to "report progress" and needs another nudge or five. :(
lubujackson
9 hours ago
I feel like people are sleeping on Cursor, no idea why more devs don't talk about it. It has a great "Ask" mode, the debugging mode has recently gotten more powerful, and its plan mode has started to look more like Claude Code's plans, when I test them head to head.
bushido
9 hours ago
Cursor implemented something a while back where it started acting like how ChatGPT does when it's in its auto mode.
Essentially, choosing when it was going to use what model/reasoning effort on its own regardless of my preferences. Basically moved to dumber models while writing code in between things, producing some really bad results for me.
Anecdotal, but the reason I will never talk about Cursor is because I will never use it again. I have barred the use of Cursor at my company; it just does some random stuff at times, which is more egregious than what I see from Codex or Claude.
ps. I know many other people who feel the same way about Cursor and others who love it. I'm just speaking for myself, though.
ps2. I hope they've fixed this behavior, but they lost my trust. And they're likely never winning it back.
sroussey
8 hours ago
Don’t use the “auto” model and you will be fine.
You just described their “auto” behavior, which I’m guessing uses grok.
Using it with specific models is great, though you can tell that Anthropic is subsidizing Claude Code as you watch your API costs more directly. Some day the subsidy will end. Enjoy it now!
And cursor debugging is 10x better, oh my god.
I have switched to 70% Claude Code, 10% Copilot code reviews (non anthropic model), and 20% Cursor and switch the models a bit (sometimes have them compete — get four to implement the same thing at the same time, then review their choices, maybe choose one, or just get a better idea of what to ask for and try again).
clbrmbr
8 hours ago
Same here. Auto mode is NOT ok. Sadly, smaller models cannot be trusted with access to Bash.
ponyous
9 hours ago
In the coworking space I am in, people are hitting limits on the $60 plan all the time. They are thinking about which models to use to be efficient, which context to include, etc…
I’m on claude code $100 plan and never worry about any of that stuff and I think I am using it much more than they use cursor.
Also, I prefer CC since I am terminal native.
adwn
36 minutes ago
Tell them to use the Composer 1.5 model. It's really good, better than Sonnet, and has much higher usage limits. I use it for almost all of my daily work, don't have to worry about hitting the limit of my $60 plan, and only occasionally switch to Opus 4.6 for planning a particularly complex task.
dagss
6 hours ago
I used to love Cursor but as I started to rely on agent more and more it just got way too tedious having to Accept every change.
I ended up spending time just clicking "Accept file" 20x now and then, accepting changes from past 5 chats...
PR reviews and tying review to git make more sense at this point for me than the diff tracking Cursor has on the side.
Cancelling my Cursor subscription before the next card charge, solely due to the review stuff.
hansonkd
9 hours ago
I love to build a plan, then cycle to another frontier model to iterate on it.
onion2k
2 hours ago
> Codex, on the other hand, will follow something I said pages and pages ago, and because it has a much larger context window (at least with the setup I have here at work), it's just better at following orders.
This is important, but as a warning. At least in theory your agent will follow everything that it has in context, but LLMs rely on 'context compacting' when things get close to the limit. This means an LLM can and will drop your explicit instructions not to do things, and then happily do them because they're not in the context any more. You need to repeat important instructions.
tomtomistaken
35 minutes ago
For Claude, writing "let's discuss" at the end of the prompt seems to do it.
AlotOfReading
9 hours ago
I've had some luck taming prompt introspection by spawning a critic agent that looks at the plan produced by the first agent and vetoes it if the plan doesn't match the user's intentions. LLMs are much better at identifying rule violations in a bit of external text than at regulating their own output. Same reason why they generate unnecessary comments no matter how many times you tell them not to.
miohtama
8 hours ago
How does one integrate critic agent to a Codex/Claude?
bentcorner
8 hours ago
I just say something like "spawn an agent to review your plan" or something to that effect. "Red/green TDD" is apparently the nomenclature: https://simonwillison.net/guides/agentic-engineering-pattern...
I've also found it to be better to ask the LLM to come up with several ideas and then spawn additional agents to evaluate each approach individually.
I think the general problem is that context cuts both ways, and the LLM has no idea what is "important". It's easier to make sure your context doesn't contain pink elephants than it is to tell it to forget about the pink elephants.
AlotOfReading
7 hours ago
You can just say "spawn an agent", as the sibling says. I didn't find that reliable enough, so I have a slightly more complicated setup. The first agent has no permissions except spawning agents and reading from a single directory. It spawns the planner to generate the plan, then feeds the plan to the critic and either spawns executors or re-runs the planner with the critic's feedback. The planner can read and write. The critic agent can only read the input and outputs accept/reject with a reason.
This is still sometimes flaky because of the infrastructure around it and ideally you'd replace the first agent with real code, but it's an improvement despite the cost.
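The orchestration loop described above can be sketched roughly like this. Note this is a minimal illustration, not the poster's actual setup: `run_agent` is a stub standing in for a real subagent invocation (e.g. a Claude Code or Codex subprocess launched with role-scoped permissions), and the role names and accept/reject protocol are made up for the example.

```python
# Rough sketch of a planner/critic loop. run_agent is a stub
# so the control flow can run standalone; a real version would
# spawn a subagent with only the permissions its role needs.

def run_agent(role: str, prompt: str) -> str:
    # Stub behaviour standing in for actual model calls.
    if role == "planner":
        return f"PLAN for: {prompt}"
    if role == "critic":
        # Toy rule: accept anything that is actually a plan.
        return "accept" if prompt.startswith("PLAN for:") else "reject: not a plan"
    raise ValueError(f"unknown role: {role}")

def plan_with_critic(user_request: str, max_rounds: int = 3) -> str:
    feedback = ""
    for _ in range(max_rounds):
        plan = run_agent("planner", user_request + feedback)
        verdict = run_agent("critic", plan)
        if verdict == "accept":
            return plan  # hand off to executor agents from here
        feedback = f"\n[critic feedback] {verdict}"
    raise RuntimeError("no acceptable plan within max_rounds")

print(plan_with_critic("add pagination to the API"))
```

The point of the structure is that the critic only ever sees external text (the plan) plus the original request, so it can flag mismatches the planner would rationalize away.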
0xbadcafebee
5 hours ago
This is mostly dependent on the agent because the agent sets the system prompt. All coding agents include in the system prompt the instruction to write code, so the model will, unless you tell it not to. But to what extent they do this depends on that specific agent's system prompt, your initial prompt, the conversation context, agent files, etc.
If you were just chatting with the same model (not in an agent), it doesn't write code by default, because it's not in the system prompt.
thomaslord
7 hours ago
This is extra rough because Codex defaults to letting the model be MUCH more autonomous than Claude Code. The first time I tried it out, it ended up running a test suite without permission which wiped out some data I was using for local testing during development. I still haven't been able to find a straight answer on how to get Codex to prompt for everything like Claude Code does - asking Codex gets me answers that don't actually work.
niobe
4 hours ago
But that's one of the first things you fix in your CLAUDE.md:
- "Only do what is asked."
- "Understand when being asked for information versus being asked to execute a task."
bdangubic
4 hours ago
This - per extensive experiments - works about as well as when I tell my wife to calm down
smackeyacky
an hour ago
Asking might work better than telling
stavros
9 hours ago
I've added an instruction: "do not implement anything unless the user approves the plan using the exact word 'approved'".
This has fixed all of this, it waits until I explicitly approve.
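As a sketch, such a gate could live in a CLAUDE.md section like this (the wording and heading are illustrative, not the poster's actual file):

```markdown
## Planning gate

- Before making any change, present a plan and stop.
- Do not implement anything until the user replies with the exact word "approved".
- Any other reply (questions, "ok", "sure", "go on") means:
  answer or revise the plan, then stop and wait again.
```

Requiring one exact token keeps vague acknowledgements like "ok" from being read as a green light.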
xeckr
9 hours ago
"NOT approved!"
"The user said the exact word 'approved'. Implementing plan."
Terr_
8 hours ago
Relevant comedy scene from Idiocracy (2006):
SsgMshdPotatoes
7 hours ago
Lol it only took 20 years
AnotherGoodName
9 hours ago
There’s an extension to this problem which I haven’t got past. More generally I’d like the agent to stop and ask questions when it encounters ambiguity that it can’t reasonably resolve itself. If someone can get agents doing this well it’d be a massive improvement (and also solve the above).
stavros
9 hours ago
Hm, with my "plan everything before writing code, plus review at the end" workflow, this hasn't been a problem. A few times when a reviewer has surfaced a concern, the agent asks me, but in 99% of cases, all ambiguity is resolved explicitly up front.
skeeter2020
8 hours ago
what gung-ho, talented-but-naive junior developer has ever done that?
clarus
8 hours ago
The solution for this might be to add a ME.md in addition to AGENT.md so that it can learn and write down our character, to know if a question is implicitly a command for example.
chrysoprace
6 hours ago
Maybe I should give Codex a go, because sometimes I just want to ask a question (Claude) and not have it scan my entire working directory and chew up 55k tokens.
hrimfaxi
9 hours ago
> Codex, on the other hand, will follow something I said pages and pages ago, and because it has a much larger context window (at least with the setup I have here at work), it's just better at following orders.
Can you speak more to that setup?
inerte
9 hours ago
Claude Code goes through some internal systems that other tools (Cline / Codex / and I think Cursor) do not. Also we have different models for each. I don't know in practice what happens, but I found that Codex compacts conversations way less often. It might well be that somehow fewer tokens are used/added, rather than a larger raw context window. Sorry if I implied we have more context than whatever others have :)
rsanheim
7 hours ago
Codex does something sorta magical where it auto compacts, partially maybe, when it has the chance. I don’t know how it works, and there is little UI indication for it.
hun3
4 hours ago
Does appending "/genq" work?
Or use the /btw command to ask only questions
parhamn
9 hours ago
I added an "Ask" button to my agent UI (openade.ai) specifically because of this!
user3939382
3 hours ago
Claude Code is perfectly happy to toggle between chat and work if you’re simply clear about which you want. Capital letters aren’t necessary.
darkoob12
9 hours ago
This is not Claude Code. And my experience is the opposite. For me Codex is not working at all to the point that it's not better than asking the chat bot in the browser.
pprotas
2 hours ago
This comment is right, this screenshot is not Claude Code. It’s Opencode.
thomasfromcdnjs
8 hours ago
A lot of people dunking but as this comment says, it is not claude code. (just opus 4.6)
casey2
8 hours ago
For the last 12 months labs have been:
1. check-pointing
2. training till model collapse
3. reverting to the checkpoint from 3 months ago
4. waiting until people have gotten used to the shitty new model
Anthropic said they "don't do any programming by hand" the last 2 years. Anthropic's API has 2 nines.
cmrdporcupine
9 hours ago
I'm back on Claude Code this month after a month on Codex and it's a serious downgrade.
Opus 4.6 is a jackass. It's got Dunning-Kruger and hallucinates all over the place. I had forgotten about the experience (as in the Gist above) of jamming on the escape key "no no no I never said to do that." But also I don't remember 4.5 being this bad.
But GPT 5.3 and 5.4 are a far more precise and diligent coding experience.
sroussey
8 hours ago
The CLI, the extension, or the app?