hackernews client

Claude Code on the web

578 pointsposted 4 months ago

404 Comments

mmaunder

4 months ago

We were heavy users of Claude Code ($70K+ spend per year) and have almost completely switched to codex CLI. I'm doing massive lifts with it on software that would never before have been feasible for me personally, or any team I've ever run. I'll use Claude Code maybe once every two weeks as a second set of eyes to inspect code and document a bug, with mixed success. But my experience has been that initially Claude Code was amazing and a "just take my frikkin money" product. Then Codex overtook CC and is much better at longer runs on hard problems. I've seen Claude Code literally just give up on a hard problem and tell me to buy something off the shelf. Whereas Codex's ability to profoundly increase the capabilities of a software org is a secret that's slowly getting out.

I don't have any relationship with any AI company, and honestly I was rooting for Anthropic, but Codex CLI is just way way better.

Also Codex CLI is cheaper than Claude Code.

I think Anthropic are going to have to somehow leapfrog OpenAI to regain the position they were in around June of this year. But right now they're being handed their hat.

latexr

4 months ago

Feels like with every announcement there’s the same comment: “this LLM tool I’m using now is the real deal, the thing I was using previously and spending stupid amounts of money on looked good but failed at XYZ, this new thing is where it’s at”. Rinse and repeat.

Which means it wasn’t true any of the previous times, so why would it be true this time? It feels like an endless loop of the “friendship ended” meme with AI companies.

https://knowyourmeme.com/editorials/guides/what-is-the-frien...

It’s much more likely commenters are still in the honeymoon hype phase and (again) haven’t found the problems because they’re hyper focused on what the new thing is good at that the previous one wasn’t, ignoring the other flaws. I see that a lot with human relationships as well, where people latch on to new partners because they obviously don’t have the big problem that was a strain on the previous relationship. But eventually something else arises. Rinse and repeat.

iammrpayments

4 months ago

It could also be covert advertising like you see in reddit

jimmydoe

4 months ago

trust your instinct. internet is dead.

joenot443

4 months ago

Actually, there are more internet users today than at any point in history. The internet is far from dead.

The commenter in question is the CTO of the company which makes Wordfence. My instinct says they're not on the OpenAI payroll and you're looking at a normal comment and not advertisement.

I think you should check your priors man; it's worth thinking critically before you toss out accusations like that.

mmaunder

4 months ago

Thanks joenot443.

I suspect the only way to prove that I’m legit to the doubters is to do something a paid shill or a bot would never ever do.

To the commenters who think I’m a shill or bot: fuck every single one of you and the various motherfucking horses you rode in on.

I suspect we may be entering a dystopia where vulgarity is proof of life.

jimmydoe

4 months ago

sorry this triggered you, i regret post what i posted, as i didn't expect it would really upset any hn user, but it did to you, so I'm sorry.

also i was referring to broadly the phenomenon not your post, e.g. even your post is from real human, it's the replies and upvotes push your post to the top.

i don't expect to convince you, but if there's anything I can do to un-upset you, I'd happy to try. :)

Dilettante_

4 months ago

If it payed, shills and bots would use every slur under the sun without a shred of hesitation.

wayeq

4 months ago

> fuck every single one of you and the various motherfucking horses you rode in on.

Maybe you're an LLM trained on 4chan

iammrpayments

4 months ago

I actually think people on 4chan and even reddit don’t get angry that quickly because it’s an anonymous posting board so there’s nothing to be defensive about, unless you’re really invested on an opinion, which makes me suspect this even more, why would he start cursing when he was just bragging about spending 70k?

It I posted on reddit about how I just spent 70k on a watch and someone replied that they didn’t trust me, maybe I would laugh or reply with “whatever”, but never would I reply in anger.

latexr

4 months ago

Maybe you wouldn’t, but something the past few years have made abundantly clear is that that having more money is not correlated to having a thicker skin and being able to ignore criticism. Or if it is, it’s an inverse correlation.

alt187

4 months ago

It's probably the same sort of sanity that prevents you from spending 70k on Cla— on a watch, and prevents you from replying in anger.

mmaunder

4 months ago

lol - touche. Although I suspect that would have been far worse had that been the case.

SpaceNoodled

4 months ago

aw, my horse.

LexiMax

4 months ago

I would agree with you, but mmaunder has a registration date of 2007 and has a bio filled out with easily-verifiable information.

...is what a reasonable argument against would sound like. But in truth, nobody really knows who is running that account. There's nothing stopping anybody from passing off their HN account to someone else, having it stolen from them, or even selling it. They could possibly even be who they say they are, but have an undisclosed vested interest in the thing they're promoting.

Internet communities aren't dead, but social media sure is, and Hacker News is ultimately a social media site.

igneo676

4 months ago

People just aren't used to how LLMs and their tools are developed

Depending on the time of the year you can expect fresh updates from any given company on how their new models and tools perform and they'll generally blow the competition out of the water.

The trick is to either realize that your current tools will just become magically better in a few months OR lean in and switch companies as their tools and models update.

tim333

4 months ago

I think it's just a function of the models getting better. One is the best and then next month another overtakes it as so on.

raducu

4 months ago

> “this LLM tool I’m using now is the real deal".

GPT-5 is not the final deal, but it's incredibly good as is at coding.

Anecdotal, but it's something completely else in terms of capabilities, ignore it at your own peril, but I think it will profoundly change software development.

latexr

4 months ago

> ignore it at your own peril

I’m not arguing for ignoring it, my point is different.

> but I think it will profoundly change software development.

The point is that this is said every time, together with “the previous thing to which the exact same praise was given, wasn’t it”. So it’s several rounds of “yes yes, the previous time the criticisms were right, but this time it’s different, trust me”. So everyone else is justified in being skeptical.

No one wants AI to have the problems it has (technical, ethical, and others). If they didn’t it would be better for everyone. Criticism is a way of surfacing issues so they can be fixed.

And sure, I’ll grant that some people want to bash the other side more than they want to arrive at the truth, but those exist in all sides of the argument (probably in roughly equal measure?). So to have a productive conversation we need to go in with the mindset of “we’re on the same side in the goal of not having this suck”.

blizdiddy

4 months ago

So what is your thesis? The tools keep getting better, so that’s some kind of gotcha that the emporer has no clothes? Some people prefer the absolute latest and greatest so people on the previous gen were all fakers making Pelican svgs?

Maybe the productive thing is actually to ignore naysayers and goalpost movers and use the tools.

You aren’t enlightened for not liking a tool. “Oh, hammers? Absolutely a bubble, after all they never fixed the hit-your-thumb issue i blogged about, and nail guns just let you hurt your thumbs faster”

latexr

4 months ago

No, that is not my thesis, and nowhere in my post do I talk about a dislike for a particular tool or “being enlightened”.

I’ll say it again:

> So to have a productive conversation we need to go in with the mindset of “we’re on the same side in the goal of not having this suck”.

If you’re unwilling to engage in those terms and steel man the argument, I don’t see the point in engaging in conversation. If what you want is to straw man and throw unsubstantiated jabs at someone, there are other communities better suited for that.

raducu

4 months ago

> So it’s several rounds of “yes yes, the previous time the criticisms were right, but this time it’s different, trust me”. So everyone else is justified in being skeptical.

True, but even the boy who cried wolf too many times eventually got his sheep eaten by the wolf.

I have my own personal anecdotal benchmarks and I never hyped LLMs before GPT-5.

Things that simply did not work before GPT-5 no matter how many shots I gave them, GPT-5 breezed through.

For me, it would take at least 2 generations of no felt progress in the models to call for diminishing returns, and I'm not seeing them.

hitarpetar

4 months ago

oh it's anecdata? then I don't care. show me evidence please

zsoltkacsandi

4 months ago

From my experience this mostly happened to Antrophic models, and not because of some honeymoon period, but after the introduction of their models, the model quality and the limits are starting to decline.

Many people are complaning about this on HN and Reddit. I do not have any proof, but there is a pattern, I suppose Antrophic first attracts customers, then starts to optimize costs/margins.

latexr

4 months ago

> Many people are complaning about this on HN and Reddit. I do not have any proof, but there is a pattern, I suppose Antrophic first attracts customers, then starts to optimize costs/margins.

https://en.wikipedia.org/wiki/Enshittification

nickstinemates

4 months ago

Another way to look at it is it's very cut throat. Switching costs are low for now, by design. The ways to get you locked in haven't been developed.

panarky

4 months ago

Results > memes.

If we ship more for less because the new agent doesn't tap out, that's not a honeymoon, it's an upgrade.

palata

4 months ago

"If", indeed. That's the problem of metrics: it depends on what you measure, and sometimes it's hard to get a meaningful answer in the short term.

If you ship more for less, but less maintainable or less correct, then it's not necessarily an upgrade. Always the same question: does it benefit the developer? The product? The company?

It was already possible, without AI, to look like one is doing a great job ("they are producing so much! Let's promote them!") but actually just building a bad codebase. The art being to get the promotion and move to the next step before the project implodes.

Not saying that AI necessarily ends up doing that, but it most certainly help.

Bombthecat

4 months ago

Because there is no switching cost, it's easy to switch, just change the API and be done

sunnybeetroot

4 months ago

With CLI tools like Claude code and codex cli there is some friction due to the change in user experience ie keyboard shortcuts, commands etc

jswny

4 months ago

I find Codex CLI to be very good too, but it’s missing tons of features that I use in Claude Code daily that keep me from switching full time.

- Good bash command permission system

- Rollbacks coupled with conversation and code

- Easy switching between approval modes (Claude had a keybind that makes this easy)

- Ability to send messages while it’s working (Codex just queues them up for after it’s done, Claude injects them into the current task)

- Codex is very frustrating when I have to keep allowing it to run the same commands over and over, Claude this works well when I approve it to run a command for the session

- Agents (these are very useful for controlling context)

- A real plan mode (crucial)

- Skills (these are basically just lazy loaded context and are amazing)

- The sandboxing in codex is so confusing, commands fail all the time because they try to log to some system directory or use internet access which is blocked by default and hard to figure out

- Codex prefers python snippets to bash commands which is very hard to permission and audit

When Codex gets to feature parity, I’ll seriously look at switching, but until then it’s just a really good model wrapped in an okay harness

libraryofbabel

4 months ago

I don't think anyone can reasonably argue against Claude Code being the most full-featured and pleasant to use of the CLI coding agent tools. Maybe some people like the Codex user experience for idiosyncratic reasons, but it (like Gemini CLI) still feels to me rather thrown together - a Claude Clone with a lot of rough edges.

But these CLI tools are still fairly thin wrappers around an LLM. Remember: they're "just an LLM in a while loop with access to tool calls." (I exaggerate, and I love Claude Code's more advanced features like "skills" as much as anyone, but at the core, that's what they are.) The real issue at stake is what is the better LLM behind the agent: is GPT-5 or Sonnet 4.5 better at coding. On that I think opinion is split.

Incidentally, you can run Claude Code with GPT-5 if you want a fair(er) comparison. You need a proxy like LiteLLM and you will have to use the OpenAI api and pay per-token, but it's not hard to do and quite interesting. I haven't used it enough to make a good comparison, however.

ants_everywhere

4 months ago

> but it (like Gemini CLI) still feels to me rather thrown together - a Claude Clone with a lot of rough edges.

I think this is because they see it as a checkbox whereas Anthropic sees it as a primary feature. OpenAI and Google just have to invest enough to kill Anthropic off and then decide what their own vision of coding agents looks like.

paulddraper

4 months ago

You can run the Claude code router and choose the model you want (including based on dynamic conditions)

jessmartin

4 months ago

Can you say more? Link?

paulddraper

4 months ago

https://www.google.com/search?q=claude+code+router

fragmede

4 months ago

Thick or thin, the wrapper so that users aren't manually copy and pasting code around is material to it being used and useful. Plus the systems prompt is custom to each tool and greatly affect how well the tool works.

cpursley

4 months ago

You can actually use Codex right from Claude Code as an MCP without that proxy stuff and it works really well, especially for review or solving things Claude couldn't. Best of both worlds!

jacurtis

4 months ago

Yeah I think the argument is the tooling vs agent. Maybe the OpenAI agent is performing better now, but the tooling is significantly better from anthropic.

The anthropic (ClaudeCode) tooling is best-in-class to me. You listed many features that I have become so reliant on now, that I consider them the Ante that other competitors need to even be considered.

I have been very impressed with the Anthropic agent for code generation and review. I have found the OpenAI agent to be significantly lacking by comparison. But to be fair, the last time I used OpenAI's agent for code was about a month ago, so maybe it has improved recently (not at all unreasonable in this space). But at least a month ago when using them side-by-side the codex CLI was VERY basic compared to the wealth of features and UI in the ClaudeCode CLI. The agents for Claude were also so much better than OpenAI, that it wasn't even close. OpenAI has always delivered me improper code (non-working or invalid) at a very high rate, whereas Claude is generally valid code, the debate is just whether it is the desired way to build something.

Footprint0521

4 months ago

I agree!! But this repo

https://github.com/just-every/code

Fixed all of these in a heartbeat. This has been a game changer

noobly

3 months ago

Code is amazing. I'm not sure why OpenAI isn't using it as their default CLI. I was cancelling my membership and stumbled upon it right before, now I'm dropping my other subs to move to this.

Palmik

4 months ago

I am not sure copying your competitors feature-by-feature is always a good strategy. It can make the onboarding of your competitor's users easier, but lead to a worse product overall.

This is especially the case in a fast moving field such as this. You would not want to get stuck in the same local minimum as your competitor.

I would rather we have competing products that try different things to arrive at a better solution overall.

user

4 months ago

[deleted]

ryuuseijin

4 months ago

I'm using opencode which I think is now very close to covering all the functionality of claude code. You can use GPT5 Codex with it along with most other models.

cpursley

4 months ago

Is there a way to use this with your own openai or anthropic keys?

ryuuseijin

4 months ago

Yes, I only use my own keys. It even lets you use your Claude Max subscription.

user

4 months ago

[deleted]

stared

4 months ago

Claude Code has a lot or UX polish: https://newsletter.pragmaticengineer.com/p/how-claude-code-i...

user

4 months ago

[deleted]

jatora

4 months ago

to fix having to approve commands over and over - use windows WSL. codex does not play nice with permissions/approvals on windows. WSL solves that completely

pkreg01

4 months ago

I totally agree. I remember the June magic as well - almost overnight my abilities and throughput were profoundly increased, I had many weeks of late nights in awe and wonder trying things that were beyond my ability to implement technically but within the bounds of my conceptual understanding.

Initially, I found Codex CLI with GPT-5 to be a substitute for Claude Code - now GPT-5 Codex materially surpasses it in my line of work, with a huge asterisk. I work in a niche industry, and Codex has generally poor domain understanding of many of the critical attributes and concepts. Claude happens to have better background knowledge for my tasks, so I've found that Sonnet 4.5 with Claude Code generally does a better job at scaffolding any given new feature. Then, I call in Codex to implement actual functionality since Codex does not have the "You're absolutely right" and mocked/placeholder implementation issues of CC, and just generally writes clean, maintainable, well-planned code. It's the first time I've ever really felt the whole "it's as good as a senior engineer" hype - I think, in most cases, GPT5-Codex finally is as good as a senior engineer for my specific use case.

I think Codex is a generally better product with better pricing, typically 40-50% cheaper for about the same level of daily usage for me compared to CC. I agree that it will take a genuinely novel and material advancement to dethrone Codex now. I think the next frontier for coding agents is speed. I would use CC over Codex if it was 2x or 3x as fast, even at the same quality level. Otherwise, Codex will remain my workhorse.

thecoppinger

4 months ago

> trying things that were beyond my ability to implement technically but within the bounds of my conceptual understanding

This is a really neat way of describing the phenomenon I've been experiencing and trying to articulate, cheers!

Arisaka1

4 months ago

When I was in high school, I would see the algebra teacher work through expressions and go "ohhh, that makes sense". But when I got back home to work with the homework, I couldn't make the pieces fit.

Isn't that the same? Just because you recognize something someone else wrote and makes you go "ohh, I understand it conceptually" doesn't mean that you can apply that concept in a few days or weeks.

So when the person you responded to says:

>almost overnight *my abilities* and throughput were profoundly increased

I'd argue the throughput did but his abilities really weren't, because without the tool in question you're just as good as before the tool. To truly claim that his abilities were profoundly increased, he has to be able to internalize the pattern, recognize the pattern, and successfully reproduce it across variable contexts.

Another example would be claiming that my painting abilities and throughput were profoundly increased, because I used to draw stick figures and now I can draw Yu-Gi-Oh! cards by using the tool. My throughput was really increased, but my abilities as a painter really haven't.

catigula

4 months ago

>I think, in most cases, GPT5-Codex finally is as good as a senior engineer for my specific use case.

This is beyond bananas to me given that I regularly see codex high and Gpt-5-high both fail to create basic react code slightly off the normal distribution.

hansvm

4 months ago

That might say something about the understandability of the react framework/paradigm ;)

Quality varies a lot based on what you're doing, how you prompt it, how you orchestrate it, and how you babysit and correct it. I haven't seen anything I'd call senior, but I have seen it, for some classes of tasks, turn this particular engineer into many seniors. I still have to supply all the heavy lifting (here's the concurrency model, how you'll ensure exactly-once-delivery, particular functions and classes you definitely want, a few common pitfalls to avoid, etc), but then it can flesh out the details extremely well.

aaronblohowiak

4 months ago

It makes me waaayyyy faster but, like you, that’s because I already know what has to be done.

evilduck

4 months ago

If you really want to see it fail at something easy, try to have write something that can use JSX but doesn't use React (Bun, Hono, etc). Seems like no amount of context management and detailed instructions will keep it from reaching for React-isms.

catigula

4 months ago

Bear AI signal whenever we see glimpses that the reasoning is just pattern matching to artifacts of actual human reasoning.

pkreg01

4 months ago

Do you mind if I ask what kind of React code you're working on? I've had good success using Codex for my frontend development, especially since all of my projects consistently rely on a pretty widely used and well documented component library. I realize that makes my use case fairly narrow, so I don't think I've discovered the limits you have.

catigula

4 months ago

Normal legacy react enterprise application.

Today I was trying to get it to temporarily shim in for development and consume the value of a redux store via merely putting a default in the reducer. Depending on that value, the application would present different state.

It failed to accomplish this and added a disgusting amount of defensive nonsense code in my saga, reducer and component to ensure the value was there. It took me a very short time to correct it but just watching it completely fail at this task was borderline absurd.

pkreg01

4 months ago

Thanks for the context! I feel the same way. When it fails it fails hard. This is why I'm extremely skeptical of any of the non-cli cloud solutions - as you observed, I think the failures compound and cascade if you don't stop them early, which requires a compelling interface and the ability to manually intervene very fast.

bad_haircut72

4 months ago

Im not saying this is a paid endorsement but the internet is dead and I wonder what openAI would pay, if they could, to get such a glowing review as top comment on HN

neya

4 months ago

For what it's worth, I'm not affiliated with Open AI (you can verify by my comment history [1] and account age) and I agree with the top comment. I do Elixir consulting primarily and nothing beats OpenAI's model at the moment for Elixir. Previously, their O3 models were quite decent. But, GPT-5 is really damn good. Claude code will unnecessarily try to complicate a problem solution.

[1] https://news.ycombinator.com/item?id=45491842

dns_snek

4 months ago

This is hilarious because for me Cursor with GPT-5 often generates Elixir that isn't even syntactically correct. It needs to be told not to use return statements, and not to try to index linked lists as arrays. Code is painfully non-idiomatic to the point of being borderline useless even in the simpler cases. Claude Sonnet 4.5 is marginally better, but not by much. Any ambitious overhaul, refactoring or large feature ends in tears and regret.

Neither tool is worth paying even $20 a month for when it comes to Elixir, that's how little value I get out of them, and it's not because I can't afford it.

neya

4 months ago

Gemini is also good, I recommend you try it as well. Usually my workflow is GPT-5 as the primary, but yes, as you mentioned it is not perfect. But Gemini surprisingly compliments GPT-5 for my use cases atleast. It's good at LiveView related stuff, whereas GPT-5 is more of architecting side.

Both LLMs suck if you let it do everything without architecting the solution first. So, I always instruct the high level architecture of how I want something, specifically around how the data should flow and be consumed and what I really want to avoid. With these constraints and bit of some prompt engineering, they are actually quite good.

dns_snek

4 months ago

> Both LLMs suck if you let it do everything without architecting the solution first.

I always do that. Last time I spent an hour planning, going through the requirements, having it ask questions, only for it to completely botch the implementation.

Sure, I can treat it like a junior and spend 2-3 hours planning everything down to the individual function level and it's going to implement it alright. The code will work but it won't be idiomatic. Or I can just do it myself in 3 hours total to a much higher standard of quality, without gambling on a successful outcome, while simultaneously improving my own knowledge, understanding, and abilities.

No matter how I try to use them, agentic coding is always a net negative on my productivity (disposable one-off scripts excluded).

cpursley

4 months ago

Try tidewave.ai, Jose made it (mcp thingy). Works well with GPT-5.

fragmede

4 months ago

btw your website doesn't load

cpursley

4 months ago

It's not my website, but I do use the free mcp with CC.

https://tidewave.ai

fragmede

4 months ago

no i mean https://chasepursley.com

cpursley

4 months ago

Ah, thanks!

johnisgood

4 months ago

Personally I found Claude to be relatively OK at Elixir. With a lot of hand holding. My main problem when it comes to Elixir and Erlang is many amount of files. For that kind of boilerplate, it is good. Otherwise just use "erlang-skels.el" with Emacs. :D

Palmik

4 months ago

I'm not saying this was a paid comment, but if we're going to speculate, we could just as easily ask what Anthropic would pay, if they could, to drown out a strongly pro-OpenAI take sitting at the top of their own promotional HN thread.

That said, you're right that the broader internet (Reddit especially) is heavily astroturfed. It's not unusual to see "What's the best X?" threads seeded by marketers, followed by hoard of suspiciously aligned comments.

But without actual evidence, these kind of meta comments like yours (and mine) are just a cynical noise.

vietvu

4 months ago

I heard this opinion a lot recently. Codex is getting better, and Claude is getting worse so it's must happen sooner or later. Well, it's competition so waiting for Claude to catch up. The web Claude Code is good, but they really need to fix their quota. It's unusable. I would choose a worse model (maybe at 90%), but has better quota and usable. Not to mention GPT-5 and GPT-5-codex seems catch up or even better now.

hluska

4 months ago

Are you really going to call someone a shill? I’d argue that you’re why the internet is dying - a million options and you had to choose the most offensive?

brigandish

4 months ago

The only way to tell human from AI now is disagreeableness, it’s the one thing the GPTs refuse to do. I can’t stand their cloying sycophancy but at least it means that serial complainers will gain some trust, at least for as long as Americans are leading the hunt and deciding to baby us.

dr_dshiv

4 months ago

On the other hand, formulaic disagreement underpins most of modern media; made by humans or not, it ends up as dehumanizing as a train wreck.

user

4 months ago

[deleted]

visiondude

4 months ago

I completely agree with this. The amount of unprompted “I used to love Claude Code but now…” content that follows the exact same pattern feels really off. All of these people post without any prompts for comparison, and OP even refused to share specifics so we have to take his claim as ‘trust me bro’

loveparade

4 months ago

It doesn't feel off to me because that's the exact experience I've had as well. So it's unsurprising to me that many other people share that experience. I'm sure there is a bunch of paid promotion going on for all kinds of stuff on HN (especially what gets onto the front page), but I don't think this is one of those cases.

visiondude

4 months ago

Oh cool, can you share concrete examples of times codex out performed Claude Code? I’m my experience both tools needs to be carefully massaged with context to fulfill complex task.

typpilol

4 months ago

In my experience. Claude wants to try and finish everything as quickly as possible where codex is happy to take 5x the length.

The best answer is each has its uses. Using codex to do bulk edits is dumb because it takes forever, etc etc

loveparade

4 months ago

I don't really see how examples are useful because you're not going to understand the context. My prompt may be something like "We recently added a new transcription backend api (see recent git commits), integrate it into the service worker. Before implementing, create a detailed plan, ask clarifying questions, and ask for approval before writing code"

Does that help you? I doubt it. But there you go.

hluska

4 months ago

Nobody has to give you examples. People can express opinions. If you disagree, that’s fine but requesting entire prompt and response sets is quite demanding. Who are you to be that demanding?

dns_snek

4 months ago

> Who are you to be that demanding?

Let's call it the skeptical public? We've been listening to a group of people rave about how revolutionary these tools are, how they're able to perform senior level developer work, how good their code is, and how they're able to work autonomously through the use of sub-agents (i.e. vibe coding), without ever providing evidence that would support any of those grandiose claims.

But then I use these tools myself[1] and I speak to real developers who have used them and our evaluation centers around lukewarm, e.g. good at straightforward, junior level tasks, or good for prototyping, or good for initially generating tests, or good for answering certain types of questions, or good for one-off scripts, but approximately none of them would trust these LLMs to implement a more complex feature like a mid-level or senior developer would without very extensive guidance and hand-holding that takes longer than just doing it ourselves.

Given the overwhelming absence of evidence, the most charitable conclusion I can come to is that the vast majority of people making these claims have simply gone from being 0.2X developers to being 0.3X developers who happen to generate 5X more code per unit of time.

[1] e.g. my reply to https://news.ycombinator.com/item?id=45651948

ssk42

4 months ago

Context engineering is a critical part of being able to use the tool. And it's ok to not understand how to use a new tool. The different models combined with different stacks require different ways of grappling with the technology. And it all changes! It sucks that you've tried it for your stack (Elixir, whatever that is) in your way and it was disappointing.

To me, the tool inherently makes sense and vibes with my own personality. It allows me to write code that I would otherwise procrastinate on. It allows me to turn ideas into reality, so much faster.

Maybe you're just hyper focused on metrics? Productivity, especially when dealing with code, is hard to quanitfy. This is a new paradigm and so it's also hard to compare apples to oranges. Does this help?

dns_snek

4 months ago

So your take is that every real software developer I know is simply bad at using this magical tool that performs on the level of mid-senior level software engineer in the hands of a few chosen ones? But the chosen ones never build anything in public where it can be observed, evaluated, and critiqued. How unfortunate is that?

The people I talked to use a wide variety of environments and their experience is similar across the board, whether they're working in Nodejs, React, Vue, Ruby, PHP, Java, Elixir, or Python.

> Productivity, especially when dealing with code, is hard to quanitfy.

Indeed, that's why I think most people claiming these obscene benefits are really bad at evaluating their own performance and/or started from a really low baseline.

I always think back to a study I read a while ago where people without ADHD were given stimulant medication and reported massive improvements in productivity but objective measurements showed that their real-world performance was equal to, or slightly lower than their baseline.

I think it's very relevant to the psychology behind this AI worship. Some people are being elevated from a low baseline whilst others are imagining the benefits.

ssk42

4 months ago

People do build in public from vibe-coding, absolutely. This tells me that you have not done your research and just gone off of general guesses or pessimism/frustration from not knowing how to use the tool. The easiest way to be able to find this on Github is to look for where Claude is a contributor. Claude will tag itself in the PR or pushes. Another easy way to that I've seen come up for this is there is a whole "BuildInPublic" tag in the Threads app which has been inundated with Vibe coding. While these might not be in your algorithm, they do exist. You'll be able to see that while there is a lot of crud that there are also products being made are actually versatile, complex, and completely vibe-coded. Most people are not making up these stories. It's very real.

dns_snek

4 months ago

Of course people vibe-code in public - I was clear that I wanted to see evidence of these amazing productivity improvements. If people are building something decent but it takes them 3 or 4 times as long as it would take me, I don't care. That's great for them but it's worthless to me because it's not evidence of a productivity increase.

> there are also products being made are actually versatile, complex, and completely vibe-coded.

Which ones? I'm looking for repositories that are at least partially video-documented to see the author's process in action.

hattmall

4 months ago

I'm not saying it is, but if ANYTHING was the exact combination of prerequisites to be considered paid promotion on HN, this is the type of comment it would be.

hluska

4 months ago

So, let’s see if I get this straight. A highly identifiable person whose company sells a security product is the ideal shill? That doesn’t make any sense whatsoever. On the other hand, someone with a different opinion makes complete sense.

hattmall

3 months ago

Lebron James endorses KIA. Multi-billion dollar companies can afford and benefit from highly identifiable people so I don't really think that argument makes it any less likely to be an endorsement.

user

4 months ago

[deleted]

dbbk

4 months ago

You're absolutely right!

a_victorp

4 months ago

This is an underrated comment

h34t

4 months ago

to be fair, they spent a lot on compute.

WXLCKNO

4 months ago

I agree with this and actually Claude Code agrees with it too. I've had Codex cli (gpt-5-codex high) and claude code 4.5 sonnet (and sometimes opus 4.1) do the same lengthier task with the same prompt in cloned folders about 10x now and then I ask them to review the work in the other folder and determine who did the best job.

100% of the time Codex has done a far better job according to both Codex and Claude Code when reviewing. Meeting all the requirements where Claude would leave things out, do them lazily or badly and lose track overall.

Codex high just feels much smarter and more capable than Claude currently and even though it's quite a bit slower, it's work that I don't have to go over again and again to get it to the standards I want.

pkreg01

4 months ago

I share your observations. It's strange to see Anthropic loosing so much ground so fast - they seemed to be the first to crack long-horizon agentic tasks via what I can only assume is an extremely exotic RL process.

Now, I will concede that for non-coding long-horizon tasks, GPT-5 is marginally worse than Sonnet 4.5 in my own scaffolds. But GPT-5 is cheaper, and Sonnet 4.5 is about 2 months newer. However, for coding in a CLI context, GPT-5-Codex is night-and-day better. I don't know how they did it.

typpilol

4 months ago

Every since 4.5, I can't get Claude to do anything that takes a while

4.0 would chug a long for 40 mins. 4.5 refuses and straight up says the scope is too big sometimes.

My theory is anthropic is super compute constrained and even though 4.5 is smarter, the usage limits and it's obsession with rushing to finish was put in mainly to save their servers compute.

swah

4 months ago

I haven't been able to get anything done with Codex. Claude Code is fast and "gets it". Also does better at running and testing its own stuff.

Its very odd because I was hoping they were very on par.

didibus

4 months ago

Same, I find Codex not good to be honest. I have better success manually copy/pasting into GPT5 chat. There's something about Codex that just wants to change everything and use the weirdest tool commands.

It also often fails to escalate a command, it'll even be like, oh well I'm in a sandbox so I guess I can't do this, and will just not do it and try to find a workaround instead of escalating permission to do the command.

jacurtis

4 months ago

The last time I used them both side by side was a month ago, so unless its significantly improved in the past month, I am genuinely surprised that someone is making the argument that Codex is competitive with ClaudeCode, let alone it somehow being superior.

ClaudeCode is used by me almost daily, and it continues to blow me away. I don't use Codex often because every time I have used it, the output is next to worthless and generally invalid. Even if it does get me what I eventually want, it will take much more prompting for me to get the functioning result. ClaudeCode on the other hand gets me good code from the initial prompt. I'm continually surprised at exactly how little prompting it requires. I have given it challenges with very vague prompts where it really exceeds my expectations.

clarkmoreno

4 months ago

OpenAI astroturfing is a real thing. It's all over Twitter. Unsurprising but still wild to see it here on HN.

barneybooroo

4 months ago

I think the enthusiasm for Codex coincided with the extended period of degraded quality CC was experiencing around a couple of months ago? During that time I cancelled my Claude sub and tried out Codex, which by comparison was feeling significantly better. I haven't tried them out side by side since Claude has been de-borked but even if Codex is objectively poorer I could believe that flattering comparison has stuck for people who switched?

acangiano

4 months ago

I use both but I agree that they are generally not on par. I find Claude Code does a better job and doesn't overengineer as much. Where sometime Codex does better is in debugging a tough bug that stumps Claude Code. Codex is also more likely to get lazy and claiming to have finished a large task, when it reality it just wrote some placeholder lines. Claude has never done that. They might be on par soon, however, and I think Anthropic is playing a dangerous game with their limit enforcement on people who are on subscriptions.

veidr

4 months ago

Me too, but I know it's not just people shilling, or on the take, because a bunch of people I know personally have moved from Claude Code to Codex, and say it's better.

For me, though, it's not remotely close. Codex has fucked up 95% of the 50-or-so tasks I asked it to do, while Claude Code fucks up only maybe 60%.

I'm big on asking LLMs to do the first major step of something, and then coming back later, and if it looks like it kinda sucks, just Ctrl-C and git revert that container/folder. And I also explicitly set up "here are the commands you need to run to self-check your work" every time. (Which Codex somewhat weirdly sometimes ignores with the explicit (false) claim that it skipped that step because it wasn't requested... hmm.)

So, those kinds of workflow preferences might be a factor, but I haven't seen Codex ever be good yet, and I regret the time I invested trying it too early.

cesarvarela

4 months ago

Can you share an example of the tasks you found Codex being much better? From my experience Claude Code is much better.

intellectronica

4 months ago

Codex works much better for long-running tasks that require a lot of planning and deep understanding.

Claude, especially 4.5 Sonnet, is a lot nicer to interact with, so it may be a better choice in cases where you are co-working with the agent. Its output is nicer, it "improvises" really well even if you give it only vague prompts. That's valueable for interactive use.

But for delegating complete tasks, Codex is far better. The benchmarks indicate that, as do most practicioners I talk to (and it is indeed my own experience).

In my own work, I use Codex for complete end-to-end tasks, and Claude Sonnet for interactive sessions. They're actually quite different.

incoming1211

4 months ago

I disagree, Codex always gets stuck and wants to double check and clarify things, its like "dammit just execute the plan and don't tell me until its completely finished"

The output of codex is also not as great. Codex is great at the planning and investigation portion but sucks at execution and code quality.

ewoodrich

4 months ago

I've been dealing with this on Codex a lot lately. It confidently wraps up a task, I go to check it's work... and it's not even close.

Then I do a double take and re-read the summary message and realize that it pulled a "and then draw the rest of the owl", seemingly arbitrarily picking and choosing what it felt like doing in that session and what it punted over to "next steps to actually get it running".

Claude is more prone to occasional "cheating" with mocked data or "tbd: make this an actual conditional instead of hardcoded If True" stuff when it gets overwhelmed which is annoying and bad. But it at least has strong task adherence for the user's prompt and doesn't make me write a lawyer-esque contract to avoid any loopholes Codex will use to avoid doing work.

aaronblohowiak

4 months ago

Are you using something like spec-kit?

shmoogy

4 months ago

Can / Does Codex actually check docker logs and other things for feedback while iterating on something that isnt working ? That is where the true magic of Claude comes for me. Often things cant be one shot, but being able to iteratively check logs, make an adjustment, rebuild the docker containers, send a curl, and confirm fixed is huge improvement.

intellectronica

4 months ago

Yes, in this regard it's very similar. It works as an agent and does whatever you need it to do to complete the task. In comparison to Claude it tends to plan more and improvise less.

mordymoop

4 months ago

I'm on the same page here. I have seen this sentiment about Codex suddenly being good a few times now, so I booted Codex CLI thinking-high back up after a break and asked it to look for bugs. It promptly found five bugs that didn't actually exist. It was the kind of truly impressively stupid mistake that I haven't seen Claude Code make essentially ever, and made me wonder if this isn't the sort of thing that's making people downplay the power of LLMs for agentic coding.

stavros

4 months ago

I asked Sonnet 4.5 to find bugs in the code, it found five high-impact bugs that, when I prompted it a second time, it admitted weren't actually bugs. It's definitely not just Codex.

throwaway-0001

4 months ago

In my case codex fixed a bug in one shot. Took 10 min to debug and find it.

Claude struggled long time and still didn’t find.

simplify

4 months ago

Same here. I tried codex a few days ago for a very simple task (remove any references of X within this long text string) and it fumbled it pretty hard. Very strange.

fragmede

4 months ago

yeah I'm in the same boat. Codex can't do this one task, and constantly forgets what I've told it, and I'm reading these comments saying how is so great to the point that I'm wondering if I'm the one taking the crazy pills. Maybe we're being A/B tested and don't know about it?

hattmall

4 months ago

No, no one that's super boosting the LLMs ever tells you what they are working on or give any reasonable specifics about how and why it's beneficial. When someone does, it's a fairly narrow scope and typically inline with my experience.

They can save you some time by doing some fairly complex basic tasks that you can write in plain language instead of coding. To get good results you really need a lot of underlying knowledge yourself and essentially, I think of it as a translator. I can write a program in very good detail using normal language and then the LLM can convert it to code with reasonable accuracy.

I haven't been able to depend on it to do anything remotely advanced. They all make up API endpoints or methods or fill in data with things that simply don't exist, but that's the nature of the model.

fragmede

4 months ago

You misread me. I'm one of the people you're complaining about. Claude code has been great in my experience and no I don't have a GitHub repo of code that's been generated for you to tell me that's trivial and unadvanced and that a child could do it.

What I'm saying was to compare my experience with Claude code vs Codex with GPT-5. CC's better than codex in my experience, contrary to GP's comment.

FuckButtons

4 months ago

Maybe, just maybe, people are lying on the internet. And maybe those people have a financial interest in doing so.

the_duke

4 months ago

IMO gpt5-codex medium is much better as soon as the task becomes slightly complex, or the context grows a bit.

Sora 4.5 tends to randomly hallucinate odd/inappropriate decisions and goes to make stupid changes that have to be patched up manually.

jacurtis

4 months ago

Yes Sora hallucinates significantly more than Claude.

I find that Codex generally requires me to remove code to get to what I want, whereas Claude I tend to use what it gives me and I add to it. Whether this is from additional prompting or from manual typing, i just find that codex requires removal to get to desired state, and Claude requires adding to get to desired state. I prefer adding incrementally than removing.

Palmik

4 months ago

Curiously, you yourself did not provide an example where, from your experience, Claude Code was much ebtter.

mmaunder

4 months ago

I can not. We're all racing very hard to take full advantage of these new capabilities before they go mainstream. And to be honest, sharing problem domains that are particularly attractive would be sharing too much. Go forth and experiment. Have fun with it. You'll figure it out pretty fast. You can read my other post here about the kinds of problem spaces I'm looking at.

deadbabe

4 months ago

Ah, super secret problem domains that have been thoroughly represented in the LLM training data. Nice.

aprilthird2021

4 months ago

Why would you even comment that Codex CLI is potentially worth switching an enormous amount of spend over ($70k) and give literally 0 evidence of why it's better? That's all you've got? "Trust me bro"?

mmaunder

4 months ago

I'm seeing the downvotes. I'm sorry folks feel that way. I'm regretting my honesty.

Edit: I'd like to reply to this comment in particular but can't in a threaded reply, so will do that here: "Ah, super secret problem domains that have been thoroughly represented in the LLM training data. Nice."

This exhibits a fundamental misunderstanding of why coding agents powered by LLMs are such a game changer.

The assumption this poster is making is that LLMs are regurgitating whole cloth after being trained on whole cloth.

This is a common mistake among lay people and non-practitioners. The reality is that LLMs have gained the ability to program, by learning from the code of others. Much like a human would learn from the code of others, and then be able to create a completely novel application.

The difference between a human programmer an an agentic coder is that the agent has much broader and deeper expertise across more programming languages, and understands more design patterns, more operating systems, more about programming history, etc etc and it uses all this knowledge to fulfill the task you've set it to. That's not possible for any single human.

It's important for the poster to take two realities on board: Firstly, agentic coding agents are not regurgitating whole cloth from whole cloth. Instead they are weaving new creations because they have learned how to program. Secondly, agentic coding agents have broader and deeper knowledge than any human that will ever exist, and they never tire, and their mood and energy level never changes. In fact that improves on a continuous basis as the months go by and progress continues. This means we can, as individual practitioners or fast moving teams, create things that were never before possible for us without raising huge amounts of money and hiring large very expensive teams, and then having the overhead of lining everyone up behind a goal AND dealing with the human issues that arise, including communication overhead.

This is a very exciting time. Especially if you're curious, energetic, and are willing to suspend disbelief to go and take a look.

nik_0_0

4 months ago

I don't have any particular horse in this race, but looking at this exchange, I hope its clear where the issue is coming from.

The original post states "I am seeing Codex do much better than Claude Code", and when asked for examples, you have replied with "I don't have time to give you examples, go do it yourself, its obvious."

That is clearly going to rub folks (anyone) the wrong way. This refrain ("Wheres the data?") pops up frequently on HN, if its so obvious, giving 1 prompt where Codex is much greater than Claude doesn't seem like a heavy lift.

In absence of such an example, or any data, folks have nothing to go on but skepticism. Replying with such a polarizing comment is bound to set folks off further.

Vegenoid

4 months ago

We've all been hearing from people talking about how amazing AI coding agents are for a while now. Many skeptics have tried them out, looked into how to make good use of them, used modern agentic tools, done context engineering, etc. and found that they did not live up to the claims being made, at least for their problem domain.

Talk is cheap, and we're tired of hearing people tell us how it's enabling them to make incredible software without actually demonstrating it. Your words might be true, or they might be just another over-exaggeration to throw on the pile. Without details we have no way of knowing, and so many make the empirically supported choice.

chaboud

4 months ago

I agree. It’s pretty easy to put-up or shut up.

I recently vibe coded a video analysis pipeline with some related arduino-driven machine control. It was work to prototype an experience on some 3D printed hardware I’ve been skunking out.

By describing the pipeline and filters clearly, I had the analysis system generating useful JSON in an hour or so, including machine control simulation, all while watching TV and answering emails/slacks. Notable misses were that the JSON fields were inconsistent, and the python venvs were inconsistent for the piped way that I wanted the system to operate with.

Small fixes.

Then I wired up the hardware, and the thing absolutely crapped itself, swapping libraries, trying major structural changes, and creating two whole new copies of the machine control host code (asking me each time along the way). This went on for more than three hours, with me debugging the mess for about 20 minutes before resorting to 1) ChatGPT, which didn’t help, followed by 2) a few minutes of good old fashioned googling on serial port behavior on Mac, which, with an old sitting on the shelf Uno R3, meant that I needed to use the cu.* ports instead of tty.*, something that Claude Code had buried deeply in a tangle of files.

Curious about the failure, I told Claude Code to stop being an idiot and use a web browser to go research the problem of specifically locking up on the open operation. 30 seconds later, and with some reflective swearing from Opus 4.1, which I appreciate, I had the code I should have had 3 hours prior (along with other garbage code to clean up).

For my areas of sensing, computer vision, machine learning, etc., these systems are amazingly helpful if the algorithms can be completely and clearly described (e.g., Kalman filter to IoU, box blur followed by subsampling followed by split exponential filtering, etc.).

Attempts to let the robots work complex pipelines out for themselves haven’t gone as well for me.

com2kid

4 months ago

I just had Claude code convert all my personal projects over to be dockerized, and then setup the deployment infra and scripts for everything, and finally move my server off of the nightmare nginx config file I was using.

zamadatix

4 months ago

Never hold regret for having honesty, it tends to lose its value completely if you only care about it when you have good news to deliver. If for anything, hold regret for when you didn't have something better appreciated to be honest about.

The easier threading-focused approach to the conversation might be to add the additional comment as an edit at the end of the original and reply to the child https://news.ycombinator.com/item?id=45649068 directly. Of course, I've broken the ability to do that by responding to you now about it ;).

mmaunder

4 months ago

Thanks. I wasn't able to reply in a thread earlier - I guess HN has a throttle on that. So I edited the comment above to add a few more thoughts. It's a very exciting time to be alive.

jamiek88

4 months ago

Just click on the time. Where yours says ‘two hours ago’ now, if you click on that you can reply directly to any sub comment in a thread.

mmaunder

4 months ago

lol, thanks.

johnfn

4 months ago

You’re getting downvoted because the amount of weight I place on your original comment is contingent on whether or not you’re actually using AI to do meaningful work ot not. Without clarifying what you’re doing, it’s impossible to distinguish you from one of those guys that says he’s using AI to do tons of work and then you peek under the hood and he’s made like 15 markdown files and his code is a mess that doesn’t do anything.

Well, that, and it’s just a bit annoying to claim that you’ve found some amazing new secret but that you refuse to share what the secret is. It doesn’t contribute to an interesting discussion whatsoever.

preommr

4 months ago

> I'm seeing the downvotes. I'm sorry folks feel that way. I'm regretting my honesty.

What honesty? We're not at the point of "the Godfather was a good/bad movie", we're at "no, trust, there's a really good movie called the Godfather".

Your honesty means nothing for an issue that isn't about taste or mostly subjectivness. How useful AI is and in what way is a technical discussion where the meat of the subject matter is. You've shared nothing on that front. I am not saying you have to, but like obviously people are going to downvote you - not because they might agree/disagree but because it's contributed nothing different from every other ai-hype man selling a course or something.

kobe_bryant

4 months ago

this is absurd. no one needs or wants your AI generated answer that's a whole lot of nothing

mmaunder

4 months ago

Comments like this reveal the magnitude of polarization around this issue in tech circles. Most people actually feel this kind of animosity towards AI, and so having comment threads like this even be visible on HN is unusual. Needless to say, all my comments here are hand written. But the poster knows that, of course.

maherbeg

4 months ago

Yeah this has been my experience as well. The Claude Code UI is still so much better, and the permissioning policy system is much better. Though I'm working on closing that gap by writing a custom policy https://github.com/openai/codex/blob/main/codex-rs/execpolic...

Kinda sick of Codex asking for approval to run tests for each test instance

mmaunder

4 months ago

Ah the tension between cybersecurity best practices and productivity is brutal right now.

maherbeg

4 months ago

lol yeah, but mostly just want to allow more types of reads for getting context, and primarily for test running / linting etc. I shouldn't have to approve every invocation of `pytest` or `bazel test`.

fragmede

4 months ago

--dangerously-bypass-approvals-and-sandbox isn't enough for you?

maherbeg

4 months ago

I don't want unlimited writes. I basically want to unlock nearly everything but approve writes in some scenarios.

fragmede

4 months ago

Where do unix permissions and a different user and extended attributes fall short for that?

rtfeldman

4 months ago

You don't have to use Codex in its terminal UI - e.g. you can use it in the Zed IDE out-the-box:

https://zed.dev/blog/codex-is-live-in-zed

PantaloonFlames

4 months ago

And also in emacs or neovim

https://xenodium.com/introducing-acpel

lherron

4 months ago

Still a toss-up for me which one I use. For deep work Codex (codex-high) is the clear winner, but when you need to knock out something small Claude Code (sonnet) is a workhorse.

Also CC tool usage is so much better! Many, many times I’ve seen Codex writing a python script to edit a file which seems to bypass the diff view so you don’t really know what’s going on.

bcrosby95

4 months ago

Yeah, after correcting it several times I've gotten Claude Code to tell me it didn't have the expertise to work in one of my problem domains. It was kinda surprising but also kinda refreshing that it knew when to give up. For better or worse I haven't noticed similar things with Codex.

mmaunder

4 months ago

I've chosen problems with non-negotiable outcomes. In other words, problem domains where you either are able to clearly accomplish the very hard thing, or not, and there's no grey area. I've purposely chosen these kinds of problems to prove what AI agents are capable of, so that there is no debate in my mind. And with Codex I've accomplished the previously impossible. Unambiguously. Codex did this. Claude gave up.

It's as if there are two vendors saying they can give up incredibly superpowers for an affordable price, and only one of them actually delivers the full package. The other vendor's powers only work on Tuesdays, and when you're lucky. With that situation, in an environment as competitive as things currently stand, and given the trajectory we're on, Claude is an absolute non-starter for me. Without question.

Aeolun

4 months ago

I don’t think Claude is actually incapable, you just spend a lot of time telling it to yes, please actually do the difficult thing. Do not give up halfway through.

Codex says “This is a lot of work, let me plan really well.”

Claude says “This is a lot of work, let me step back and do something completely different that you didn’t ask for.”

corndoge

4 months ago

Can you expound a bit on the problem domains? I am curious

skybrian

4 months ago

We need product reviewers who can demonstrate things like this in public. Without details, "it works for me on my projects" only goes so far.

kelvinjps10

4 months ago

I did the opposite I switched to Claude code once the released the new model last week of the one before, I tried using codex, but there was issues with the terminal and prompting (multiple characters getting deleted) I found Claude code to have more features and less bugs, like the edit on vim for the prompt being really useful and find it better to iterate. Also I like more its tool usage and the use of the shell. Sometimes codex prefer to use python instead of doing the equivalent shell command. Maybe it's like the other people say here, that codex it's better for long running tasks, I prefer to give Claude small tasks and I'm usually satisfied with the result and I like to work alongside the agent

catigula

4 months ago

This is such an interesting perspective because I feel codex is hugely impressive but falls apart on any even remotely difficult task and is too autonomous and not eager enough.

Claude feels like a better fit for an experienced engineer. He's a positive, eager little fellow.

hn_saver

4 months ago

How did you spend $70k per year for a tool that's not a single year old?

TkTech

4 months ago

API pricing rates probably. If I take a look at my current usage since it came out, it'd be about $12000 CAD if paid at API rates. Ridiculously easy to rack up absurd bills via the API, and I'm mostly just using it for code review. Someone using it heavily could easily, easily get way over 70k.

tstrimple

4 months ago

Also the statement was "We". It's not a single user's billable usage and we have zero details as to how many people made up "We". So any analysis into the cost or value are meaningless.

koakuma-chan

4 months ago

Why did they not buy a subscription? It would be a flat fee.

NiloCK

4 months ago

At that spend, no subscription is available to serve that much traffic - they are all rate limited.

I understand the 70K spend as a corporate expense, not an individual... right?

koakuma-chan

4 months ago

I haven't checked but it would make sense if each developer had their own rate limit.

spoiler

4 months ago

My experience is that if I know what I want, CC will produce better code, given I specify it correctly. The planning mode is great for this too, as we can "brainstorm" and what I have seen help a lot is if I ask questions about why it did a certain way. Often it'll figure out on its own why that's wrong, but sometimes it requires a bit of course correction.

On the other hand, last time I tried GPT-5 from Cursor, it was so disappointing. It kept getting confused while we were iterating on a plan, and I had to explain to it multiple times that it's thinking about the problem the same way. After a while I gave up, opened a new chat and gave it my own summary of the conversation (with the wrong parts removed) and then it worked fine. Maybe my initial prompt was vague, but it continually seemed to forget course corrections in that chat.

I mostly tend to use them more to save me from typing, rather than asking it to design things. Occasionally we do a more open ended discussion, but those have great variance. It seems to do better with such discussions online than within the coding tool (I've bounced maths/implementation ideas off of while writing shaders on a personal project)

baq

4 months ago

gpt-5-high is amazing, but so slow I'll revert to sonnet when I know what I need done on a low level.

when making boilerplatish changes in the product in areas I'm not familiar with (it's a large codebase) gpt-5-high is a monster.

virtualritz

4 months ago

When you say Claude Code, what model do you refer to? CC with Opus still outperforms Codex (gpt-5-codex) for me for anything I do (Rust, computer graphics-related).

However, Anthropic restricted Opus use for Max plan users 10 days or so ago severly (12-fold from 40h/week down to 5h week) [1].

Sonnet is a vastly inferioir model for my use cases (but still frequently writes better Rust code than Codex). So now I use Codex for planning and Sonnet for writing the code. However, I usually need about 3--5 loops with Codex reviewing, Sonnet fixing, rinse & repeat.

Before I could use one-shot Opus and review myself directly, and do one polish run following my review (also via Opus). That was possible from June--mid October but no more.

[1] https://github.com/anthropics/claude-code/issues/8449

deaux

4 months ago

Agreed that Opus is stronger than Sonnet 4.5 and GPT-5 High. It's the bitter pill - bigger, more expensive models are just "smarter", even if it doesn't always show in synthetic benchmarks. Similar with o1-pro (now almost a year old, an eternity in this space) vs GPT-5 high. There's also GPT-5 Pro now, which comes at an API cost of $120/M output, and is also noticeably smarter, just like Opus.

They all like to push synthetic benchmarks for marketing, but to me there's zero doubt that both Anthropic and OpenAI are well aware that they're not representative of logical thinking and creativity.

p337

4 months ago

On the topic of comparing OpenAI models with Anthropocene models, I have a hybrid approach that seems really nice.

I set up an MCP tool to use gpt-5 with high reasoning with Claude Code (like tools with "personas" like architect, security reviewer, etc), and I feel that it SIGNIFICANTLY amplifies the performance of Claude alone. I don't see other people using LLMs as tools in these environments, and it's making me wonder if I'm either missing something or somehow ahead of the curve.

Basically instead of "do x (with details)" I say "ask the architect tool for how you should implement X" and it gets into this back and forth that's more productive because it's forcing some "introspection" on the plan.

jrk

4 months ago

This is an established, though advanced, idea.

Sourcegraph Amp (https://sourcegraph.com/amp) has had this exact feature built in for quite a while: "ask the oracle" triggered an O1 Pro sub-agent (now, I believe, GPT-5 High), and searching can be delegated to cheaper, faster, longer-context sub-agents based on Gemini 2.5 Flash.

Rebuff5007

4 months ago

> We were heavy users of Claude Code ($70K+ spend per year)

Claude code has only been generally available since May last year (a year and half ago)... I'm surprised by the process that you are implying; within a year and a half, you both spent 70k on claude code, and knew enough about it and its competition to switch away from it? I dont think I'd be able to due diligence even if LLM evaluation was my fulltime job. Let alone the fact that the capabilities of each provider are changing dramatically every few weeks.

CompoundEyes

4 months ago

Claude Code is still good but I don’t TRUST it. With Claude Code and Sonnet I’m expecting failure. I can get things done but there’s an administrative overhead of futzing around with markdown files, defensive commit hooks and unit tests to keep it on rails while managing the context panic. Codex CLI with gpt-5-codex high reasoning is next gen. I’m sure Sonnet 5 will match it soon. At that point I think a lot of the workflows people use in Claude Code will be obsolete and the sycophancy will disappear.

didibus

4 months ago

Interesting, I find codex CLI is really bad, like the worst coding agent I've tried.

Fails to escalate permissions, gets derailed, loves changing too many things everywhere.

GPT5 is good, but codex is not.

dudeinhawaii

4 months ago

In agreement. Large caveats that can explain differing opinions (that I've experienced) are:

* Is really only magic on Linux or WSL. Mediocre on Windows

* Is quite mediocre at UI code but exceptional at backend, engineering, ops, etc. (I use Claude to spruce up everything user facing -- Codex _can_ mirror designs already in place fairly well).

* Exceptional at certain languages, OK at others.

* GPT-5 and GPT-5-Codex are not the same. Both are models used by the Codex CLI and the GPT-5-Codex model is recent and fantastically good.

* Codex CLI is not "conversational" in the way that Claude is. You kind of interact with it differently.

I often wonder about the impact of different prompting styles. I think the WOW moment for me is that I am no longer returning to code to find tangled messes, duplicate silo'd versions of the same solution (in a different project in the same codebase), or strangely novice style coding and error handling.

As a developer for 20yrs+, using Codex running the GPT-5-Codex model has felt like working with a peer or near-peer for the first time ever. I've been able to move beyond smaller efforts and also make quite a lot of progress that didn't have to be undone/redone. I've used it for a solid month making phenomenal progress and able to offload as-if I had another developer.

Honestly, my biggest concern is that OpenAI is teasing this capable model and then pulls the rug in a month with an "update".

As for the topic at hand, I think Claude Code has without a doubt the best "harness" and interface. It's faster, painless, and has a very clean and readable way of laying out findings when troubleshooting. If there were a cheap and usable version of Opus... perhaps that would keep Claude Code on the cutting edge.

tstrimple

4 months ago

> I've seen Claude Code literally just give up on a hard problem and tell me to buy something off the shelf

I've been seeing more of this lately despite initial excellent results. Not sure what's going on, but the value is certainly dropping for me. I'll have to check out codex. CLI integration is critical for me at this point. For me it is the only thing that actually helps realize the benefits of LLM models we have today. My last NixOS install was completely managed by Claude Code and it worked very well. This was the result of my latest frustrations:

https://i.imgur.com/C4nykhA.png

Though I know the statement it made isn't "true". I've had much better luck pursuing other implementation paths with CC in the same space. I could have prompted around this and should have reset the context much earlier but I was drunk "coding" at that point and drove it into a corner.

slaymaker1907

4 months ago

I haven’t used Codex a lot, but GPT-5 is just a bit smarter in agent mode than Claude 4.5. The most challenging thing I’ve used it for is for code review and GPT-5 somewhat regularly found intricate bugs that Claude missed. However, Claude seemed to be better at following directions exactly vs GPT-5 which requires a lot more precision.

mi_lk

4 months ago

What model are you using respectively? Not sure I share your observations

mmaunder

4 months ago

Have tried all and continue to eval regularly. I spend up to 14 hours a day. Currently recovering from a herniated disk because I spent 6 weeks sitting at a dining room table, 14 hours a day, leaning foward. Don't do that. lol. So my coverage is pretty good. I'm using GPT5-codex-high for 99% of my work. Also I have a team of 40 folks, about a third of which are software engineers and the other third are cybersecurity analysts, so I get feedback from them too and we go deep on our engineering calls re the latest learnings and capabilities.

unsupp0rted

4 months ago

This was my experience too until a couple weeks ago, when Codex suddenly got dumbed down.

Initially, I had great success with codex medium- I could refactor with confidence, code generally ran on the first or second try, etc.

Then when that suddenly dumbed down to Claude Sonnet 3.5 quality I moved to GPT5 High to get back what had been lost. That was okay for a few days. Now GPT5 High has dropped to Claude Sonnet 3.5 quality.

There's nothing left to fallback to.

durron

4 months ago

Do you find this to still be true with the Sonnet 4.5 model?

extr

4 months ago

IMO Sonnet 4.5 is great but it just isn’t as comprehensive of a thinker. I love Anthropic and primarily use CC day to day but for any tricky problems or “high stakes, this must not have bugs” issues, I turn to Codex. I do find if you let Codex run on it its own too long it will produce comparably sloppy or lacking-in-vision type issues that people criticize Sonnet for, however.

PantaloonFlames

4 months ago

That’s a curious approach. Why would you use both? Why not just use the more reliable dependable option for all purposes?

extr

4 months ago

Sonnet 4.5/CC is faster, more direct, and is generally better at following my intent rather than the letter of my prompt. A large chunk of my tasks are not "solve this concurrency bug" or "write this entire feature" but rather "CLI ops", merging commits, running a linter, deploying a service, etc. I almost use it like it was my shell.

Also while not quite as smart, it's a better pair programmer. If I'm feeling out a new feature and am not sure how exactly it should work yet, I prefer to work with Sonnet 4.5 on it. It typically gives me more practical and realistic suggestions for my codebase. I've noticed that GPT-5 can jump right into very sophisticated solutions that, while correct, are probably not appropriate.

Sonnet 4.5: "Why don't we just poll at an interval with exponential backoff?"

GPT-5: "The correct solution is to include the data in the event stream...let us begin by refactoring the event system to support this..."

That said, if I do want to refactor the event system, I definitely want to use Codex for that.

deaux

4 months ago

Strangely enough this is one of the first times here I see someone with the exact same experience. GPT-5 is very prone to a style that would for most codebases be overengineering. I think as a large part of HN works on huge enterprise FAANG-like code, this is where it shines, so here it gets rave reviews of just being the best overall. But globally, for most developers, it's overengineering and adds a lot of unnecessary code to maintain. Sonnet in that sense remains "every man's coder". I've gone back from 4.5 to 4 now, having spent a good chunk of time with 4.5 it just seems like a slight overall regression with no real upsides besides being a little faster than 4.

extr

4 months ago

Glad I'm not crazy, the tide right now of codex > sonnet is overwhelming. Frankly I think what most people go by is "does the code work" - codex is admittedly relentless. It's very good at producing code that works. But "does it work" is not the end-all-be-all in most cases...

macNchz

4 months ago

I frequently have multiple coding assistants going at once—Gemini 2.5 Pro via Aider as the workhorse for most standard changes, Sonnet 4.5 via Claude Code for question answering, documentation, test case development, or broad based changes to many files in a project, then GPT-5 for more complex diagnostic or architectural type things—I don’t generally like the code it writes, but it will often be able to fix situations where the other models get stuck in some kind of local maxima.

NiloCK

4 months ago

Even inside the claude-code ecosystem, more than ever there are tradeoffs on raw speed vs intelligence vs cost.

Moving a bunch of verbose templated HTML around while watching results on a devserver? Haiku all day. It's a bonus that it's cheaper, but the real treat is its speed.

Adding a feature whose planning will involve intake of several files? Sonnet.

Working specifically on 'copy' or taste issues? Still I tend to prefer Opus here.

Individual experiences may vary!

wrs

4 months ago

In my experience, there isn’t a model that is more dependable for all purposes. They each have some unique strengths.

theshrike79

4 months ago

I'm like 80% sure Sonnet 4.5 is just rebranded Opus.

Sonnet 4 was a coding companion, I could see what it was doing and it did what I asked.

Sonnet 4.5 is like Opus, it generates massive amounts of "helper scripts" and "bootstrap scripts" and all kinds of useless markdown documentation files even for the tinies PoC scripts.

deaux

4 months ago

It's very much not, so I'm more than happy to take that bet - how much are we wagering? Have you ever used each for non-coding tasks?

The generation of helper, markdown and bootstrap scripts are very dependent on your harness.

theshrike79

4 months ago

I paid for "Claude Code", I'm not asking it for stuff about the Mesopotamian empire :)

esafak

4 months ago

I don't. Sonnet is faster too.

mmaunder

4 months ago

Yes. Sadly. And it really does make me sad. I was rooting for Anthropic. Still kinda am.

bgirard

4 months ago

I have a very similar experience. I was heavily invested in Anthropic/Claude Code, and even after Sonnet 4.5, I'm finding that Codex is performing much better for my game development project.

mmaunder

4 months ago

It seems particularly good at high performance programming in low level languages.

pinkbanana21

4 months ago

I found using z.ai a performant and cheap alternative for some tasks: https://z.ai/subscribe?ic=VWKNBI8LR8

Costs are 6x cheaper and it's way faster and good at test writing and tool calling. It some times can be a bit messy though so use Gemini or Claude or codex for that hard problems....

sabareesh

4 months ago

Similar feeling. Seems it is good at certain things and if something doesnt work it want to do things simply and in turn becomes something that you didnt ask for and certain times opposite of what you wanted. On the other hand with codex certain time you feel the AGI but that is like 2 out of 10 sessions. This is primarily may be due to how complete the prompt and how well you define the problems.

poorman

4 months ago

Totally agree. I was just thinking that I wouldn't want this feature for Claude Code but for Codex right now it would be great! I can simply let tasks run in Codex and I know it's going to eventually do what I want. Where as with Claude Code I feel like I have to watch it like a hawk and interrupt it when it goes off the rails.

purnesh

4 months ago

My experience is similar, but for me, Claude Code is still better when designing or developing a frontend page from scratch. I have seen that Codex follows instructions a bit too literally, and the result can feel a little cold.

CC on the other hand feels more creative and has mostly given better UI.

Of course, once the page is ready, I switch to Codex to build further.

citizenpaul

4 months ago

Does no one use Blocks Goose CLI anymore? I went to a hackathon in SF at the beginning of the year and it seemed like 90% of the groups used Goose to do something in their Agent project. I get that the CLI agent scene has exploded since then I just wonder what what is so much better in the competition?

blueside

4 months ago

As we all know here, if the the title of this post was about Codex on the web, the top comment would have been about using Claude instead.

YMMV, but this definitely doesn't track with everything I've been seeing and hearing, which is that Codex is inferior to Claude on almost every measure.

dakom

4 months ago

fwiw I'm happy to see this - been trying to tackle a hairy problem (rendering bugs) and both models fail, but:

1. Codex takes longer to fail and with less helpful feedback, but tends to at least not produce as many compiler errors 2. Claude fails faster and with more interesting back-and-forth, though tends to fail a bit harder

Neither of them are fixing the problems I want them to fix, so I prefer the faster iteration and back-and-forth so I can guide it better

So it's a bit surprising to me when so many people are pickign a "clear winner" that I prefer less atm

dboreham

4 months ago

This is going to be situation normal for 10 years: everyone will need to keep track of "model-du-jour" as each vendor makes incremental improvements.

nadermx

4 months ago

I'm just happy alternatives exist.

pythonbase

4 months ago

What is your general use case with Claude Code / Codex? $70K/year is a significant spend.

013

4 months ago

What are the usage limits for Codex compared to Claude Code?

tonyhart7

4 months ago

is is that better tho????

I thought claude code is still better in tool calling and something like that

asdev

4 months ago

do you use the CLI or the web UI? or both?

jakenuts

4 months ago

Same!

TechDebtDevin

4 months ago

Why waste so much time making your devs dumb. Just code brother, this is the dumbest tech in software right now wasting peoples brains away.

mvkel

4 months ago

This is why Anthropic is a zombie company.

They put all of their eggs in the coding basket, with the rest of their mission couched as "effective altruism," or "safetyism," or "solving alignment," (all terms they are more loudly attempting to distance themselves from[0], because it's venture kryptonite).

Meanwhile, all OpenAI had to do was point their training cannon at it for a run, and suddenly Anthropic is irrelevant. OpenAI's focus as a consumer company (and growing as a tool company) is a safe, venture-backable bet.

Frontier AI doesn't feel like a zero-sum game, but for now, if you're betting on AI at all, you can really only bet on OpenAI, like Tesla being a proxy for the entire EV industry.

[0] https://forum.effectivealtruism.org/posts/53Gc35vDLK2u5nBxP/...

F7F7F7

4 months ago

For non-vibe coding purposes I've found that my $200 Claude (Claude Code) account regularly outperformed my $200 ChatGPT (Codex) account. This was after 2 months of heavily testing both mostly in Terminal TUI/CLI form and most recently with the latest VSCode/Cursor incarnations.

Even with the additional Sora usage and other bells & whistles that ChatGPT @ $200 provides, Claude provides more value for my use cases.

Claude Code is just a lot more comfortable being in your workflow and being a companion or going full 'agent(s)' and running for 30 minutes on one ticket. It's also a lot happier playing with Agents from other APIs.

There's nothing wrong with Anthropic wanting to completely own that segment and not have aspirations of world domination like OpenAI. I don't see how that's a negative.

If anything, the more ChatGPT becomes a 'everything app' the less likely I am to hold on to my $20 account after cancelling the $200 account. I'm finding the more it knows about me the more creeped out and "I didn't ask for this" I become.

mvkel

4 months ago

> There's nothing wrong with Anthropic wanting to completely own that segment and not have aspirations of world domination

It's very clear by their actions (not words) that they are shooting for the moon in order to survive. There is no path to sustainability as a collection of dev tools.

fragmede

4 months ago

Especially now that sama wants us to sext with ChatGPT

simonw

4 months ago

I had a preview of this over the weekend, notes here plus some example PRs: https://simonwillison.net/2025/Oct/20/claude-code-for-web/

It's really solid. It's effectively a web (and native mobile) UI over Claude Code CLI, more specifically "claude --dangerously-skip-permissions".

Anthropic have recognized that Claude Code where you don't have to approve every step is massively more productive and interesting than the default, so it's worth investing a lot of resources in sandboxing.

extr

4 months ago

It’s interesting because I’ve slowly arrived at the opposite conclusion: for much of my practical day to day work, using CC with “allow edits” turned OFF results in a much better end product. I can correct it inline, I pseudo-review the code as it’s produced, etc etc. Codex is better for “fire and forget” features for sure. But Claude remains excellent at grokking intent for problems where you aren’t quite sure what you want to build yet or are highly opinionated. Mostly due to the fact it’s faster and the iteration loop is faster.

simonw

4 months ago

That approach should work well for projects where you are directly working on the code in tandem with Claude, but a lot of my own uses are much more research oriented. I like sending Claude Code off on a mission figure out how to do something.

Here's an example from this morning, getting CUDA working on a NVIDIA Spark: https://simonwillison.net/2025/Oct/20/deepseek-ocr-claude-co...

I have a few more in https://github.com/simonw/research

extr

4 months ago

Very fair. Interesting how much feedback on models/tools is different right now depending on what you're doing.

fragmede

4 months ago

so hey by the way, have you discovered Wispr Flow or something similar so you can talk to your computer like Scotty does?

simonw

4 months ago

Yeah I've tried it a bit it's not a habit for me yet.

I write code on my phone a lot using ChatGPT voice mode though!

conesus

4 months ago

Here's what I did to make voice (now WisprFlow, before Superwhisper) a habit:

  1. Install Karabiner-Elements, a free macOS keyboard remapper[0]
  2. Map F19 -> F5 (mic button) in Karabiner-Elements
  3. Choose F19 as the voice hotkey in your voice app

And now you can use the handy F5 mic button on your Apple keyboard. WisprFlow automatically has it set for:

  - press and hold to talk
  - double tap for indeterminate listening until you f5/esc

That workflow alone, of using the f5 key and switching between the two modes of speaking (holding or double-tap), has freed up a not insignificant part of my working memory. Turning abstract thoughts into text is higher cost than turning them into voice.

I predict individual offices[1] will be more popular as a choice for startups.

[0]: https://karabiner-elements.pqrs.org

[1]: https://queue.acm.org/detail.cfm?id=1281887

fragmede

4 months ago

fwiw, I use the fn/international key at the bottom left of the keyboard. it's easier to locate and I (a privilege I enjoy because I rarely use diacritics) barely use it for anything else.

jcjmcclean

4 months ago

I also use voice mode a lot, I find it's really useful for talking to while you're shaping an idea or an approach, then asking it to summarise the decisions you've made. Essentially rubber ducking.

vidarh

4 months ago

It slows it down far too much for me. What I've found after swithcing to --dangerously-skip-permissions is that while the intermediate work product is often total junk, when I then start writing a message to tell Claude to switch approach, a large proportion of the time it has figured that out by itself before I'm finished writing the message.

So increasingly I let it run, and then review when it stops, and then I give it a proper review, and let it run until it stops again. It wastes far less of my time, and finishes new code much faster. At least for the things I've made it do.

dbbk

4 months ago

Personally I just prefer setting it to TDD. If the test cases are what I want, and the code passes the tests, all's good.

ryoshu

4 months ago

Agreed. I use CC a lot for exploratory work. It's great with fast iteration for throwaway code.

state_less

4 months ago

> it's worth investing a lot of resources in sandboxing.

I tend to agree. There’s an opportunity to make it easy to have Claude be able to test out workflows/software within Debian, RPM, Windows, etc… container and VM sandboxes. This could be helpful for users that want to release code on multiple platforms and help their own training and testing, which they seem to be heavily invested in given all the “How Am I doing?” prompts we’re getting.

username223

4 months ago

Do you have a practical sense of the level of mischief possible in the sandbox? It seems like a game of regexp whack-a-mole to me, which seems like a predictable recipe for decades of security problems. Allow- and deny-lists for files and domains seem about as secure as backslash-escaping user input before passing it to the shell.

simonw

4 months ago

If you configure it with the "no network access" environment there's nothing bad that can happen. Worst is you end up wasting a bunch of CPU cycles in a container somewhere in Anthropic's infrastructure.

Their "restricted network access" setting looks questionable to me - it allow-lists a LOT of stuff: https://docs.claude.com/en/docs/claude-code/claude-code-on-t...

If you configure your own allow-list you can restrict to just domains that you trust - which is enforced by a separate HTTP/HTTPS proxy, described here: https://docs.claude.com/en/docs/claude-code/claude-code-on-t...

adastra22

4 months ago

How do you run a remote LLM with no network access?

simonw

4 months ago

OpenAI Codex, Claude Code for web and Gemini Jules have all managed that.

You use firewalls to prevent code running inside the container from opening network connections to anywhere else. The harness that surrounds it can still be made accessible via the network.

cyrusradfar

4 months ago

great points @simonw - I, incredibly, haven't ever tried --dangerously-skip-permissions yet for any "real" projects. I generally find that it stops itself for good reason.

brynary

4 months ago

The most interesting parts of this to me are somewhat buried:

- Claude Code has been added to iOS

- Claude Code on the Web allows for seamless switching to Claude Code CLI

- They have open sourced an OS-native sandboxing system which limits file system and network access _without_ needing containers

However, I find the emphasis on limiting the outbound network access somewhat puzzling because the allowlists invariably include domains like gist.github.com and dozens of others which act effectively as public CMS’es and would still permit exfiltration with just a bit of extra effort.

minimaxir

4 months ago

Link to the GitHub for the native sandboxing: https://github.com/anthropic-experimental/sandbox-runtime

navanchauhan

4 months ago

I used `sandbox-exec` previously before moving to a better solution (done right, sandboxing on macOS can be more powerful than Linux imo). The way `sandbox-exec` works is that all child processes inherit the same restrictions. For example, if you run `sandbox-exec $rules claude --dangerously-skip-permissions`, any commands executed by Claude through a shell will also be bound by those same rules. Since the sandbox settings are applied globally, you currently can’t grant or deny granular read/write permissions to specific tools.

Using a proxy through the `HTTP_PROXY` or `HTTPS_PROXY` environment variables has its own issues. It relies on the application respecting those variables—if it doesn’t, the connection will simply fail. Sure, in this case since all other network connection requests are dropped you are somewhat protected but then an application that doesn't respect them will just not work

You can also have some fun with `DYLD_INSERT_LIBRARIES`, but that often requires creating shims to make it work with codesigned binaries

joshdev

4 months ago

What is the better solution you’ve moved on to?

navanchauhan

4 months ago

Endpoint Security Extension and Network Extension

kylehotchkiss

4 months ago

Could this be used for Xcode-server? I dont like how it has access to full host filesystem

fragmede

4 months ago

Exfiltration is always going to be possible, the question is, is it difficult enough for an attacker to succeed against the defenses I've put in place. The problem is, I really want to share, and help protect others, but if I write it up somewhere anybody can read, it's gonna end up in the training data.

koolala

4 months ago

The attacker being an LLM where all humans have to be careful what they say publicly online is a fun vector.

merrvk

4 months ago

Nice its in the app, trying it out, seems damn buggy at the moment.

mdeeks

4 months ago

I feel like these background agents still aren't doing what I want from a developer experience perspective. Running in an inaccessible environment that pushes random things to branches that I then have to checkout locally doesn't feel great.

AI coding should be tightly in the inner dev loop! PRs are a bad way to review and iterate on code. They are a last line of defense, not the primary way to develop.

Give me an isolated environment that is one click hooked up to Cursor/VSCode Remote SSH. It should be the default. I can't think of a single time that Claude or any other AI tool nailed the request on the first try (other than trivial things). I always need to touch it up or at least navigate around and validate it in my IDE.

ewoodrich

4 months ago

Right, that is closer to what I was hoping this announcement would be. I really just want a (mobile/web) companion to whatever CLI environment I have Claude Code running in. That would perfectly fill in the exact niche missing in my local dev server VM setup I remote into with any combination of SSH, VS Code Remote, or via Web (VS Code Tunnel from vscode.dev and a ttyd remote CLI session in the browser).

It would be great to be able to check in on Claude on a walk or something to make sure it hasn't gone off the rails or send it a quick "LGTM" to keep moving down a large PLAN.md file without being tethered to a keyboard and monitor. I can SSH from my phone but the CLI ergonomics are ... not great with an on screen keyboard, when all it really needs is just needs a simple threaded chat UI.

I've seen a couple Github projects and "Happy Coder" on a Show HN which I haven't got around to setting up yet which seem in the ballpark of what I want, but a first party integration would always be cool.

Yeroc

4 months ago

I tried Happy Coder for a bit. It seemed exactly what I was missing but about 1/2 the time session notifications weren't coming through and the developers of the tool seem busy pushing it off in other directions rather than in making the core functionality bullet-proof so I gave up on it. Unfortunate. Hopefully something else pops up or Anthropic bakes it into their own tooling.

luisml77

4 months ago

I agree and I also think the problem is deeper than that. It's about not being able to do most code testing and debugging remotely. You can't really test anything remotely really... Its in an ephemeral container without any of your data, just your repo. You can't have the model do npm run dev and browse to see the webpage, click around, etc. You can't compile or run anything heavy, you can't persist data across sessions/days, etc.

I like the idea of background agents running in the cloud but it has to be a more persistent environment. It also has to run on a GUI so it can develop web applications or run the programs we are developing, and run them properly with the GUI and requiring clicking around, typing things etc. Computer use, is what we need. But that would probably be too expensive to serve to the masses with the current models

daxfohl

4 months ago

Definitely sounds cool. But the problem hasn't even been solved locally yet. Distributed microservices, 3rd party dependencies, async callbacks, reasonable test data, unsatisfiable validations, etc. Every company has their own hacked together local testing thing that mostly doesn't work.

That said, maybe this is the turning point where these companies work toward solving it in earnest, since it's a key differentiator of their larger PLATFORM and not just a cost. Heck, if they get something like that working well, I'd pay for it even without the AI!

Edit: that could end up being really slick too if it was able to learn from your teammates and offer guidance. Like when you're checking some e2e UI flows but you need a test item that has some specific detail, it maybe saw how your teammate changed the value or which item they used or created, and can copy it for you. "Hey it looks like you're trying to test this flow. Here's how Chen did it. Want me to guide you through that?" They can't really do that with just CLI, so the web interface could really be a game changer if they take full advantage of it.

mdeeks

4 months ago

What you're describing feels like the next major evolution and is likely years away (and exciting!).

I'm mainly aiming for a good experience with what we have today. Welding an AI agent onto my IDE turned out to be great. The next incremental step feels like being able to parallelize that. I want four concurrent IDEs with AI welded onto it.

user

4 months ago

[deleted]

luisml77

4 months ago

Exactly, I want to go to sleep knowing I have an AI working in a computer developing my project. Then wake up to the finished website/program, fully tested top to bottom backend frontend UI etc.

elpakal

4 months ago

> PRs are a bad way to review and iterate on code

idk, we’ve (humans) gotten this far with them. I don’t think they are the right tool for AI generated code and coding agents though, and that these circles are being forced to fit into those squares. imho it’s time for an AI-native git or something.

mdeeks

4 months ago

PRs work well for what they are. Ship off some changes you're strongly confident about and have another human who has a lot of context read through it and double check you. It's for when you think you've finished your inner loop.

AI is more akin to pair programming with another person sitting next to you. I don't want to ship a PR or even a branch off to someone sitting next to me. I want to discuss and type together in real time.

sails

4 months ago

Agree, each agent creating a PR and then coordinating merges is a pain.

I’d like

- agent to consolidate simple non-conflicting PRs

- faster previews and CI tests (Render currently)

- detect and suggest solutions for merge conflicts

Codex web doesn’t update the PR which is also something to change, maybe a setting, but for web Code agents (?) I’d like the PR once opened to stay open

Also PRs need an overhaul in general. I create lots of speculative agents, if I like the solution I merge, leading to lots of PRs

archon810

4 months ago

Thank you. Every time these agentic cloud tools come out, I wonder to myself whether I'm not using them right or misunderstand vs, say, local Cursor development paradigm.

Plus they generate so much noise with all the extra commits and comments that go to everyone in slack and email rather than just me.

icelancer

4 months ago

I just run the agent directly on separate testing/dev servers via remote-ssh in VS Code to have an IDE to sanity check stuff. Just far simpler than local dev and other nonsense.

cyrusradfar

4 months ago

this is a great point. The inner / outer loop is big. I think AI pushing PRs is kind of like pushing drafts to the public in social media. I don't want folks seeing PRs and such until I feel good about it. It adds a lot of noise, and increases build costs unless your CI/CD treats them differently which I don't know anyone doing.

justinram11

4 months ago

Have you checked out Ona [1] (gitpod's pivot)?

[1] https://ona.com/

mdeeks

4 months ago

This is possibly what I want? It's hard to tell from all of the marketing on the site.

I want to run a prompt that operates in an isolated environment that is open in my IDE where I can iterate with the AI. I think maybe it can do this?

simonw

4 months ago

Not quite. This doesn't (yet) have an option where you can connect your local IDE to their remote containers to edit files directly. It's more of a fire-and-forget thing where you can eventually suck the resulting code down to your local machine using "claude --teleport ..." - but then it's not running in the cloud any more.

tomvault

4 months ago

CEO at Ona (formerly Gitpod) here. Every ephemeral environment Ona creates can directly connect to your Desktop IDE for easy handoff. Our team goes from prompt -> iterating in conversation -> VS Code Web -> VS Code Desktop/Cursor depending on task complexity and quality of the agent output. We call this progressive engagement and have written about it here https://ona.com/docs/ona/best-practices#progressive-engageme...

mdeeks

4 months ago

Thanks, I'll give it a shot. I wish your site would show me what it actually looks like. It's a lot of words and fancy marketing images and I have no feel for the product. It leaves me unsure if I should invest my time.

I'd love to see a short animation of what it would actually look like to do the core flow. Prompt -> environment creation -> iterating -> popping open VSCode Web -> Popping open Cursor desktop.

Also, a lot of the links on that page you linked me to are broken:

  * "manual edits and Ona Agents is very powerful." 
  * "Ona’s automations.yaml extends it with tasks and services"
  * "devcontainer.json describes the tools"

mdeeks

4 months ago

I signed up and tried it with Cursor. It is very close, but still has a lot of rough edges that make it hard to switch:

  * Once in Cursor I can't click on modified files or lines and have my IDE jump to it. Very hard to review changes.
  * I closed the Ona tab and couldn't figure out how to get it back so I could prompt it again.
  * I can't pin the Ona tab to the right like Cursor does
  * Is there a way to select lines and add them to context?
  * Is there a way I can pick a model?

pdntspa

4 months ago

Yes but pointy-haired bosses are much more amenable to the sales pitch of, "insert story, receive PR"

asdev

4 months ago

so the biggest issue is having to pull down and manually edit changes? can't you just @claude on the PR to make any changes?

mdeeks

4 months ago

Yes, but my point is often times I don't want to. Sometimes there are changes I can make it seconds. I don't want to wait 15+ seconds for an AI that might do it wrong or do too much.

Also it isn't always about editing. It is about seeing the surrounding code, navigating around, and ensuring the AI did the right thing in all of the right places.

TechDebtDevin

4 months ago

Huge waste of time. You are being sold a bill of goods whose only purpose is to make you a dumb dev. Like woah, an llm can use cdp!! Who cares. Cant wait till people start waking up to this grift. These things are making people so dumb and a few richer, thats it.

cindyllm

4 months ago

[dead]

kanjun

4 months ago

Hey, Kanjun from Imbue here! This is exactly why we built Sculptor (https://imbue.com/sculptor), a desktop UI for Claude Code.

Each agent has its own isolated container. With Pairing Mode, you can sync the agent's code and git state directly into your local Cursor/any IDE so you can instantly validate its work. The sync is bidirectional so your local changes flow back to the agent in realtime.

Happy to answer any questions - I think you'll really like the tight feedback loop :)

jackconsidine

4 months ago

> We were heavy users of Claude Code ($70K+ spend per year) and have almost completely switched to codex CLI

Seeing comments like this all over the place. I switched to CC from Cursor in June / July because I saw the same types of comments. I switched from VSCode + Copilot about 8 months before that for the same reason. I remember being skeptical that this sort of thing was guerilla marketing, but CC was in fact better than Cursor. Guess I'll try Codex, and I guess that it's good that there are multiple competing products making big strides.

Never would have imagined myself ditching IDEs and workflows 3x in a few months. A little exhausting

rorads

4 months ago

I think it’s a lot less exhausting now that the IDE part is mostly decoupled. I can’t imagine cursor continuing to compete when really all they’re doing is selling tokens either a markup, and hence crushing your context on every call. Sorry if that sounds negative but it’s true.

I use CC and codex somewhat interchangeably, but I have to agree with the comments. Codex is a compete monster, and there really isn’t any competition right now.

grrowl

4 months ago

OpenAI seems to limit how "hard" your gpt-5-codex can think depending on your subscription plan; whereas Anthropic/Claude only limits how much use you get. I evaluate Codex every month or so with a problem suited to it, but rarely gets merged over a version produced by Charlie (which yes is $500/mo, but rarely causes problems) or something Claude did in a managed or unmanaged session. ymmv

ea016

4 months ago

No relations to them, but I've started using Happy[0]'s iOS app to start and continue Claude Code sessions on my iPhone. It allows me to run sessions on a custom environment, like a machine with a GPU to train models

[0] https://github.com/slopus/happy/

hmokiguess

4 months ago

This seems to be the only solution still if using bedrock or direct API access instead of Pro / Max plan, the Claude Code for Web doesn't seem to let you use it that way.

didgeoridoo

4 months ago

You can log in to your CC instance however you like, including via Pro/Max. Happy just wraps it and provides remote access with a much better UI than using a phone-based terminal app.

hmokiguess

4 months ago

Yes, that's precisely what I meant! I was talking with regards to the parent article about Claude Code on the Web via Anthropic.

TechDebtDevin

4 months ago

Are you people just lighting money on fire? What could you possibly get done via a phone that is meaningful.

yoavm

4 months ago

I was just working on something similar for OpenCode - pushing it now in case it's useful for someone[0].

It can run in a front-end only mode (I'll put up a hosted version soon), and then you need to specify your OpenCode API server and it'll connect to it. Alternatively, it can spin up the API server itself and proxy it, and then you just need to expose (securely) the server to the internet.

The UI is responsive and my main idea was that I can easily continue directing the AI from my phone, but it's also of course possible to just spin up new sessions. So often I have an idea while I'm away from my keyboard, and being up able to just say "create an X" and let it do its thing while I'm on the go is quite exciting.

It doesn't spin up a special sandbox environment or anything like that, but you're really free to run it inside whatever sandboxing solution you want. And unlike Claude Code, you're of course free to choose whatever model you want.

[0] https://github.com/bjesus/opencode-web

fny

4 months ago

I've been using Happy Coder[0] for some time now on web and mobile. I run it `--yolo` mode on an isolated VM across multiple projects.

With Happy, I managed to turn one of these Claude Code instances into a replacement for Claude that has all the MCP goodness I could ever want and more.

[0]: https://happy.engineering/

ShipEveryWeek

4 months ago

This looks nice! I’ve been using terminus + tailscale to get similar results, but I’ll give this a go

nojs

4 months ago

This is going to be extremely useful. A lot of people have hacked together similar things to get around waiting for CC to finish without mangling worktrees and branches manually.

I was curious how the 'Open in CLI' works - it copies a command to clipboard like 'claude --teleport session_XXXXX', which opens the same chat in the CLI, and checks out a new branch off origin/main which it's created for the thread, called 'claude/feature-name-XXXXX'.

I prefer not to use CC at the 'PR level' because it still needs too much hand-holding, so very happy to see that they've added this.

Update: Session titles are either being leaked between users or have a very bad LLM writing them. I'm seeing "Update Ton Blockchain Configuration" and "Retrieve Current PIN Code" for a project that has nothing to do with blockchain or PIN codes...

kofman

4 months ago

Our title generation is done preemptively as you type and is a bit too eager to declare it has enough context. We’ll tweak it.

bonesss

4 months ago

Sounds like a fully functioning feature to me: “make up a task to force the project managers eyes to glaze over” ;)

“Quick, Claude, what have I supposedly been WorkingOn all morning? … ‘Blockchain Token Configuration Update’, perfect!”

Redster

4 months ago

Here's the link talking about the sandbox environment and features they're using for this Claude Code. https://www.anthropic.com/engineering/claude-code-sandboxing

hmokiguess

4 months ago

Soon —> https://xkcd.com/2044/

charlesabarnes

4 months ago

It's pretty frustrating that every release is IOS first without any timeline or expectation for Android

outime

4 months ago

This may explain it: https://9to5mac.com/2023/09/06/iphone-users-spend-apps/

poly2it

4 months ago

It is also relevant to know if a user who'd otherwise use app X on iOS would use X less on Android.

rldjbpin

4 months ago

this makes sense for apps where you pay up front to use it. for subscription-based services, this idea falls flat.

in fact, apple made it harder for apps to take payments from its users in the past than others.

alwillis

4 months ago

Not unusual; most high profile apps ship on iOS first, going back to Instagram [1], which was released October 10, 2010. Instagram shipped their Android version 1.5 years later.

[1]: https://www.techtarget.com/searchcio/definition/Instagram

spondyl

4 months ago

Another, not incompatible explanation is that it's also just easier to develop for a handful of known iOS/iPadOS targets compared to Android's unbounded set of screen sizes and device specs.

wahnfrieden

4 months ago

If your app runs on iPadOS, you already need to support every "screen size" (window size)

Android is simply a much worse platform to make money on. Users spend <25% as much as iOS users. Why would they prioritize that?

djmips

4 months ago

In practice Android is much more difficult to handle the myriad of offerings - Have you ever tried both? To your other point, what app spend would Anthropic be worried about - they have a subscription model.

wahnfrieden

4 months ago

They sell through the app, too. And Android users are just as unlikely to spend outside of apps as they are inside them. Android deprioritization is a business decision, not a technical complexity decision.

mh-

4 months ago

Anthropic supports in-app purchases for Claude subscriptions, at least in the US.

pjmlp

4 months ago

Because 70% of the mobile phone world runs on Android.

It is like trying to make a living selling games to macOS users.

bapak

4 months ago

Please take a look at the percentage of paying Android users. It just does not compare. It's useless to count 2 billion users in third world countries who never have and never will pay anything in-app.

wahnfrieden

4 months ago

Users spend <25% as much as iOS users, and less than half in total despite larger user counts (having double the users who spend <25% each does not add up!), a gap that widens year over year. Why would they prioritize that?

Why would they care about prioritizing users who spend much less? Android pays <25% per user. You need a LOT more than 70% to make that worth prioritizing. Those users are just going to eat up free tier resources without paying. It's borderline parasitic from a business perspective.

Android users are more likely to be useful for spreading word-of-mouth reputation to Apple platform users, than they are as direct spenders. Just another reason to ensure Apple platform features don't trail Android.

pjmlp

4 months ago

iOS/iPadOS aren't exactly the same, without bothering to count, there are about 10 screen sizes to account for, and Apple contrary to Android world, doesn't have somethine like JetPack, either the user updates their phone or there are no new features for the apps to rely on.

pjmlp

4 months ago

It is basically a US centric view of mobile OS market share.

aaronbrethorst

4 months ago

Anthropic is a US-based company.

pjmlp

4 months ago

Some companies would rather have a more international user base.

https://gs.statcounter.com/os-market-share/mobile/worldwide

So maybe they rather please the home market, I guess.

OJFord

4 months ago

With a global market and extant user base.

richardw

4 months ago

It’s much harder dealing with all the complexities of different devices, screen sizes, OS versions.

https://www.reddit.com/r/applesucks/comments/1k6m2fi/why_do_...

bahmboo

4 months ago

Anthropic and Apple have a strategic partnership. It's a bit dicey but still seems to be in play. Which is interesting considering Google is a major investor and Apple is not. Anthropic wants Apple as a paying customer. Apple wants them to bend the knee.

lvl155

4 months ago

Apple also has relationship with OAI. They’re not preferential.

bahmboo

4 months ago

Yes but the question was why Anthropic is showing more attention to iOS vs Android.

wahnfrieden

4 months ago

Android is a tiny market

OJFord

4 months ago

You probably mean 'in the US', where iOS is 58%. Android has a 71% global market share.

bdcravens

4 months ago

Yes, if all you consider are the number of devices in use. However once you segment by devices with performance to run a given app and financial demographics that match your target customer, the numbers change.

wahnfrieden

4 months ago

No. Why do user counts matter? High user count but with >4x thriftiness / aversion to spending is not an attractive market over iOS.

Globally in dollars spent, not human heads. iOS is over 2x larger than Android globally, and the gap is widening year over year.

iOS spending growth outpaces Android, which even shrunk during covid while iOS spending continued to grow

https://api.backlinko.com/app/uploads/2024/03/iphone-vs-andr...

Anthropic makes money off product sales, not ad revenue, so wallets count more than eyes for this. Free users who are less than 25% as likely to spend are a burden not to be prioritized for a product business with free tier access. They need to spend much more to get a paying user on Android.

If Android were the bigger market, they'd prioritize it

teunlao

4 months ago

Been using both daily for three months. Different tools for different jobs.

Claude Code has better UX. Period. The permission system, rollbacks, plan mode - it's more polished. Iterative work feels natural. Quick fixes, exploratory coding, when I'm not sure exactly what I want yet - Claude wins.

Codex is more reliable when stakes are high. Hard problems. Multi-file refactors. Complex business logic. The model just grinds through it. Less hand-holding needed.

Here's the split I've landed on - Claude for fast iteration tasks where I'm actively involved. Codex for delegate-and-walk-away work that needs to be right first time.

Not about which is "better" - wrong question. It's about tooling vs model capability. Claude optimized the wrapper. OpenAI optimized the engine.

pimterry

4 months ago

Personally, my one annoyance here is that it requires you to install a GitHub App that gives it direct write permissions to all code in your repos (in addition to issues, PRs, etc).

I'd much rather give it read permissions, have it work in its own clone, and then manually pull changes back through (either with a web review UI somehow, or just pulling the changes locally). Partly for security, partly just to provide a good review gate.

Would also allow using this with other people's repos, where I _can't_ give write permissions, which would be super helpful for exploring dependency repos, or doing more general research. I've found this super helpful with Claude Code locally but seems impossible on the web right now.

jryio

4 months ago

Pair programming is still one of the best ways to knowledge transfer between two programmers in a high throughput manner. Humans learn by doing, building synaptic connections.

I wonder if a shared Claude Code instance has the same effect?

dingnuts

4 months ago

The person driving is the one that learns the most in pair programming. In the scenario you've described, that would be Claude. LLMs don't learn.

Doesn't CC sometimes take twenty, thirty minutes to return an attempt? I wouldn't know, because I'm not rich and my employer has decided CC is too expensive, but I wonder what you would do with your pair programming partner while you wait.

The bosses would like to think we'd start working on something else, maybe start up a different Claude instance, but can you really change contexts and back before the first one is done? You AND your partner?

Nah, just go play air hockey until your boss realizes Claude is what they need, not you.

myko

4 months ago

> Nah, just go play air hockey until your boss realizes Claude is what they need, not you.

This is a depressing comment.

I am apprehensive about the future of software development in this milieu. I've pumped out a ~15,000 line application heavily utilizing Claude Code over a few days that seems to work, but I don't know how much to trust it.

Certainly part of the fun of building something was missing during that project, but it was still fun to see something new come to life.

Maybe I should say I am cautiously optimistic but also concerned: I don't feel confident in the best ways to use these tools to build good software, and I'm not sure exactly what skills are useful in order to get them there.

losteric

4 months ago

> I've pumped out a ~15,000 line application heavily utilizing Claude Code over a few days that seems to work, but I don't know how much to trust it.

Can I ask what you built?

myko

4 months ago

https://github.com/Chuntttttt/TapeDeck

There was a post recently where someone linked to: https://simplyexplained.com/blog/how-i-built-an-nfc-movie-li...

and I thought the project was amazing, but I didn't like how the IDs were managed in yml, so I built this to make it more dynamic. I plan to add support for other smart home automations with it as well as more streaming services.

One of the features I really like about it is it makes it easy to print and cut out stickers to slap on the NFC cards for playing media.

My toddler loves it so far and one of his friend's has asked me to make one for him as well

lazerwalker

4 months ago

I have criticisms of both tools like Claude Code and how applicable the 'pair programming' metaphor is here, but strong disagree that the person driving during pairing is the one who learns the most (or perhaps the implied "and the non-driver doesn't learn enough"). A good dynamic pairing session is equally valuable for both participants, even if there's a skill gap, and even if you're not alternating drivers as often as you should.

astrange

4 months ago

You can get plenty of CC on a $20/month plan.

user

4 months ago

[deleted]

mr_mitm

4 months ago

Just for the record, CC is about the cost of a Netflix subscription, and it responds faster than any human can.

dgunay

4 months ago

I haven't seen anyone talking about using agents in this way. I wonder if it would be helpful for learning e.g. a new language or codebase to have the human write the code while the agent takes the role of the "backseat driver" in the pair programming dynamic.

artdigital

4 months ago

So is this their version of Jules / Codex / Copilot agent? Aka autonomous agent in the cloud you give a task and it spits out a PR a bit later?

It’s interesting how all the LLMs slowly end up with the same feature set and picking one really ends up with personal preference.

Me as a dev am happy that I now have 4 autonomous engineers that I can delegate stuff to depending on task difficulty and rate limits. Even just Copilot + Codex has made me a lot more productive

Also rip to all the startups that tried to provide “Claude in the cloud”, though this was very predictable to happen

ubj

4 months ago

Very curious to see what usage limits are like for paid plans. Anthropic was already experiencing issues with high-volume model usage for Pro and Max users. I hope their infrastructure is able to adequately support running these additional coding environments on top of model inference.

Just to be clear, I'm excited for the capability to use Claude Code entirely within the browser. However, I've heard reports of Max users experiencing throttled usage limits in recent months, and am concerned as to whether this will exacerbate that issue or not.

CharlesW

4 months ago

Anecdotally, as a Max user typically using Claude Code for >8 hours/day, I've never experienced that. That said, I'm not one of those people using Opus for everything, and in fact I've been happy using Sonnet 4.5 even for planning.

user

4 months ago

[deleted]

minimaxir

4 months ago

I suspect the release of Claude Haiku 4.5 was done to help reduce usage costs for Anthropic and any use of Claude Code will differ to it if capacity is limited.

EDIT: I had meant defer which is the first time I've made a /r/boneappletea in awhile

chrisweekly

4 months ago

"differ"? did you mean "default"?

scubbo

4 months ago

I imagine "defer"

martypitt

4 months ago

It's interesting how most of these tools are (exclusively) Github.

We're on Gitlab for historic reasons. Where Github now has numerous opporuntities to use AI as part of your workflow, there's nothing in Gitlab (from what I can tell), unless you're paying big bucks.

I like using AI to boost my productivity. I'm surprised that that'll be the thing that makes me migrate to Github.

brainless

4 months ago

It is not easy to support multiple providers at each layer of our tech stack but if we do not then a few players become easier to pick and then they monopolize.

I am trying to stay vendor neutral with my own coding agent (1). To approach this, I created a desktop app that connects to coding agent running on either my own infra or yours (local or your cloud server). Desktop app and coding agent are separate binaries.

If you host on your own infra then you can bring your own AI provider too. Similarly, I want to give choice for git host. Right now I am targetting GitHub but I want to add Gitlab soon after MVP. All this has made the path to my MVP longer but I see a clear long-term aim for myself - we should have choices.

1. https://github.com/brainless/nocodo

neilv

4 months ago

Nit about doing your AI interfaces on the Web: I really want claude.ai and chatgpt.com to offer a standard username+password login without 2FA. The kind my privacy-friendly browser of short-lived sessions can complete in a couple clicks, like for most other SaaSes, and then I'm in and using the tool.

I don't want to leak data either way by using some "let's throw SSO from a sketchy adtech company into the trust loop".

I don't want to wait a minute for Anthropic's login-by-email link, and have the process slam the brakes on my workflow and train of thought.

I don't want to wait a minute for OpenAI's MFA-by-email code (even though I disabled that in the account settings, it still did it).

I don't want to deal with desktop clients I don't trust, or that might not keep up with feature improvements. Nor have to kludge up a clumsy virtualization sandbox for an untrusted client, just to ask an LLM questions that could just be in a Web browser.

linkregister

4 months ago

In the modern age of mass credential stuffing attacks exploiting password reuse, MFA is one of the most effective tools for reducing unauthorized logins. Companies that don't adopt it are risking unacceptably high levels of credit card chargebacks.

I wish the standard were for companies to check new passwords against leaked password lists, e.g. what https://haveibeenpwned.com uses.

I use a similar workflow and have found that websites that allow passkey-based login can avoid the friction of waiting for TOTP codes or magic links.

amluto

4 months ago

How about using supporting WebAuthn?

The current claude.ai signin mechanism is rather annoying.

jngiam1

4 months ago

I got so used to having Claude Code read some of my MCP tools, and was bummed to see that it couldn't connect to them yet on the web.

Pretty cool though! Will need to use it for some more isolated work/code edits. Claude Code is now my workhorse for a ton of stuff including non-coding work (esp. with the right MCPs)

mrcwinn

4 months ago

We’re moving almost entirely to Codex, first because often it’s just better, and second because it’s much cheaper. It’s a bet that they’re better now, but given capacity and funding, they’ll be better later too.

The only edge Claude has is context window, which we do sometimes hit, but I’m sure that gap will close.

esafak

4 months ago

You're using the metered API rather than a subscription, right?

asdev

4 months ago

are you using the web ui, cli or both?

cube2222

4 months ago

This is quite nice!

I'm using Claude Code locally a lot, occasionally with a couple parallel session.

I was very happy when they made the GitHub Action - I used it quite a bit, but in practice I got frustrated that I effectively only get a single back-and-forth out of it, I can't really "continue the conversation without losing context" - Sure, I can respond to it in the PR it makes, but that will be a fresh session with a fresh empty context.

So, as much as I don't like moving out of my standard development workflow with my tools, I think this could be quite useful. The ability to interrupt and/or continue a conversation should be very nice.

My main worry is - usually my unit tests and integration tests rely on a postgres database running on the machine, and it's not obvious to me if I can spin that up here?

GreekPete

4 months ago

https://docs.github.com/en/actions/tutorials/use-containeriz...

cube2222

4 months ago

I'm not sure how this applies? We're talking about the "Claude Code on the Web" custom sandbox, not running Claude Code in GitHub Actions.

radial_symmetry

4 months ago

Check out Crystal if you want a good tool for managing parallel sessions locally https://github.com/stravu/crystal

kofman

4 months ago

You can ask Claude to install Postgres and it should just work. We'll have it. In the default image shortly.

anon3459

4 months ago

Use pglite

lysecret

4 months ago

Just played around with it the fact it’s on the phone is a big bonus.

I have setup a little workflow where given linear tags it sets up a work tree on my dev box installs deps and starts the implementation so I can take it over I prefer this workflow to the fully managed cloud based solutions.

This kind of fits in for issues where I’m basically sure I won’t have to take it over (and it can do it fully on its own). Which aren’t that many.

Very simple example there was a warning pop up on something where I thought there shouldn’t be now it’s done fully automatically from my phone in 5 mins. I quite like that these small changes become so easy.

shireboy

4 months ago

I really want this but for Azure Devops. If you're not familiar, Microsoft owns both Github and Azure Devops, and both do similar: git repos and project management. I can use Github Copilot, Claude Code CLI, etc. against code on my disk, including Azure Devops MCP. But what I can't easily do is like Github Copilot Agent and apparently this Claude Code on Web: Assign a ticket to @SomeAi and have a PR show up in a few minutes. Can't change to github for _reasons_.

Would love any suggestions if anyone in a similar story.

jimmydoe

4 months ago

ask your boss to read between the lines of this blog post:

https://developer.microsoft.com/blog/azure-devops-with-githu...

(if they want employee to use more AI, ditch ADO, embrace GitHub)

ed_mercer

4 months ago

Is CC on the web able to spawn local containers? I would need to spawn a half dozen services locally in order to have a proper simulation of my actual working environment. Tool calling and integration with various microservices (e.g. postgres, playwright) is one of the most important uses of CC for us. For example, after telling CC to implement a feature, it needs to test that feature and confirm that any database changes are the way they're supposed to.

arjie

4 months ago

A thing I really like with Claude Code is how well it uses the bash scripts you give it. I also have a browser control MCP installed and it's pretty good for it to full-cycle around the approach. I have a staging database that it has the passwords to that it logs in and runs queries on. This whole thing means it loops and delivers good results for me.

I'll try this, but the grounding seems crucial for these LLMs to deliver results that are fewer shot than otherwise.

hugs

4 months ago

which specific functions/features of the browser control MCP do you lean on the most?

arjie

4 months ago

I don't use it myself so to speak, except to fill in some things sometimes like passwords. The LLM is the user. It just uses the primitives it has (these are my paraphrases): scroll_to, expand_viewport, screenshot, select_dom_element, fill_input. This way I can tell it to implement a feature and verify it and it does so in a Google Chrome testing profile. Without the grounding, I've noticed that LLMs often produce "code that should work" but then something else is missing. This way, by the time I see it, the feature works.

I then have to go in and advise it on factoring and things like that, but the functionality itself is present and working.

r0x0r007

4 months ago

Wow, so nice! Now I can read hacker news, watch youtube shorts and solve tickets and add new features at the same time! What could go wrong!? Thanks AI!

dysoco

4 months ago

So from what I can understand this is only meant to be used with Claude-hosted sandbox environments?

Wouldn't work for my case since I need a lot of HDD space, GPUs etc. to run the thing I'm working on, but it would be great if I could run a Claude Code server in my server, expose the port and then connect via web or iOS interface.

Sure I can use tmux/ssh but it's very impractical specially in mobile.

jimmydoe

4 months ago

    total        used        free      shared  buff/cache   available
    Mem:            13Gi       306Mi        12Gi          0B       126Mi        12Gi
    Swap:             0B          0B          0B

the sandbox has ~12G RAM, but no docker or podman allowed.

unfortunately it doesn't work for me as I need docker compose or equivalent to fire up some env for local test

mholubowski

4 months ago

I’d really appreciate an explanation of this:

How does Codex / Claude Code compare to working within Cursor with the chat and agents? Are they effectively the same thing?

Is one significantly better than the other. Please share your experiences around this I’m trying to be ass effective of an engineer as I can be at our company. - Mike

nextworddev

4 months ago

Developers may want to deny this, but it's getting dangerously close to maybe replacing 30% of developers

simonw

4 months ago

I continue to believe that making developers 2-3x times more productive makes those developers 2-3x more valuable, and the smart thing for companies to do is to take on 2-3x times the amount of work, or hire MORE developers and finally start crunching through their inevitably years-long backlogs.

nextworddev

4 months ago

Your view doesn’t mesh with conversations I have had with most C-suite. Most firms outside of SV are seeing opportunities for cost reduction mostly.

And you can think through with first principles to see why it won’t expand developer hiring. Since AI progress is jagged, some industries will be affected in outsized ways while others may thrive more. But the increase in demand from new industries won’t absorb the reduction in demand from disrupted industries.

minimaxir

4 months ago

I like how in the demo video there's a squiggle emphasis on Claude's "Good Idea!" in response to a user clarification, when it's more common among vibe coders that that less glazing is better and they just want the LLM to write code.

robertwt7

4 months ago

This is very similar to Jules by Google! https://jules.google/

Although I wish that the performance of Jules is worse than Gemini CLI. I hope that this is as good as the Claude Code CLI.

lukaslalinsky

4 months ago

I wish this was integrated with GitHub actions, as there I can configure the environment, give it access to tools. The GitHub Actions integration is already fairly good, but having this interactive web UI would be perfect.

bgirard

4 months ago

Looks promising.

I got my environment working well with Codex's Cloud Task. Trying to same repo with Claude Code Web (which started off with Claude Code CLI mind you), and the yarn install just hangs with no debuggable output.

qwertox

4 months ago

I wish that a "Claude Code"-session (and a per-project-id session) could present itself in Claude Web in menu entries to create new chats which participate in the selected Claude Code session.

insane_dreamer

4 months ago

I can already run multiple parallel tasks with Claude in multiple terminal windows, with git worktrees if working on the same repo. So I don't really understand the use case for CC on the web.

scamaltman

4 months ago

How many comments in this topic are written by codex bot already ?

jakenuts

4 months ago

Terragon Labs does this but you can use both CC and Codex, it's a fantastic workflow for coding agents and GitHub all on your iPhone or desktop.

jannniii

4 months ago

I’m wondering if it would be possible to use the new skills feature or agents with this. Without the agents or the skills, I don’t know how useful this would be.

simonw

4 months ago

It's running Claude Code CLI on a container for you, so skills should just work. I've not tried them myself yet though.

cesarvarela

4 months ago

Does this work inside docker containers like Codex? Stuff like `testcontainers` is unusable with that architecture because you need access to docker itself.

lysecret

4 months ago

Yea it failed on testcontainers for me. The pnpm install worked fine though.

_pvzn

4 months ago

This is kind of nice, as much as I love a good TUI, sometimes text editing in claude code can trip me up compared to a web GUI

jakebasile

4 months ago

Both Anthropic and OpenAI have something like this now and neither bothered to implement a delete feature.

Google’s version Jules has one.

witnessme

4 months ago

Claude team has been killing it with the new impressive releases since last week. And this one looks most promising.

idk1

4 months ago

This is off topic, but can anyone tell me what the genre of music is to the video on this?

low_tech_punk

4 months ago

IMHO, parallel tasks across multiple repos is not as useful as parallel tasks in one repo.

CSMastermind

4 months ago

The inabillity to set up an environment with just full internet access is annoying.

kofman

4 months ago

You can specify * as the url whitelist. We’ll have a drop down for it shortly.

jzig

4 months ago

Does the feature need to be toggled on somewhere? I don’t see it on web nor iOS.

kofman

4 months ago

It’s currently only available to max and pro users. It will be available to enterprise and teams users soon!

TechDebtDevin

4 months ago

Literally could have stopped at claude 3.5 and nothing would be different.

mkummer

4 months ago

Is the web interface open sourced anywhere? Looks great, excited to try it out

hnidiots3

4 months ago

I wonder why people don’t just use Amp Code and use the Oracle.

It’s Sonnet 4.5 + GPT-5 working together.

Codex just isn’t as good as people make it out to be. OpenAI seems to train on a lot of JavaScript/Tailwind to make visuals look more impressive but when it comes to actual backend work it just fails more than it succeeds. Sonnet is much better at chewing through tasks and GPT 5 is great at consulting planning and analysis.

Using Amp and asking it to check everything with the oracle leads to superior results.

But no one on HN has heard of it. I’m guessing HN hates twitter?

Phlogistique

4 months ago

Because Claude is 20 bucks a month, Codex is 20 bucks a month, and any pay by token plan is way more expensive.

SalmoShalazar

4 months ago

Not sure what your twitter comment is about, I use it and I’ve just never heard of this product. Looks cool, I will give it a test.

hnidiots3

4 months ago

Might be who you follow but it seems everyone is talking about it more and more but I never see it mentioned on HN at all. So I figure the AI enthusiasts here aren’t on Twitter. (Just an assumption)

Worth trying out. The free version doesn’t have the oracle so I use the paid version.

arianvanp

4 months ago

The way network Access works really feels weird to me. I wish that instead i could just do it like (or with!) nix. If i know the hash of the thing I'm fetching from the network, allow me access to it. Instead of arbitrarily allow listing domains.

Imagine if this would just be able to use your nix file in your repo to fetch all the dependencies needed to run your project. That'd be extremely sick

kofman

4 months ago

Thanks for the suggestion! We’re working on making this smoother.

BohdanPetryshyn

4 months ago

They didn't even reviewed their own PR in the demo video :\

Stevvo

4 months ago

Guess they couldn't name it "Claude Codex"

aantix

4 months ago

Does this web interface have support for AWS Bedrock?

mvandermeulen

4 months ago

Unfortunately Anthropic have completely lost my trust. It’s very unlikely that I will ever return to purchasing from a company that behaves in the manner in which they do.

_ink_

4 months ago

Why?

kelvinjps10

4 months ago

I was hoping that it would work with the API.

user

4 months ago

[deleted]

user

4 months ago

[deleted]

retrocog

4 months ago

My productivity is exploding!

lvl155

4 months ago

I am not a big fan of these. They’re trying to bundle compute and jack up the prices down the road.

jimmydoe

4 months ago

not sure why you got downvoted but adding all these bells and whistles are the way to increase the price, which in turn justify the huge investment.

without 200% price increase in 3 years, there's no way any of these AI companies will survive.

kingstnap

4 months ago

The YouTube demo is hilarious.

https://youtu.be/s-avRazvmLg?si=eQqY6w8kbxv3TFhQ

The dev just types in a prompt, scrolls down the bottom and makes a PR asking others to review without even looking at what they just did.

Lmao. Know their target market for sure.

wolfgangbabad

4 months ago

Codex is the way.

rounakdatta

4 months ago

Now that this hosted CC is achieved, next up, I think scheduled workflows would be coming. For example, certain open-source repositories host data files scraped regularly from sources via scheduled GitHub Actions, that could be simplified.

cw00h

4 months ago

[dead]

kdy1

4 months ago

[dead]

outofsafezone

4 months ago

[dead]

bgwalter

4 months ago

I have never seen such a bunch of uncreative people who have never written a real application, never done anything artistic, never said anything intelligent try to ruin software development to the extent that the "AI" companies do.

They want to turn everything into a bootstrap framework, which is probably the limit of their mental horizon. And many people maintain that the emperor is fully clothed and that the scam works.

cheema33

4 months ago

Your comment has a bit of the "old man screams at clouds" feel to it.

bgwalter

4 months ago

You're absolutely right and thank you for your correction! Clouds are a legitimate IT business, and it would be a factual error to dismiss them. I apologize for the error.

Would you like me to print a chart of the growth of cloud businesses from 2010-2025?