hackernews client

GPT-5.3-Codex

1497 pointsposted 2 days ago

589 Comments

Rperry2174

2 days ago

Whats interesting to me is that these gpt-5.3 and opus-4.6 are diverging philosophically and really in the same way that actual engineers and orgs have diverged philosophically

With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.

With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.

that feels like a reflection of a real split in how people think llm-based coding should work...

some want tight human-in-the-loop control and others want to delegate whole chunks of work and review the result

Interested to see if we eventually see models optimize for those two philosophies and 3rd, 4th, 5th philosophies that will emerge in the coming years.

Maybe it will be less about benchmarks and more about different ideas of what working-with-ai means

karmasimida

PeterStuer

16 hours ago

Easiest way I know is to just use LMStudio. Just download and press play :). Optional, but recommended, increase the context length to 262144 if you have the DRAM available. It will definitely get slower as your interaction prolongs, but (at least for me) still tolerable speed.

mathrawka

16 hours ago

a day ago

The problem is if you're using subagents, the only way to interject is often to press escape multiple times which kills all the running subagents. All I wanted to do was add a minor steering guideline.

This might be better with the new teams feature.

Skwrm

21 hours ago

They actually made a change a few weeks ago that made subagents more steerable

When they ask approval for a tool call, press down til the selector is on "No" and press tab, then you can add any extra instructions

cruffle_duffle

a day ago

That is so annoying too because it basically throws away all the work the subagent did.

2 days ago

> it's a waste of time to steer them

It's not a waste of time, it's a responsibility. All things need steering, even humans -- there's only so much precision that can be extrapolated from prompts, and as the tasks get bigger, small deviations can turn into very large mistakes.

There's a balance to strike between micro-management and no steering at all.

adw

a day ago

The prompt is decreasingly relevant. The verification environment you have is what actually matters.

freakynit

a day ago

I think this all comes down to information.

Most prompts we give are severely information-deficient. The reason LLMs can still produce acceptable results is because they compensate with their prior training and background knowledge.

The same applies to verification: it's fundamentally an information problem.

You see this exact dynamic when delegating work to humans. That's why good teams rely on extremely detailed specs. It's all a game of information.

adrianN

19 hours ago

Having prompts be information deficient is the whole point of LLMs. The only complete description of a typical programming problem is the final code or an equivalent formal specification.

freakynit

10 hours ago

Exactly the point. But, LLM's miss that human intuition part.

bcarv

2 days ago

I don't think AI for programming is a passing fad

jondwillis

21 hours ago

Who hurt you?

Also what are you even proposing/advocating for here?

This meta-state-of-company context is just as capturable as anything else with the right lines of questioning and spyware and UI/UX to elicit it.

rapind

a day ago

Maybe some day, but as a claude code user it makes enough pretty serious screw ups, even with a very clearly defined plan, that I review everything it produces.

You might be able to get away without the review step for a bit, but eventually (and not long) you will be bitten.

jaggederest

a day ago

I use that to feed back into my spec development and prompting and CI harnesses, not steering in real time.

Every mistake is a chance to fix the system so that mistake is less likely or impossible.

I rarely fix anything in real time - you review, see issues, fix them in the spec, reset the branch back to zero and try again. Generally, the spec is the part I develop interactively, and then set it loose to go crazy.

This feels, initially, incredibly painful. You're no longer developing software, you're doing therapy for robots. But it delivers enormous compounding gains, and you can use your agent to do significant parts of it for you.

Terretta

a day ago

> You're no longer developing software, you're doing therapy for robots.

Or, really, hacking in "learning", building your knowhow-base.

> But it delivers enormous compounding gains, and you can use your agent to do significant parts of it for you.

Strong yes to both, so strong that it's curious Claude Code, Codex, Claude Cowork, etc., don't yet bake in an explicit knowledge evolution agent curating and evolving their markdown knowledge base:

https://github.com/anthropics/knowledge-work-plugins

Unlikely to help with benchmarks. Very likely to improve utility ratings (as rated by outcome improvements over time) from teams using the tools together.

For those following along at home:

This is the return of the "expert system", now running on a generalized "expert system machine".

rapind

a day ago

I assumed you'd build such a massive set of rules (that claude often does not obey) that you'd eat up your context very quickly. I've actually removed all plugins / MCPs because they chewed up way too much context.

jaggederest

a day ago

It's as much about what to remove as what to add. Curation is the key. Skills also give you some levers to get the kind of context-sensitive instruction you need, though I haven't delved too deeply into them. My current total instruction set is around ~2500 tokens at the moment

vidarh

20 hours ago

Reviewing what it produces once it thinks it has met the acceptance criteria and the test suite passes is very different from wasting time babysitting every tiny change.

rapind

15 hours ago

True, and that's usually what I'm doing now, but to be honest I'm also giving all of it's code at least a cursory glance.

Some of the things it occasionally does:

- Ignores conventions (even when emphasized in the CLAUDE.md)

- Decides to just not implement tests if gets spins out on them too much (it tells you, but only as it happens and that scrolls by pretty quick)

- Writes badly performing code (N+1)

- Does more than you asked (in a bad way, changing UIs or adding cruft)

- Makes generally bad assumptions

I'm not trying to be overly negative, but in my experience to date, you still need to babysit it. I'm interested though in the idea of using multiple models to have them perform independent reviews to at least flag spots that could use human intervention / review.

vidarh

37 minutes ago

Sure, but non of those things requires you to watch it work. They're all easy to pick up on when reviewing a finished change, which ideally should come after it's instructions have had it run linters, run sub agents that verify it has added tests, run sub agents doing a code review.

I don't want to waste my time reviewing a change the model can still significantly improve all by itself. My time costs far more than the models.

_zoltan_

2 hours ago

then you're using it wrong, to be frank with you.

you give it tools so it can compile and run the code. then you give it more tools so it can decide between iterations if it got closer to the goal or not. let it evaluate itself. if it can't evaluate something, let it write tests and benchmark itself.

I guarantee that if the criteria is very well defined and benchmarkable, it will do the right thing in X iterations.

(I don't do UI development. I do end-to-end system performance on two very large code bases. my tests can be measured. the measure is very simply binary: better or not. it works.)

halfcat

a day ago

> given the proper framing

This sounds like never. Most businesses are still shuffling paper and couldn’t give you the requirements for a CRUD app if their lives depended on it.

You’re right, in theory, but it’s like saying you could predict the future if you could just model the universe in perfect detail. But it’s not possible, even in theory.

If you can fully describe what you need to the degree ambiguity is removed, you’ve already built the thing.

If you can’t fully describe the thing, like some general “make more profit” or “lower costs”, you’re in paper clip maximizer territory.

jondwillis

21 hours ago

> If you can fully describe what you need to the degree ambiguity is removed, you’ve already built the thing.

Trying to get my company to realize this right now.

Probably the most efficient way to work, would be on a video call including the product person/stakeholder, designer, and me, the one responsible for the actual code, so that we can churn through the now incredibly fast and cheap implementation step together in pure alignment.

You could probably do it async but it’s so much faster to not have to keep waiting for one another.

retinaros

14 hours ago

good luck.

_zoltan_

2 hours ago

I've been working on very complex problems with this model and the results I have have surprised people over and over again.

xXSLAYERXx

21 hours ago

I've been using codex for one week and I have been the most productive I have ever been. Small prs, tight rules, I get almost exactly what I want. Things tend to go sideways when scope creeps into my request. But I just close the PR instead of fighting with the agent. In one week: 28 prs, 26 merged. Absolutely unreal.

vidarh

20 hours ago

I will personally never consider using an agent that can't be easily pushed toward working on its own for long periods (hours) at a time. It's a total waste of time for me to babysit the LLM.

sejje

2 days ago

yeah I'm mostly just talking about how they're framing it: "Claude Opus 4.6 is designed for longer-running, agentic work — planning complex tasks more carefully and executing them with less back-and-forth from the user"

I guess its also quite interesting that how they are framing these projects are opposite from how people currently perceive them and I guess that may be a conscious choice...

giancarlostoro

2 days ago

I get what you mean now, I like that to be fair, sometimes I want Claude to tell me some architectural options, so I ask it so I can think about what my options are, sometimes I rethink my problem if I like Claudes conclusion.

jhancock

2 days ago

Good breakdown.

I usually want the codex approach for code/product "shaping" iteratively with the ai.

Once things are shaped and common "scaling patterns" are well established, then for things like adding a front end (which is constantly changing, more views) then letting the autonomous approach run wild can *sometimes* be useful.

I have found that codex is better at remembering when I ask to not get carried away...whereas claude requires constant reminders.

techbro_1a

2 days ago

> With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.

This is true, but I find that Codex thinks more than Opus. That's why 5.2 Codex was more reliable than Opus 4.5

bob1029

2 days ago

I think there is another philosophy where the agent is domain specific. Not that we have to invent an entirely new universe for every product or business, but that there is a small amount of semi-customization involved to achieve an ideal agent.

2 days ago

2 days ago

I am definitely using Opus as an interactive collaborator that I steer mid-execution, stay in the loop and course correct as it works.

I mean Opus asks a lot if he should run things, and each time you can tell it to change. And if that's not enough you can always press esc to interrupt.

granzymes

2 days ago

I think Anthropic rushed out the release before 10am this morning to avoid having to put in comparisons to GPT-5.3-codex!

The new Opus 4.6 scores 65.4 on Terminal-Bench 2.0, up from 64.7 from GPT-5.2-codex.

GPT-5.3-codex scores 77.3.

the_duke

2 days ago

I do not trust the AI benchmarks much, they often do not line up with my experience.

a day ago

Not the OP, but I use the same approach.

a day ago

You should check out the PAL MCP and then also use this process, its super solid: https://github.com/glittercowboy/get-shit-done

The way "Phases" are handled is incredible with research then planning, then execution and no context rot because behind the scenes everything is being saved in a State.md file...

I'm on Phase 41 of my own project and the reliability and almost absence of any error is amazing. Investigate and see if its a fit for you. The PAL MCP you can setup to have Gemini with its large context review what Claude codes.

aurareturn

2 days ago

5.2 Codex became my default coding model. It “feels” smarter than Opus 4.5.

I use 5.2 Codex for the entire task, then ask Opus 4.5 at the end to double check the work. It's nice to have another frontier model's opinion and ask it to spot any potential issues.

Looking forward to trying 5.3.

koakuma-chan

2 days ago

Opus 4.5 is more creative and better at making UIs

hypercube33

a day ago

Unless it's scroll bar theming then my God it's bad. it told me it gives up. Gemini 3 got stuck but the right prompt it did work.

fooker

2 days ago

Yeah, these benchmarks are bogus.

Cost to Run Artificial Analysis Intelligence Index:

GPT-5.2 Codex (xhigh): $3244

2 days ago

A key aspect of ARC AGI is to remain highly resistant to training on test problems which is essential for ARC AGI's purpose of evaluating fluid intelligence and adaptability in solving novel problems. They do release public test sets but hold back private sets. The whole idea is being a test where training on public test sets doesn't materially help.

The only valid ARC AGI results are from tests done by the ARC AGI non-profit using an unreleased private set. I believe lab-conducted ARC AGI tests must be on public sets and taken on a 'scout's honor' basis that the lab self-administered the test correctly, didn't cheat or accidentally have public ARC AGI test data slip into their training data. IIRC, some time ago there was an issue when OpenAI published ARC AGI 1 test results on a new model's release which the ARC AGI non-profit was unable to replicate on a private set some weeks later (to be fair, I don't know if these issues were resolved). Edit to Add: Summary of what happened: https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...

I have no expertise to verify how training-resistant ARC AGI is in practice but I've read a couple of their papers and was impressed by how deeply they're thinking through these challenges. They're clearly trying to be a unique test which evaluates aspects of 'human-like' intelligence other tests don't. It's also not a specific coding test and I don't know how directly ARC AGI scores map to coding ability.

janalsncm

2 days ago

More fundamentally, ARC is for abstract reasoning. Moving blocks around on a grid. While in theory there is some overlap with SWE tasks, what I really care about is competence on the specific task I will ask it to do. That requires a lot of domain knowledge.

As an analogy, Terence Tao may be one of the smartest people alive now, but IQ alone isn’t enough to do a job with no domain-specific training.

nurettin

2 days ago

semiinfinitely

a day ago

much stronger

Xunjin

a day ago

Don't forget that is also Harder, Better, Faster.

2 days ago

Please no, I don’t need my quick prototypes hardened against every perceivable threat.

comex

So far these look like successful predictions.

ainch

2 days ago

Moltbot is an attempt to do that. Would you hire it as a personal secretary and entrust all your personal data to it?

danpalmer

2 days ago

Only people who haven't had a secretary would think it's a personal secretary.

Like, it can't even answer the phone.

fragmede

a day ago

There are plenty of companies that sell an AI assistant that answers the phone as a service, they just aren't named OpenAI or Anthropic. They'll let callers book an appointment onto your calendar, even!

danpalmer

a day ago

No, there are companies that sell voice activated phone trees, but no one is getting results out of unstructured, arbitrary phone call answering with actions taken by an LLM.

I'm sure there are research demos in big companies, I'm sure some AI bro has done this with the Twilio API, but no one is seriously doing this.

All it takes is one "can you take this to the post office", the simplest, of requests, and you're in a dead end of at best refusal, but more likely role-play.

PranayKumarJain

a day ago

Agreed that “unstructured arbitrary phone calls + arbitrary actions” is where things go to die.

What does work in production (at least for SMB/customer-support style calls) is making the problem less magical: 1) narrow domain + explicit capabilities (book/reschedule/cancel, take a message, basic FAQs) 2) strict tool whitelist + typed schemas + confirmations for side effects 3) robust out-of-scope detection + graceful handoff (“I can’t do that, but I can X/Y/Z”) 4) real logs + eval/test harnesses so regressions get caught

Once you do that, you can get genuinely useful outcomes without the role-play traps you’re describing.

We’ve been building this at eboo.ai (voice agents for businesses). If you’re curious, happy to share the guardrails/eval setup we’ve found most effective.

fragmede

a day ago

https://www.instagram.com/p/DMfpj0hM7e0/

is obviously a staged demo but it seems pretty serious for him. He's wearing a suit and everything!

https://www.instagram.com/p/DK8fmYzpE1E/

seems like research by some dude (no disrespect, he doesn't seems like he's at big company though).

https://www.instagram.com/p/DH6EaACR5-f/

could be astroturf, but seems maybe a little bit serious.

Davidzheng

a day ago

It's important to remember though (this is besides the point for what you're saying) that job displacement of things like secretaries from AI do not require it to be a near perfect replacement. There are many other factors (for example if it's much cheaper and can do part of the work it can dramatically shrink demand as people can shift to an imperfect replacement in AI)

Rudybega

2 days ago

I think they immediately corrected their median timelines for takeoff to 2028 upon releasing the article (I believe there was a math mistake or something initially), so all those dates can probably be bumped back a few months. Regardless, the trend seems fairly on track.

speed_spread

a day ago

People have been in love with machines for a long time. It's just that the machines didn't talk back so we didn't grant them the "partner" status. Wait for car+LLM and you'll have a killer combo.

fragmede

21 hours ago

KITT, is that you?

YawningAngel

2 days ago

I don't think generative AI is even close to making model development 50% faster

So the perceived rate of change might be linear.

It's definitely true for some things such as wealth:

- $2000 is a lot of you have $1000.

- It's a substantial improvement of you have $10000.

- It's not a lot you have $1m

- It does not matter if you have $1b

varjag

a day ago

$2000 is not substantial over $1b on the linear scale

gf000

2 days ago

If it's exponential growth. It may just as well be some slow growth and continue to be so.

thrance

2 days ago

I think the limiting factor is capital, not code. And I doubt GPTX is anymore competent at raising funds than the other, fleshy, snake oilers...

aaaalone

2 days ago

I'm only saying no to keep optimistic tbh

manmal

2 days ago

I guess humans were involved in all that, so how is that anything but tool use?

8note

2 days ago

making the specifications is still hard, and checking how well results match against specifications is still hard.

2 days ago

> As long the tactics are legal ( i.e. not corporate espionage, bribes etc), the no holds barred full free market competition is the best thing for the market and the consumers.

The implicit assumption here is that we have constructed our laws so skillfully that the only path to win a free market competition is by producing a better product, or that all efforts will be spent doing so. This is never the case. It should be self-evident from this that there is a more productive way for companies to compete and our laws are not sufficient to create the conditions.

thethimble

2 days ago

The consumers are getting huge wins.

Model costs continue to collapse while capability improves.

Competition is fantastic.

mrandish

2 days ago

> The consumers are getting huge wins.

However, the investors currently subsidizing those wins to below cost may be getting huge losses.

credit_guy

a day ago

Yes, but that's the nature of the game, and they know it.

doom2

a day ago

You were downvoted but I don't understand why. This is the purpose/spirit of antitrust law [1]

[1] https://en.wikipedia.org/wiki/United_States_antitrust_law

manquer

a day ago

I have long since given up trying to understand voting patterns in HN :)

---

Sadly it was the core of anti-trust law, since 1970s things have changed.

The predominant view today (i.e. Chicago School view) in both judiciary and executive are influenced by Justice Bork's ideas that consumer benefit being the deciding factor over company's actions.

Consumer benefits becomes opinions of projections by either side of a case about the future, whereas company actions like collusion, pricing fixing or M&A are hard facts with strong evidence. Today it is all vibes on how the courts (or executive) feel .

So now we have Government sanctioned cartels like in Aviation Alliances [1] that is basically based on convoluted catch-22-esque reasoning because it favors strategic goals even though it would be a violation of the letter/spirit of the law.

[1] https://www.transportation.gov/office-policy/aviation-policy...

IhateAI

2 days ago

A sign of the inevitible implosion !

cedws

2 days ago

I wish they’d just stop pretending to care about safety, other than a few researchers at the top they care about safety only as long as they aren’t losing ground to the competition. Game theory guarantees the AI labs will do what it takes to ensure survival. Only regulation can enforce the limits, self policing won’t work when money is involved.

thethimble

2 days ago

As long as China continues to blitz forward, regulation is a direct path to losing.

cedws

2 days ago

Define "losing."

Europe is prematurely regarded as having lost the AI race. And yet a large portion of Europe live higher quality lives compared to their American counterparts, live longer, and don't have to worry about an elected orange unleashing brutality on them.

thethimble

2 days ago

If the world is built on AI infrastructure (models, compute, etc.) that is controlled by the CCP then the west has effectively lost.

This may lead to better life outcomes, but if the west doesn't control the whole stack then they have lost their sovereignty.

This is already playing out today as Europe is dependent on the US for critical tech infrastructure (cloud, mail, messaging, social media, AI, etc). There's no home grown European alternatives because Europe has failed to create an economic environment to assure its technical sovereignty.

fakedang

2 days ago

Europe has already lost the tech race - their cloud systems that their entire welfare states rely upon are all hosted on servers hosted by American private companies, which can turn them off with a flick of a switch if and when needed.

When the welfare state, enabled by technology, falls apart, it won't take long for European society to fall apart. Except France maybe.

clows

a day ago

welfare state enabled by cloud services/technology?

I'm not sure if you know less about europe or tech.

> Except France maybe.

sure

pixl97

2 days ago

You mean all paths are direct paths to losing.

vovavili

2 days ago

The last thing I would want is for excessively neurotic bureaucrats to interfere with all the mind-blowing progress we've had in the last couple of years with LLM technology.

SunshineTheCat

2 days ago

I've always been fascinated to see significantly more people talking about using Claude than I see people talking about Codex.

I know that's anecdotal, but it just seems Claude is often the default.

I'm sure there are key differences in how they handle coding tasks and maybe Claude is even a little better in some areas.

However, the note I see the most from Claude users is running out of usage.

Coding differences aside, this would be the biggest factor for me using one over the other. After several months on Codex's $20/mo. plan (and some pretty significant usage days), I have only come close to my usage limit once (never fully exceeded it).

That (at least to me) seems to be a much bigger deal than coding nuances.

timpera

2 days ago

In my experience, OpenAI gives you unreasonable amounts of compute for €20/month. I am subscribed to both and Claude's limits are so tiny compared to ChatGPT's that it often feels like a rip-off.

Claude also doesn't let you use a worse model after you reach your usage limits, which is a bit hard to swallow when you're paying for the service.

replwoacause

a day ago

Same experience here. I started our devoutly using Claude but ran into some many limits that I switched back to ChatGPT and it's been night and day. I haven't even really been able to play with the Opus model on my Pro plan because it devours usage and then blocks me for X hours until it resets, costing me a work day. OpenAI has never done that to me. In fact, Codex just churned away for 2 hours on a task and I'm still using it without hitting a limit. I used to love using Claude but the limits are too prohibitive.

appsoftware

19 hours ago

Claude when used via Github Co-Pilot is much better for useage allowance. I used Opus 4.5 for a months worth of development and only just hit 90 pct of the pro $40 per month allowance.

lm28469

a day ago

If their pay as you go api token prices reflect their internal costs then it makes sense, but it could also be that claude makes money while gpt sells at loss to stay on top. Claude is way more expensive overall, and way more limited with flat rate subscriptions

opus: 5/25 gpt: 1.75/14

I think it really is just a stylistic preference. But the Claude people seem convinced Claude is better. Having spent a bunch of time analyzing both I just don’t see it.

AstroBen

2 days ago

gspetr

a day ago

Any estiimates on how much it cost you? In terms of total real world time, money, and time spent by the agents.

bgirard

a day ago

About ~$300: $200 for Claude max subscription $20 for Vercel $20 for Codex $20 for Meshy

I think these days the $200 Max subscription wouldn't be needed. I bet with these latest models you can make due with mixing two $20/mo subscriptions.

Real time was 2 weeks of watching the agents while watching TV and playing games, waiting for limit resets, etc... Very little decided focused time.

tosh

2 days ago

Terminal Bench 2.0

  | Name                | Score |
  |---------------------|-------|
  | OpenAI Codex 5.3    | 77.3  |
  | Anthropic Opus 4.6  | 65.4  |

greenfish6

2 days ago

yea but i feel like we are over the hill on benchmaxxing, many times a model has beaten anthropic on a specific bench, but the 'feel' is that it is still not as good at coding

falloutx

2 days ago

When Anthropic beats Benchmarks its somehow earned, when OpenAi games it, its somehow about not feeling good at coding.

apetresc

a day ago

I mean… yeah? It sounds biased or whatever, but if you actually experience all the frontier models for yourself, the conclusion that Opus just has something the others don’t is inescapable.

manmal

a day ago

Opus is really good at bash, and it’s damn fast. Codex is catching up on that front, but it’s still nowhere near. However, Codex is better at coding - full stop.

AstroBen

2 days ago

'feel' is no more accurate

not saying there's a better way but both suck

thethimble

2 days ago

Speak for yourself. I've been insanely productive with Codex 5.2.

With the right scaffolding these models are able to perform serious work at high quality levels.

helloplanets

2 days ago

I don’t think this is even remotely true in practice.

I honestly I have no idea what benchmarks are benchmarking. I don’t write JavaScript or do anything remotely webdev related.

The idea that all models have very close performance across all domains is a moderately insane take.

At any given moment the best model for my actual projects and my actual work varies.

Quite honestly Opus 4.5 is proof that benchmarks are dumb. When Opus 4.5 released no one was particularly excited. It was better with some slightly large numbers but whatever. It took about a month before everyone realized “holy shit this is a step function improvement in usefulness”. Benchmarks being +15% better on SWE bench didn’t mean a damn thing.

2 days ago

In the 1930s, when electronic calculators were first introduced, there was a widespread belief that accounting as a career was finished. Instead, the opposite became true. Accounting as a profession grew, becoming far more analytical/strategic than it had been previously.

You are correct that these models primarily address problems that have already been solved. However, that has always been the case for the majority of technical challenges. Before LLMs, we would often spend days searching Stack Overflow to find and adapt the right solution.

Another way to look at this is through the lens of problem decomposition as well. If a complex problem is a collection of sub-problems, receiving immediate solutions for those components accelerates the path to the final result.

For example, I was recently struggling with a UI feature where I wanted cards to follow a fan-like arc. I couldn't quite get the implementation right until I gave it to Gemini. It didn't solve the entire problem for me, but it suggested an approach involving polar coordinates and sine/cosine values. I was able to take that foundational logic turn it into a feature I wanted.

Was it a 100x productivity gain? No. But it was easily a 2x gain, because it replaced hours of searching and waiting for a mental breakthrough with immediate direction.

There was also a relevant thread on Hacker News recently regarding "vibe coding":

2 days ago

The downside is that you miss the chance to brush up on your math skills, skills that could help you understand and express more complicated requirements.

...This may still be worth it. In any case it will stop being a problem once the human is completely out of the loop.

edit: but personally I hate missing out on the chance to learn something.

pkoiralap

a day ago

That would indeed be the case if one has never learned the stuff. And I am all in for not using AI/LLM for homework/assignments. I don't know about others, but when I was in school, they didn't let us use calculators in exams.

Today, I know very well how to multiply 98123948 and 109823593 by hand. That doesn't mean I will do it by hand if I have a calculator handy.

Also, ancient scholars, most notably Socrates via Plato, opposed writing because they believed it would weaken human memory, create false wisdom, and stifle interactive dialogue. But hey, turns out you learn better if you write and practice.

fragmede

a day ago

In later classes in school, the calculator itself didn't help. If you didn't know the material well enough, you didn't know what to put into the calculator.

xandrius

2 days ago

Why even come to this site if you're so anti-innovation?

Today with LLMs you can literally spend 5 minutes defining what you want to get, press send, go grab a coffee and come back to a working POC of something, in literally any programming language.

This is literally stuff of wonders and magic that redefines how we interface with computers and code. And the only thing you can think of is to ask if it can do something completely novel (that it's so hard to even quantity for humans that we don't have software patents mainly for that reason).

And the same model can also answer you if you ask it about maths, making you an itinerary or a recipe for lasagnas. C'mon now.

cowl

a day ago

Agree but you are talking about a POC, and he is talking about reliable, working software. this phase of LLM are perfect for POCs and there you can have 10x speedup, no question. But going from a POC to a working reliable software is where most of our time is spent anyway even without LLMS.

With LLMs this phase becomes worse. we speedup 10x the poc time, we slow down almost as much in the next phases, because now you have a poc of 10k lines that you are not familiar with at all, that have to pay way more attention at code review, that have to bolt on security as an afterthought (a major slowdown now, so much so that there are dedicated companies whose business model has become fixing Security problems caused by LLM POCs). Next phase, POCs are almost always 99% happy path. Bolt on edge case as another after thought and because you did not write any of those 10k lines how do you even know what edge cases might be neccesary to cover? maybe you guessed it rigth, spend even more time studing the unfamiliar code.

We use LLM extensivly now in our day to day, development has become somewhat more enjoyable but there is, at least as of now, no real increase in final delivry times, we have just redestributed where effort and time goes.

xandrius

a day ago

At our company we use AI extensively to see if we missed edge cases and it does a pretty good job in pointing us towards places which could be handled better.

I know we all think we are always so deep into absolutely novel territory, which only our beautiful mind can solve. But for the vast majority of work done in the world, that work is transformative. You take X + Y and you get Z. Even with brand new api, you can just slap in the documentation and navigate it in order of magnitude faster than without.

I started using it for embedded systems doing something which I could literally find nothing about in rust but plenty in arduino/C code. The LLM allowed me to make that process so much faster.

manmal

a day ago

> no real increase in final delivry times

That’s not true though. The ability to de-risk concepts within a day instead of weeks will speed up the timeline tremendously.

legulere

2 days ago

I don't think that the user you are responding to is anti-innovation, but rather points out that the usefulness of AI is oversold.

I'm using Copilot for Visual Studio at work. It is useful for me to speed some typing up using the auto-complete. On the other hand in agentic mode it fails to follow simple basic orders, and needs hand-holding to run. This might not be the most bleeding-edge setup, but the discrepancy between how it's sold and how much it actually helps for me is very real.

ifwinterco

2 days ago

I think copilot is widely considered to be fairly rubbish, your description of agentic coding was also my experience prior to ~Q3 2025, but things have shifted meaningfully since then

pcloadlett3r

2 days ago

Copilot has access to the latest models like Opus 4.6 in agentic mode as well. It's got certain quirks and I prefer a TUI myself but it isn't radically different.

fragmede

a day ago

Even at Microsoft they're using Claude Code over Copilot, so I think it's different enough.

Rapzid

a day ago

You are so behind the curve if you think copilot is mostly rubbish. That's a 4+ month old take.

As maybe a more poignant example- I used to do a lot of on-campus recruiting when I worked in HFT, and I think I disappointed a lot of people when I told them my day to day was pretty mundane and consisted of banging out Jiras, usually to support new exchanges, and/or securities we hadn't traded previously. 3% excitement, 97% unit tests and covering corner cases.

turblety

2 days ago

I'm not sure if you'd call it a productivity gain, but I have to host our infrastructure on a system that runs processes entirely in Linux userland.

To bridge the containers in userland only, without root, I had to build: https://github.com/puzed/wrapguard

I'm sure it's not perfect, and I'm sure there are lots of performance/productivity gains that can be made, but it's allowed us to connect our CDN based containers (which don't have root) across multiple regions, talking to each other on the same Wireguard network.

No product existed that I could find to do this (at least none I could find), and I could never build this (within the timeframe) without the help of AI.

epolanski

a day ago

I know for a fact I deliver more and at higher quality and while being less tired. Mental energy is also a huge factor, because after digging in code for half a day i'd be exhausted.

People should stop focusing on vibecoding and realize how many things LLMs can do such as investigating messy codebases that took me ages of writing paper notes to connect the dots, finding information about dependencies just by giving them access to replacing painful googling and GitHub issues or outdated documentation digging, etc.

Hell I can jump in projects I know nothing about, copy paste a Jira ticket, investigate, have it write notes, ask questions and in two hours I'm ready to implement with very clear ideas about what's going on. That was multi day work till few years ago.

I can also have it investigate the task at hand and automatically find the many unknowns unknowns that as usual work tasks have, which means cutting deliveries and higher quality software. Getting feedback early is important.

LLMs are super useful even if you don't make them author a single line of code.

And yes, they are increasingly good at writing boilerplate if you have a nice and well documented codebase thus sparing you time. And in my career I've written tons of mostly boilerplate code, that was another api, another form, another table.

And no, this is not vibe coding. I review every single line, I use all of its failures to write better architectural and coding practices docs which further improves the output at each iteration.

Honestly I just don't get how people can miss the huge productivity bonus you get, even if you don't have it edit a singl line of code.

revahage

2 days ago

beernet

2 days ago

Can you point me to a human written program an LLM cannot write? And no, just answering with a massively large codebase does not count because this issue is temporary.

Some people just hate progress.

HAL3000

2 days ago

> Can you point me to a human written program an LLM cannot write?

Sure:

"The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.

As one particularly challenging example, Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase (This is only the case for x86. For ARM or RISC-V, Claude’s compiler can compile completely by itself.)"[1]

1. https://www.anthropic.com/engineering/building-c-compiler

svantana

2 days ago

Pretty much any software that people pay for? If LLMs could clone an app, why would anyone still pay good money for the original?

falloutx

2 days ago

Even a normal website like landonorris.com. Try copying all those effects with AI.

Another example: Red Dead Redemption 2

Another one: Roller coaster tycoon

Another one: ShaderToy

avaer

2 days ago

I wish I could agree with you, but as a game dev, shader author, and occasional asm hacker, I still think AIs have demonstrated being perfectly capable of copying "those effects". It's been trained on them, of course.

You're not gonna one-shot RD2, but neither will a human. You can one-shot particles and shader passes though.

falloutx

2 days ago

I didnt say one shot it, coding agents have been out for more than couple years and yet we cant point to single Good piece of software built by it.

Davidzheng

a day ago

"coding agents have been out for more than couple years"?????

[0] https://youtu.be/f2FnYRP5kC4

falloutx

2 days ago

you have shown me 0 examples, I showed actual examples to the given question. Your answers have just been "AI can also do this" but gave no actual proof.

satvikpendem

2 days ago

The examples are in the video I linked, as I said, if you don't bother to watch it then I'm not sure what to tell you. As I said for games I don't know and won't presume to search up some random vibe coded game if I don't have personal experience with how LLMs handle games, but for web development, the sites I've made and seen made look pretty good.

18 hours ago

Actually, LLM fiction writing is awfully bad. But it does help with ideas!

I'm trying my hardest to make it feel high quality instead of just slop.

jason_oster

a day ago

Personally, I’ve only been using a coding agent for a few months infrequently, so I have nothing to show for it. (It is not 100x productivity, that’s absurd.)

But I have plenty of examples of really atrocious human written code to show you! TheDailyWtf has been documenting the phenomenon for decades.

eviks

2 days ago

Great question, here is the link from the future:

silveraxe93

a day ago

Yeah, Claude Code.

xvector

a day ago

I work for a big tech company, most of our code today is written by agents. This includes backend infra and frontend app/UX code.

It satisfies your relevant criteria: LLM-written, reliable, non-trivial.

No major program is perfectly reliable so I wouldn't call it that (but we have fewer incidents vs human-written code), and "useful" is up to the reader (but our code is certainly useful to us.)

logicprog

a day ago

> single useful ... preferably reliable, program that solves a non-trivial problem that hasn't been solved before a bunch of times in publicly available code

I see this originality criteria appended a lot, and

1) I don't think it's representative of the actual requirements for something to be extremely useful and productivity-enhancing, even revolutionary, for programming. IDE features, testing, code generation, compilers — all of these things did not really directly help you produce more original solutions to original problems, and yet they were huge advances in program or productivity.

I mean like. How many such programs are there in general?

The vast vast majority of programs that are written are slight modifications, reorganizations, or extensions, of one or more programs that are already publicly available a bunch of times over.

Even the ones that aren't could fairly easily be considered just recombinations of different pieces of programs that have been written and are publicly available dozens or more times over, just different parts of them combined in a different order.

Hell, most code is a reorganization or recombination of the exact same types of patterns just in a different way corresponding to different business logic or algorithms, if you want to push it that far.

And yet plenty of deeply unoriginal programs are very useful and fill a useful niche, so they get written anyway.

2) Nor is it a particularly satisfiable goal. If there aren't, as a percentage, very many reliable, useful, and original programs that have been written in the decades since open source became a thing, why would we expect a five-year-old technology to have done so, especially when, obviously, the more reliable original and broadly useful programs have already written, the narrower the scope for new ones to satisfy the originality criteria?

3) Nor is it actually something that we would expect even under the hypothesis that agents make people significantly more productive at programs. Even if agents give 100x productivity gains to writing a useful tool or service or program or improving existing ones with new features. We still wouldn't expect them to give necessarily very many much productivity gains at all to writing original programs, precisely because of their current technology is a product of deep thinking, understanding a specific domain, seeing a niche, inspiration, science, talent and luck much more than the ability to even do productive engineering.

Rover222

a day ago

Not 100x but absolutely 4x to 5x increase in productivity for everyone on team on a large enterprise codebase that serves the military a lot of serious clients.

To deny at least that level of productivity at this point, you have to have your head in the sand.

mrcwinn

a day ago

No, but I have seen privately available code that matches this description.

jstummbillig

a day ago

It's so interesting that I start to feel a change, that is developing as a separate thing to capability. Previously, yeah sure, things changed but models got so outrageously better at the basic things that I simply wouldn't care.

Now... increasingly it's like changing a partner just so slightly. I can feel that something is different and it gives me pause. That's probably not a sign of the improvement diminishing. Maybe more so my capability to appreciate them.

I can see how one might get from here to the whole people being upset about 4o thing.

RivieraKid

2 days ago

Do software engineers here feel threatened by this? I certainly am. I'm surprised that this topic is almost entirely missing in these threads.

AstroBen

2 days ago

No. It turns into a complete mess without someone that knows what they're doing to steer it. It's an upgrade to autocomplete

energy123

a day ago

Unless you're retiring in less than 5 years this is extremely short sighted.

lurking_swe

a day ago

It’s also silly to try predicting the future 5 years from now, IMO. Historically progress is very unpredictable. It often plateaus when you least expect it.

It’s good to be cautious and not in denial, but i usually ignore people who talk so authoritatively about the future. It’s just a waste of time. Everyone thinks they are right.

My recommendation is have a very generous emergency fund and do your best to be effective at work. That’s the only thing you can control and the only thing that matters.

xvector

a day ago

Or just move into technical leadership or management/executive permissions.

In any case, everyone should be riding the AI wave! Anyone doing so should have enough to retire five years from now.

benry1

a day ago

I'm reading Maintenance of Everything and it has a section about the switch from artisan-crafted weapons to making uniform parts that feels comparable to this.

French military had pioneered a way to make fully interchangeable weapon parts, but the French public fought back in fear of the jobs of the artisans who used to hand-make weapons. Over the next 20 years they completely lost their edge on the battlefield, nothing could be repaired in the field. Other countries embraced the change, could repair anything in the field with cheap and precise spare parts, and soon fostered in the industrial revolution.

The artisans stopped being people who made weapons, the artisans became people who made machines that made weapons.

energy123

7 hours ago

> The artisans stopped being people who made weapons, the artisans became people who made machines that made weapons.

Although many French artisans become unemployed because British industrial productivity made them uncompetitive. It was one of the causes of the French Revolution.

a day ago

> People still pay thousands of dollars for wedding photographers even though everyone at the wedding also has a camera and many are taking their own pictures.

Wedding photography used to be the lowest in the pecking order of professional photography. Now all the photojournalists, travel magazine and corporate events photographers are as good as extinct. Even the arts market for photography been on decline for years.

AstroBen

19 hours ago

> I guess your premise is mass unemployment is not worth worrying about, so okay then

My point wasn't that it's not a big deal. My point there is that if AI ends up taking a large % of white collar work you're going to have a huge portion of the population in the same boat. Maybe an overly optimistic view but that'll end up forcing change through politics

..I also think this is a ridiculously low % chance of happening and it would take something close to AGI to bring about. I don't know how you can use AI regularly and think we're anywhere close to that

Contracting an incurable illness that renders me blind and thus unable to work is just as likely and not something I spend time worrying about

> Marginal changes in productivity can make huge impacts to industries employment rates

Maybe? We also have Jevon's paradox. Software is incredibly expensive to build right now - how many more applications for it can people find if the cost halves?

netdevphoenix

a day ago

> I'm struggling to think of any scenario that doesn't also put most white collar professions out of work alongside me

anthonypasq

19 hours ago

i think i fundamentally agree that the demand for code is essentially infinite. Code has just been notoriously expensive and therefore it could only de deployed towards the most economically efficient activities. this is now changing.

worldsavior

2 days ago

No. AI does not work well enough, you still need a person to look on it and CODE. It probably never will, until AGI which probably also in my opinion will never come.

dude250711

2 days ago

It's a super-special AI tier that can replace developers and other grunts, yet somehow cannot replace managers and C-suite.

It can only replace whoever is not writing a fat cheque to it.

nickandbro

20 hours ago

I have found GPT 5.3-Codex to do exceedingly well when working with graphics rendering pipelines. They must have better training data or RL approaches than Antropic as I have given the same prompt and config to Opus 4.6 and it seems to have added unwanted rendering artifacts. This may be just an issue specific to my use case, but wonder since OpenAI is partners with MSFT, which makes lots of games, that this may be an area they heavily invested in

fishpham

2 days ago

Model card: https://cdn.openai.com/pdf/23eca107-a9b1-4d2c-b156-7deb4fbc6...

zen4ttitude

4 hours ago

Does anyone know more about the benchmark? 60% accuracy gets a drumroll? How would Claude do? How would a human do? I tried the previous version and was not impressed. I went back to Claude that is very hard to beat, and versatile with context enrichment.

trilogic

2 days ago

When 2 multi billion giants advertise same day, it is not competition but rather a sign of struggle and survival. With all the power of the "best artificial intelligence" at your disposition, and a lot of capital also all the brilliant minds, THIS IS WHAT YOU COULD COME UP WITH?

Interesting

sdf2erf

2 days ago

Yeah they are both fighting for survival. No surprise really.

Need to keep the hype going if they are both IPO'ing later this year.

thethimble

2 days ago

Google didn't announce $185 billion in capex to do cataloguing and flash cards.

causalmodels

2 days ago

Google didn't buy 30% of Anthropic to starve them of compute

WarmWash

2 days ago

Probably why it's selling them TPUs.

riku_iki

2 days ago

> is new datasets + post-training shaping the model's behavior (instruction + preference tuning). There is no moat besides that.

sure, but acquiring/generating/creating/curating so much high quality data is still significant moat.

tombert

2 days ago

Actually kind of excited for this. I've been using 5.2 for awhile now, and it's already pretty impressive if you set the context window to "high".

2 days ago

Using opus 4.6 in claude code right now. It's taking about 5x longer to think things through, if not more.

a day ago

How come that OpenAI and Anthropic both released their models pretty much at the same time? Does anyone know if the timing is coincidental?

phil917

18 hours ago

I would bet to be ready before the Superbowl ads

energy123

a day ago

First impression: It's much faster for the same task.

When they hook it up to Cerebras it's going to be a head-exploding moment.

gallerdude

2 days ago

Both Opus 4.6 and GPT-5.3 one shot a Gameboy emulator for me. Guess I need a better benchmark.

paxys

2 days ago

As coding agents get "good enough" the next differentiator will be which one can complete a task in fewer tokens.

tgtweak

2 days ago

Or quicker, or more comprehensively for the same price.

nlh

2 days ago

uh_uh

2 days ago

How so?

Philpax

2 days ago

They're on shaky ground right now https://arstechnica.com/information-technology/2026/02/five-...

kingstnap

2 days ago

Its kind of a suck up that more or less confirms the beef stories that were floating around this past week.

dajonker

2 days ago

There was never a $100 billion deal. Only a letter of intent which doesn't mean anything contractually.

esafak

2 days ago

> OpenAI staff reportedly attributed some of Codex’s performance limitations to Nvidia’s GPU-based hardware.

They should design their own hardware, then. Somehow the other companies seem to be able to produce fast-enough models.

sumedh

a day ago

> They should design their own hardware

They made a deal with Cerebras for fast inference.

ffitch

2 days ago

> our team was blown away > by how much Codex was able > to accelerate its own development

2 days ago

I think models are smart enough for most of the stuff, these little incremental changes barely matter now. What I want is the model that is fast.

energy123

a day ago

I predict a bifurcation in usage.

Serial usecases ("fix this syntax errors") will go on Cerebras and get 10x faster.

a day ago

Page me when codex can run the right version of node. Are we all changing the system node version to match the current project again?

[shell_environment_policy]

inherit = "all"

experimental_use_profile = true

[shell_environment_policy.set]

NVM_DIR = "[redacted]"

PATH = "[redacted]"

Sn0wCoder

13 hours ago

If you are already using Volta in your project Codex will use the correct version assuming you are running in the same directory as your .json file and the json file has the” volta”:{ “node”: “xx.x.x”, “npm”: “xx.x.x”} configured. Personally use a Dockerfile to setup the container with volta installed. Need to set up Volta and configure at least one version of Node then install Codex in the docker. One caveat is you need to update codex with the initial version of node assuming it’s not the same as your project. If you are using one image per project you should never run into this but I have been using one image and firing up a container for each project, so it was great to see Codex able to use the correct version configured for the project via Volta.

From other comments sounds like Codex using mise for internal tools can cause issues but not sure that is 100% Codex fault if the project is not already defining the node/npm version in the json “engines” entry. If it’s ignoring that entry then I guess this is a valid complaint, but not sure how Codex is supposed to guess which version of tools to use for different projects.

Would you mind adding more details as to the exact setup where Codex is using the wrong version?

cheriot

9 hours ago

Codex is using a login shell so moving my PATH setup to .zprofile fixed it (previously was in .zshrc). Now we just need to write this on the internet enough times that future codex can suggest the fix :p

maxkfranz

21 hours ago

It worked for me after I configured mise. I needed the mise setup in both `.zprofile` and `.zshrc` for Codex to pick it up. I think mise sets up itself in one of those by default, but Codex uses the other. I expect the same problem would present itself with nvm.

I.e. `eval "$(/Users/max/.local/bin/mise activate zsh)"` in `.zprofile` and `.zshrc`

Then Codex will respect whatever node you've set as default, e.g.:

    mise install node@24
    mise use -g node@24

Codex might respect your project-local `.nvmrc` or `mise.toml` with this setup, but I'm not certain. I was just happy to get Codex to not use a version of node installed by brew (as a dependency of some other package).

cheriot

9 hours ago

Thanks! I moved my PATH setup to .zprofile and everything works now. Brew had added itself to .zprofile and everything else was in .zshrc.

maxkfranz

6 hours ago

Glad it worked out. And I agree it’s annoying that this doesn’t just work out of the box. It’s not like node/nvm are uncommon, so you’d think they would have ran into the issue when using their own tool.

smarx007

a day ago

Both Claude and Gemini (the web variants, not CLI) tried to downgrade my .NET 10 projects to .NET 9 at least a few times.

dawidg81

2 days ago

May AI not write the code for me.

May I at least understand what it has "written". AI help is good but don't replace real programmers completely. I'm enough copy pasting code i don't understand. What if one day AI will fall down and there will be no real programmers to write the software. AI for help is good but I don't want AI to write whole files into my project. Then something may broke and I won't know what's broken. I've experienced it many times already. Told the AI to write something for me. The code was not working at all. It was compiling normally but the program was bugged. Or when I was making some bigger project with ChatGPT only, it was mostly working but after a longer time when I was promting more and more things, everything got broken.

katspaugh

2 days ago

Honest question: have you tried evolving your code architecture when adding features instead of just "promting more and more things"?

dawidg81

2 days ago

I've tried that too but it was almost the same, chatgpt kept forgetting many things about the code and project structure. In summary AI can get problematic for me and i get with troubles with it. This is like one of the reasons why I still prefer traditional text editor for writing code like Vim over a "software on steroids" like VS Code and things like that...

pixl97

2 days ago

> What if one day AI will fall down and there will be no real programmers to write the software.

brikym

a day ago

This is the thread for GPT 5.3

binsquare

2 days ago

At first try it solved a problem that 5.2 couldn't previously.

Seems to be slower/thinks longer.

koolala

2 days ago

I want to recompile a Rust project to be f32 instead of f64.

Am I better off buying 1 month of Codex, Claude, or Antigravity?

I want to have the agent continuesly recompile and fix compile errors on loop until all the bugs from switching to f32 are gone.

int_19h

a day ago

Antigravity (and Gemini in general) is not on par with the rest when it comes to agentic coding.

Between Codex and Claude, Codex will have much more generous limits for the same price, especially if you use top-of-the-line models (although for your task, Sonnet might actually be good enough).

azuanrb

a day ago

Codex by a mile. Also, there's double rate limit until April. So you're paying 1 month for 2 months usage.

TuxSH

2 days ago

If I'm not mistaken Codex is free until April 2nd with the previous generous rate limits (while paying customers get 2x).

vatsachak

2 days ago

Literally just find and replace

koolala

2 days ago

find and replace is step 1 that generates all the compile errors I want it to loop through

I'm wanting to do it on an entire programming language made in rust: https://github.com/uiua-lang/uiua

Because there are no float32 array languages in existence today

xyzsparetimexyz

2 days ago

Why do you want a float32 array language? Anyway the free glm4.6 model that is opencode defaults to should be fine. Why pay for something to do this.

aavci

21 hours ago

“our team was blown away by how much Codex was able to accelerate its own development.”

At what point will LLMs be autonomously self creating new versions of themselves?

jdthedisciple

2 days ago

Gotta love how the game demo's page title is "threejs" – I guess the point was to demo its vibe-coding abilities anyway, but yea..

sidgarimella

2 days ago

Many are saying codex is more interactive but ironically I think that very interactivity/determinism works best when using codex remotely as a cloud agent and in highly async cases. Conversely I find opus great locally, where I can ram messages into it to try to lever its autonomy best (and interrupt/clean up)

tyfon

2 days ago

I'm having a hard time parsing the openai website.

Anyone know if it is possible to use this model with opencode with the plus subscription?

mbil

regularfry

2 days ago

I've tried opus 4.5 in opencode via the GitHub Copilot API, mostly to see if it works all. I don't think that broke any terms of service? But also I haven't checked how much more expensive I made it for myself over just calling them directly.

rs_rs_rs_rs_rs

2 days ago

speedgoose

a day ago

I don’t know, you ask above whether it can do svelte now.

davidmurdoch

2 days ago

5.2 was already very good with svelte 5, at least when you have the svelte MCP server set up.

bg24

2 days ago

I am on a max subscription for Claude, and hate the fact that OpenAI have not figured out that $20 => $200 is a big jump. Good luck to them. In terms of model, just last night, Codex 5.2 solved a problem for me which other models were going round and round. Almost same instructions. That said, I still plan to be on $100 Claude (overall value across many tasks, ability to create docs, co-work), and may bump up OpenAI subscription to the next tier should they decide to introduce one. Not going to $200 even with 5.3, unless my company pays for it.

aerhardt

2 days ago

I have a pre-paid account directly with OpenAI that has credits, but if I use that key with the Codex CLI, it can't access 5.3 either.

Google Gemini definitely has structured output.

roflcopter69

a day ago

Not so fast! Check this out https://github.com/googleapis/python-genai/issues/460

In my experience, you can only use Gemini structured outputs for the most trivial of schemas. No integer literals, no discriminated unions and many more paper cuts. So at least for me, it was completely unusable for what I do at work.

jiggawatts

16 hours ago

That's the level of coding I expect from a bunch of Python-only ML computer scientists, but still... wow.

On the upside, they seem to have fixed it: https://blog.google/innovation-and-ai/technology/developers-...

wahnfrieden

2 days ago

Can you elaborate what you mean - OAI structured outputs means JSON schema doesn't it? So are you just saying they both support JSON schema but Anthropic has a limitation?

koakuma-chan

2 days ago

OpenAI, in addition to JSON schema, supports "context-free grammar"[0], i.e. regex and lark. Anthropic also supports JSON schema since a few weeks ago, but they don't support specifying the length of JSON array, so you still have to worry about the model producing invalid output.

[0]: https://platform.openai.com/docs/guides/function-calling#con...

One thing that pisses me off is this widespread misunderstanding that you can just fall back to function calling (Anthropic's function calling accepts JSON schema for arguments), and that it's the same as structured outputs. It is not. They just dump the JSON schema into the context without doing the actual structured outputs. Vercel's AI SDK does that and it pisses me off because doing that only confuses the model and prefilling works much better.

OutOfHere

2 days ago

They both are doing this to each other.

BTW, loser is spelled with a single o.

wahnfrieden

2 days ago

You could also claim that Anthropic is trying to scoop OpenAI by launching minutes earlier, as OpenAI has done with Google in the past.

For downvoters, you must be naive to think these companies are not surveilling each other through various means.

fHr

2 days ago

lol cope harder