I've been paying for the $20/month plans from Anthropic, Google, and OpenAI for the past few months (to evaluate which one I want to keep, and to have a backup for outages and overages).
Gemini never goes down, OpenAI used to go down once in a while but is much more stable now, and Anthropic almost never goes a full week without throwing an error message or suffering downtime. It's a shame because I generally prefer Claude to the others.
Maybe they have vibe-coded their own stack!
But less tongue-in-cheek: yeah, Anthropic definitely has reliability issues. It might be a side effect of trying to move fast to stay ahead of competitors.
They have. Claude Code was their internal dev tool, and it shows.
And yet even dogfooding their own product heavily, it's still a giant janky pile. The prompt work is solid, the focus on optimizing tools was a good insight, and the model makes a good agent, but the actual Claude Code software is pretty shameful for the most visible product of a billion-dollar company.
What artifact are you evaluating to come to this conclusion? Is the implementation available?
A. I use it daily to take advantage of the plan inference discount.
B. Let's just say I didn't write the most robust JavaScript decompilation/de-minification engine in existence solely as an academic exercise :)
The tongue-in-cheek jokes are kind of obvious, but even without the snark I think it's worth asking why the supposed 100x productivity boost from Claude Code that I keep hearing about hasn't actually resulted in reliability improvements, even from developers who presumably have effectively unlimited token budgets to spend on hardening their stack.
I love how people like Simon Willison and Pete Steinberger spend all this effort trying to be skeptical of their own experiences and arrive at nuanced takes like “50% more productive, but that’s actually a pretty big deal, but the nature of the increase is complicated” and y’all just keep repeating the brainrotted “100x, juniors are cooked” quote you heard someone say on LinkedIn.
AI gives you what you ask for. If you don't understand your true problems, and you ask it to solve the wrong problems, it doesn't matter how much compute you burn, you're still gonna fail.
All the AI labs are, but Anthropic is the worst. Anyone serious about running Claude in prod is using Bedrock or Vertex. We've been pretty happy with Vertex.
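For anyone wondering what the switch actually involves: the official anthropic Python SDK ships a Vertex client, so moving off the first-party API is mostly a constructor swap. A minimal sketch, assuming `pip install "anthropic[vertex]"`, a GCP project with the Claude models enabled on Vertex AI, and application-default credentials already configured; the project ID, region, and model version below are placeholders:

```python
# Sketch: calling Claude through Vertex AI instead of Anthropic's own API.
# Assumes application-default credentials are set up, e.g. via
# `gcloud auth application-default login`.
from anthropic import AnthropicVertex

client = AnthropicVertex(
    project_id="my-gcp-project",  # placeholder GCP project ID
    region="us-east5",            # pick a region where Claude is offered
)

message = client.messages.create(
    model="claude-3-5-sonnet-v2@20241022",  # Vertex uses @-versioned model IDs
    max_tokens=1024,
    messages=[{"role": "user", "content": "ping"}],
)
print(message.content[0].text)
```

From what I've seen, the Bedrock route is the same shape via the SDK's AnthropicBedrock client, just authenticated with AWS credentials instead of GCP ones.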
I wonder why they haven't invested a lot more in the inference stack. Is serving Claude really that different from serving Google's and OpenAI's models, or the open-weight ones?