1.3x when working on a large, janky codebase which I am very familiar with, though that average is very unevenly distributed:
- Writing new code, it's probably 3x or so[1].
- Writing automated tests for reproducible bugs, it's probably 2x or so.
- Fixing those bugs: I try every so often, but it still seems to be a net negative even for Opus 4.5, so call it 0.95x because I mostly just do it myself.
- Figuring out how to reproduce, in a controlled environment, an undesired behavior that was observed in the wild is still a net negative - call it 0.8x, because I keep being tempted by this siren song[2].
- Code review is hard to put a number on: I can definitely give _better_ reviews now than I could before, but I don't think I spend significantly less time on them. Call it 1.2x.
- Taking some high-level feature request and figuring out which parts of it already exist and are likely to work, which parts should be built, which parts we tried to build 5+ years ago and abandoned due to either issues with the implementation or issues with the idea that only became apparent after we observed actual users using it, and which parts are in tension with other parts of the system: still a net negative. Call it 0.95x, just from the cost of trying again every so often.
- Writing new one-off utility tools for myself and my team: 10x-100x. LLMs are amazing. I can say "I want to see a Gantt chart style breakdown of when jobs in a gitlab pipeline start and finish each step of execution, here's the network log, here's a link to the gitlab api docs, write me a bookmarklet I can click on when I'm viewing a pipeline" and go get coffee and come back and have a bookmarklet[3].
Unfortunately for me, a significant fraction of my tasks are of the form "hey so this weird bug showed up in feature X, and the last employee to work on feature X left 6 years ago, can you figure out what's going on and fix it" or "we want to change Y functionality, what's the level of risk and effort".
-----
[1] This number would be higher, but pre-LLMs I invested quite a bit of effort into tooling to make repetitive boilerplate tasks faster, so that e.g. creating the skeleton of a unit or functional test for a module was 5 keystrokes. There's a large speedup on the tasks that are almost boilerplate but not quite worth writing my own tooling for, counterbalanced by a significant slowdown on the tasks where I already have tooling and muscle memory that the LLM agent doesn't.
[2] This feels like the sort of thing the models should be good at. After all, if I fed in the observed behavior, the relevant logs, and the relevant files, even Sonnet 3.7 was capable of identifying the problem most of the time. The problem is that by the time I've narrowed things down to that level of detail, I usually already know what the issue is.
[3] Ok, it actually took a coffee break plus 3 rounds of debugging over about 30 minutes. Still, it's a very useful little tool and one I probably wouldn't have spent the time building in the before times.
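For the curious, here's a rough sketch of what such a bookmarklet could look like. This is a hypothetical reconstruction, not the actual tool (which parsed the captured network log rather than hitting the API directly): it assumes you're on a GitLab pipeline page, so same-origin session cookies authenticate the /api/v4 jobs endpoint, and it only shows per-job start/finish times, not per-step timing within a job.

    javascript:(async () => {
      // Pull the "group/project" path and pipeline ID out of the current URL.
      const m = location.pathname.match(/^\/(.+)\/-\/pipelines\/(\d+)/);
      if (!m) { alert('Open a GitLab pipeline page first'); return; }
      const project = encodeURIComponent(m[1]);
      // Public jobs API; same-origin cookies handle auth when run from the pipeline page.
      const resp = await fetch(`/api/v4/projects/${project}/pipelines/${m[2]}/jobs?per_page=100`);
      const jobs = (await resp.json()).filter(j => j.started_at && j.finished_at);
      const t0 = Math.min(...jobs.map(j => Date.parse(j.started_at)));
      const t1 = Math.max(...jobs.map(j => Date.parse(j.finished_at)));
      const scale = 60 / (t1 - t0); // 60 characters wide for the whole pipeline duration
      // Crude text "Gantt" view in the console: one padded bar per job.
      for (const j of jobs.sort((a, b) => Date.parse(a.started_at) - Date.parse(b.started_at))) {
        const start = Date.parse(j.started_at), end = Date.parse(j.finished_at);
        const pad = ' '.repeat(Math.round((start - t0) * scale));
        const bar = '#'.repeat(Math.max(1, Math.round((end - start) * scale)));
        console.log(pad + bar + '  ' + j.stage + '/' + j.name);
      }
    })();

It's written out multi-line for readability; to actually install it as a bookmarklet you'd strip the comments and newlines into a single javascript: URL.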