There is an AI code review bubble

46 points, posted 4 hours ago
by dakshgupta

31 Comments

candiddevmike

an hour ago

None of these tools perform particularly well and all lack context to actually provide a meaningful review beyond what a linter would find, IMO. The SOTA isn't capable of using a code diff as a jumping off point.

Also the system prompts for some of them are kinda funny in a hopelessly naive aspirational way. We should all aspire to live and breathe the code review system prompt on a daily basis.

shagie

26 minutes ago

In some code that I was working on, I had

    // stuff
    obj.setSomeData(something);
    // fifteen lines of other code
    obj.setSomeData(something);
    // more stuff
The 'something' was a little bit more complex, but it was the same something with slightly different formatting.

My linter didn't catch the repeat call. When I asked the AI chat for a review of the code changes, it correctly flagged the repeat call.

It also caught a repeat call in

    List<Objs> objs = someList.stream().filter(o -> o.field.isPresent()).toList();
    
    // ...

    var something = someFunc(objs);

    Thingy someFunc(List<Objs> param) {
        return param.stream().filter(o -> o.field.isPresent()). ...
Where one of the filter calls is unnecessary... and it caught that across a call boundary.

So, I'd say that AI code reviews are better than a linter. There are still things it fusses about because it doesn't know the full context of the application, the tables that make certain guarantees about the data, or the team's code conventions (in particular the use of internal terms within naming conventions).

dakshgupta

42 minutes ago

I agree that none perform _super_ well.

I would argue they go far beyond linters now, which was perhaps not true even nine months ago.

To the degree you consider this to be evidence: in the last 7 days, PR authors have replied to Greptile comments with "great catch", "good catch", etc. 9,078 times.

blibble

3 minutes ago

> To the degree you consider this to be evidence: in the last 7 days, PR authors have replied to Greptile comments with "great catch", "good catch", etc. 9,078 times.

do you have a bot to do this too?

tadfisher

30 minutes ago

Not trying to sidetrack, but a figure like that is data, not evidence. At the very minimum you need context which allows for interpretation; 9,078 positive author comments would be less impressive if Greptile made 1,000,000 comments in that time period, for example.

onedognight

33 minutes ago

I fully agree. Claude’s review comments have been 50% useful, which is great. For comparison, I have almost never found a useful TeamScale comment (classic static analyzer). Even more important, half of Claude’s good finds are orthogonal to those found by the other human reviewers on our team. I.e., it consistently points out things human reviewers miss, and vice versa.

boredtofears

16 minutes ago

That sounds more like confirmation that Greptile is being included in a lot of agentic coding loops than anything else.

written-beyond

25 minutes ago

I mean, how far Rust's own Clippy linter got before any LLMs existed was actually insane.

Clippy + Rust's type system would basically ensure my software was working as close as possible to my spec before the first run. LLMs have greatly lowered the bar for bringing Clippy-quality linting to every language, but at the cost of determinism.

athrowaway3z

18 minutes ago

> The SOTA isn't capable of using a code diff as a jumping off point.

Not a jumping off point, but I'm having pretty great results on a complicated fork of a big project with `git diff main..fork > main.diff`, then loading in the specs I keep and telling it to review the diff in chunks while updating a ./review.md.

It's solving a problem I created myself by not reviewing some commits well enough, but it's surprisingly effective at picking up interactions spread out over multiple commits that might have slipped through regardless.
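
The chunking step can be as mechanical as splitting the diff per file before handing each piece (plus the specs) to the model. A minimal Python sketch of one way to do it, assuming the `main.diff` produced by the command above; the chunk directory and file naming are just placeholders, not part of any particular tool:

    # split_diff.py: split a big diff into per-file chunks for piecemeal review
    from pathlib import Path

    def split_diff(diff_path: str = "main.diff", out_dir: str = "review_chunks") -> list[Path]:
        """Write one chunk per "diff --git" section and return the chunk paths."""
        text = Path(diff_path).read_text()
        # Each file's changes begin with a "diff --git" header line.
        sections = ["diff --git" + part for part in text.split("diff --git")[1:]]
        out = Path(out_dir)
        out.mkdir(exist_ok=True)
        chunks = []
        for i, section in enumerate(sections):
            chunk = out / f"chunk_{i:03d}.diff"
            chunk.write_text(section)
            chunks.append(chunk)
        return chunks

    if __name__ == "__main__":
        for chunk in split_diff():
            # Review each chunk in turn, appending findings to ./review.md
            print(chunk)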

vimda

27 minutes ago

Anecdotally, Claude Bug Bot has actually been super impressive in understanding non-trivial changes. Like, today, it noted a race condition in a ~1000-line Go change that `go test -race` didn't pick up. There are definitely issues though. For one, it's non-deterministic, so you end up with half a dozen commits, with each run noting different issues. For another, it tends to be quite in favour of premature optimisation. But overall, well worth it in my experience.

geooff_

38 minutes ago

This article has a catchy headline, but there's really no content to it. This is content marketing without content. It seems like every week on Hacker News, there's a dozen of these. All seemingly code reviewers, too. Keep it to LinkedIn.

MichaelRo

22 minutes ago

Also lots of clueless Indians. Like some post I saw recently on a traders board where the guys vibecoded a HFT system and were advertising with the hope of selling it. Posted "10 nanoseconds latency!". When people pointed out the absurdity of their claim, they quickly corrected "sorry, vibe coding mistake, 100 microseconds". Last time I checked their website, it settled for 10us, because that's what "super performant professional system" means. Ridiculous.

ahmadyan

an hour ago

The problem with code review is that it's quite straightforward to just prompt it, and the frontier models, whether Opus or GPT5.2Codex, do a great job at code reviews. I don't need a second subscription or API call when the one I already have works well out of the box, and I can focus on integration.

In our case, agentastic.dev, we just baked code review right into our IDE. It packages the diff for the agent, with some prompt, and sends it out to different agents of your choice (whether Claude or Codex) in parallel. The reason our users like it so much is that they don't need to pay extra for code review anymore. Hard to beat a free add-on, and the cherry on top is that you don't need to read freaking poems.
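
The fan-out itself is not much code. Here's a rough Python sketch of the general idea, not our actual implementation; `claude-review` and `codex-review` are hypothetical stand-ins for whatever agent CLIs you have installed:

    import asyncio
    import subprocess

    REVIEW_PROMPT = "Review this diff for bugs, races, and unnecessary work:\n\n"

    def run_agent(command: list[str], diff_text: str) -> str:
        # Feed the prompt plus the diff to the agent CLI on stdin and capture its review.
        result = subprocess.run(command, input=REVIEW_PROMPT + diff_text,
                                capture_output=True, text=True)
        return result.stdout

    async def review_in_parallel(diff_text: str) -> dict[str, str]:
        # Hypothetical commands; swap in the agent CLIs you actually use.
        agents = {"claude": ["claude-review"], "codex": ["codex-review"]}
        loop = asyncio.get_running_loop()
        futures = {name: loop.run_in_executor(None, run_agent, cmd, diff_text)
                   for name, cmd in agents.items()}
        return {name: await fut for name, fut in futures.items()}

    if __name__ == "__main__":
        diff = subprocess.run(["git", "diff", "HEAD"],
                              capture_output=True, text=True).stdout
        for name, review in asyncio.run(review_in_parallel(diff)).items():
            print(f"--- {name} ---\n{review}")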

personjerry

an hour ago

I don't really understand how this differentiates against the competition.

> Independence

Any "agent" running against code review instead of code generation is "independent"?

> Autonomy

Most other code review tools can also be automated and integrated.

> Loops

You can also ping other code review tools for more reviews...

I feel like this article actually works against you by presenting the problems and inadequately solving them.

dakshgupta

an hour ago

> Independence

It is, but when the model/harness/tools/system prompts are the same or similar, the generator and reviewer fail in similar ways. Question: Would you trust a Cursor review of Claude-written code more, less, or the same as a Cursor review of Cursor-written code?

> Autonomy

Plenty of tools have invested heavily in AI-assisted review - creating great UIs to help human reviewers understand and check diffs. Our view is that code validation will be completely autonomous in the medium term, and so our system is designed to make all human intervention optional. This is possibly an unpopular opinion, and we respect the camp that might say people will always review AI-generated code. It's just not the future we want for this profession, nor the one we predict.

> Loops

You can invest in UX and tooling that makes this easier or harder. Our first step towards making this easier is a native Claude Code plugin in the `/plugins` command that lets Claude Code run a plan, write, commit, get review comments, plan, write loop.

sdenton4

36 minutes ago

Independence is ridiculous: the underlying LLMs are too similar in their training data and methodologies to be anything like independent. Trying different models may somewhat reduce the dependency, but they have all read Stack Overflow, Reddit, and GitHub in their training.

It might be an interesting time to double down on automatically building and checking deterministic models of code which were previously too much of a pain to bother with, e.g. adding type checking to lazy Python code. These kinds of checks really are model-independent, and using agents to build and manage them might bring a lot of value.
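
Concretely, a couple of annotations turn a latent runtime bug into something any type checker flags on every run. A tiny made-up Python example (the function and the bug are invented for illustration):

    # Lazy and untyped: passing discount=None only blows up at runtime, and only
    # if some test happens to exercise that path.
    def total_cents_lazy(prices, discount=None):
        return sum(prices) - discount

    # Annotated: a checker like mypy now reports the unguarded subtraction as an
    # error every time, without running anything and without any model involved.
    def total_cents_typed(prices: list[int], discount: int | None = None) -> int:
        return sum(prices) - discount

    # Fixed: the explicit guard satisfies both the checker and the original intent.
    def total_cents(prices: list[int], discount: int | None = None) -> int:
        return sum(prices) - (discount or 0)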

liamconnell

40 minutes ago

> It is, but when the model/harness/tools/system prompts are the same or similar, the generator and reviewer fail in similar ways.

Is there empirical evidence for that? Where does it sit on an epistemic meter between (1) “it sounds good when I say it” and (10) “someone ran an evaluation and got significant support”?

“Vibes” (2-3 on the scale) are OK, just honestly curious.

pawelduda

18 minutes ago

Good code reviews are part of a team's culture, and it's hard to just patch that with an agent. With millions of tools, it will be an arms race over which one is louder about as many things as possible, because:

- it will have a higher chance of convincing the author that an issue was important by throwing more darts, something a human wouldn't do because it takes real mental effort to go through an authentic review,

- it will sometimes find a real, big issue, which reinforces the bias that it's useful,

- there will always be a tendency towards more feedback (not higher quality), because if it's too silent, is it even doing anything?

So I believe it will just add more rounds of back-and-forth prompting between more people, and I'm not sure that's a net positive.

Plus, PRs are a good reality check on whether your code makes sense, when another person reviews it: a final safeguard before a maintainability miss, or a disaster waiting to be deployed.

TuringTest

22 minutes ago

>A human rubber-stamping code being validated by a super intelligent machine is the equivalent of a human sitting silently in the driver's seat of a self-driving car, "supervising".

So, absolutely necessary and essential?

In order to get the machine out of trouble when the unavoidable strange situation happens that didn't appear during training, and requires some judgement based on ethics or logical reasoning. For that case, you need a human in charge.

taude

23 minutes ago

It's not terribly hard to write a Copilot GHA that does this yourself for your specific team's needs. Not sure why you'd need to bring a vendor on for this....

What do the vendors provide?

I looked at a couple that were pretty snazzy at first glance, but now that I know more about how Copilot agents work and such, I'm pretty sure that in a few hours I could have a foundation for my team to build on that would take care of a lot of our PR review needs....

jackconsidine

22 minutes ago

> Only once would you have X write a PR, then have X approve and merge it to realize the absurdity of what you just did.

I get the idea. I'll still throw out that having a single X go through the full workflow could still be useful in that there's an audit log, undo features (reverting a PR), notifications, what have you. It's not equivalent to "human writes ticket, code deployed live" for that reason.

pnathan

13 minutes ago

Claude Code's code review is _sufficient_ imo.

Still need HITL, but the human is shifted right and can do other things rather than grinding through fiddly details.

quanwinn

an hour ago

I liked that the post is self-aware that it's promoting its own product. But the writing seemed more focused on the philosophy behind code reviews and the impact of AI, and less on the mechanics of how Greptile differs from competitors. I was hoping to see more of the latter.

sastraxi

an hour ago

Contrary to some of the other anecdotes in this thread, I've found automated code review to discover some tricky stuff that humans missed. We use https://www.cubic.dev/

aurareturn

31 minutes ago

Before I push any code, I always ask 2 different frontier LLMs to review the changes for any potential issues. Saved my ass a few times before pushing to production.

pomarie

38 minutes ago

Founder of cubic here, thanks for the shoutout!

trjordan

an hour ago

1. I absolutely agree there's a bubble. Everybody is shipping a code review agent.

2. What on earth is this defense of their product? I could see so many arguments for why their code reviewer is the best, and this contains none of them.

More broadly, though, if you've gotten to the point where you're relying on AI code review to catch bugs, you've lost the plot.

The point of a PR is to share knowledge and to catch structural gaps. Bug-finding is a bonus. Catching bugs, automated self-review, structuring your code to be sensible: that's _your_ job. Write the code to be as sensible as possible, either by yourself or with an AI. Get the review because you work on a team, not in a vacuum.

dakshgupta

an hour ago

2. There is plenty of evidence for this elsewhere on the site, and we do encourage people to try it because like with a lot of AI tools, YMMV.

You're totally right that PR reviews go a lot farther than catching issues and enforcing standards. Knowledge sharing is a very important part of it. However, there are processes you can create to enable better knowledge sharing and let AI handle the issue-catching (maybe not fully yet, but in time). Blocking code from merging because knowledge isn't shared yet seems unnecessary.

ahmadyan

39 minutes ago

> 2. What on earth is this defense of their product?

I think the distribution channel is the only defensive moat for low-to-mid-complexity, fast-to-implement features like code-review agents. So in the case of Linear and Cursor Bugbot it makes a lot of sense. I wonder when GitHub/GitLab/Atlassian or Xcode will release their own review agent.

lenerdenator

29 minutes ago

> More broadly, though, if you've gotten to the point where you're relying on AI code review to catch bugs, you've lost the plot.

> The point of a PR is to share knowledge and to catch structural gaps.

Well, it was to share knowledge and to catch structural gaps.

Now you have an idea, for better or for worse, that software needs to be developed AI-first. That's great for the creation of new code but as we all know, it's almost guaranteed that you'll get some bad output from the AI that you used to generate the code, and since it can generate code very fast, you have a lot of it to go through, especially if you're working on a monorepo that wasn't architected particularly well when it was written years ago.

PRs seem like an almost natural place to do this. The alternative is the industry finding a more appropriate place to do this sort of thing in the SDLC, which is gonna take time, seeing as how agentic loop software development is so new.