candiddevmike
an hour ago
None of these tools perform particularly well, and all lack the context to provide a meaningful review beyond what a linter would find, IMO. The SOTA isn't capable of using a code diff as a jumping-off point.
Also the system prompts for some of them are kinda funny in a hopelessly naive aspirational way. We should all aspire to live and breathe the code review system prompt on a daily basis.
shagie
26 minutes ago
In some code that I was working on, I had
// stuff
obj.setSomeData(something);
// fifteen lines of other code
obj.setSomeData(something);
// more stuff
The 'something' was a little bit more complex, but it was the same something with slightly different formatting. My linter didn't catch the repeated call. When I asked the AI chat for a review of the code changes, it correctly flagged the repeat.
It also caught a repeated call in
List<Objs> objs = someList.stream().filter(o -> o.field.isPresent()).toList();
// ...
var something = someFunc(objs);

Thingy someFunc(List<Objs> param) {
    return param.stream().filter(o -> o.field.isPresent()). ...
Where one of the filter calls is unnecessary... and it caught that across a call boundary. So, I'd say AI code reviews are better than a linter. There are still things it fusses about because it doesn't know the full context of the application, the tables that make certain guarantees about the data, or the team's code conventions (in particular the use of internal terms within naming conventions).
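For the curious, a minimal self-contained version of that second case; Obj, Thingy, and someFunc are placeholders, and the real types were more complex:
import java.util.List;
import java.util.Optional;

class RepeatFilterExample {
    record Obj(Optional<String> field) {}
    record Thingy(List<Obj> objs) {}

    // The second filter repeats work: the caller already passes a pre-filtered list.
    static Thingy someFunc(List<Obj> param) {
        return new Thingy(param.stream().filter(o -> o.field().isPresent()).toList());
    }

    public static void main(String[] args) {
        List<Obj> someList = List.of(new Obj(Optional.of("a")), new Obj(Optional.empty()));
        // First filter on the predicate...
        List<Obj> objs = someList.stream().filter(o -> o.field().isPresent()).toList();
        // ...and someFunc applies the same predicate again, across the call boundary.
        var something = someFunc(objs);
        System.out.println(something.objs().size()); // prints 1
    }
}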
dakshgupta
42 minutes ago
I agree that none perform _super_ well.
I would argue they go far beyond linters now, which was perhaps not true even nine months ago.
To the degree you consider this to be evidence: in the last 7 days, PR authors have replied to a Greptile comment with "great catch", "good catch", etc. 9,078 times.
blibble
3 minutes ago
> To the degree you consider this to be evidence: in the last 7 days, PR authors have replied to a Greptile comment with "great catch", "good catch", etc. 9,078 times.
do you have a bot to do this too?
tadfisher
30 minutes ago
Not trying to sidetrack, but a figure like that is data, not evidence. At the very minimum you need context that allows for interpretation; 9,078 positive author comments would be less impressive if Greptile made 1,000,000 comments in that time period, for example.
onedognight
33 minutes ago
I fully agree. Claude’s review comments have been about 50% useful, which is great. For comparison, I have almost never found a useful TeamScale comment (a classic static analyzer). Even more important, half of Claude’s good finds are orthogonal to those from the human reviewers on our team, i.e. it consistently points out things human reviewers miss, and vice versa.
boredtofears
16 minutes ago
That sounds more like confirmation that Greptile is being included in a lot of agentic coding loops than anything.
written-beyond
25 minutes ago
I mean, how far Rust's own clippy lints went before any LLMs existed was actually insane.
Clippy + Rust's type system would basically ensure my software worked as close as possible to my spec before the first run. LLMs have greatly lowered the bar for bringing clippy-quality linting to every language, but at the cost of determinism.
athrowaway3z
18 minutes ago
> The SOTA isn't capable of using a code diff as a jumping-off point.
Not as a jumping-off point, but I'm having pretty great results on a complicated fork of a big project: `git diff main..fork > main.diff`, then I load in the specs I keep and tell it to review the diff in chunks while updating a ./review.md.
It's solving a problem I created myself by not reviewing some commits well enough, but it's surprisingly effective at picking up interactions spread out over multiple commits that might have slipped through regardless.
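Concretely, the loop is something like this (the chunk size and the prompt wording are just how I'd sketch it, not exact commands):
git diff main..fork > main.diff
# break the diff into pieces that fit comfortably in context (size is arbitrary)
split -l 500 main.diff chunk_
# then, for each chunk, roughly this prompt:
#   "Review chunk_aa against the specs in ./specs; append findings to ./review.md"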
vimda
27 minutes ago
Anecdotally, Claude Bug Bot has actually been super impressive at understanding non-trivial changes. Like, today, it noted a race condition in a ~1000 line Go change that `go test -race` didn't pick up (which makes sense: the race detector only flags races the tests actually exercise). There are definitely issues, though. For one, it's non-deterministic, so you end up with half a dozen commits, with each run noting different issues. For another, it tends to be quite in favour of premature optimisation. But overall, well worth it in my experience.