candiddevmike
an hour ago
None of these tools perform particularly well, and all lack the context to provide a meaningful review beyond what a linter would find, IMO. The SOTA isn't capable of using a code diff as a jumping-off point.
Also the system prompts for some of them are kinda funny in a hopelessly naive aspirational way. We should all aspire to live and breathe the code review system prompt on a daily basis.
shagie
26 minutes ago
In some code that I was working on, I had
// stuff
obj.setSomeData(something);
// fifteen lines of other code
obj.setSomeData(something);
// more stuff
The 'something' was a little bit more complex, but it was the same something with slightly different formatting. My linter didn't catch the repeated call. When I asked the AI chat for a review of the code changes, it correctly flagged the repeat.
It also caught a repeated call in
List<Objs> objs = someList.stream().filter(o -> o.field.isPresent()).toList();
// ...
var something = someFunc(objs);

Thingy someFunc(List<Objs> param) {
    return param.stream().filter(o -> o.field.isPresent()). ...
Where one of the filter calls is unnecessary... and it caught that across a call boundary. So, I'd say AI code reviews are better than a linter. There are still things it fusses about because it doesn't know the full context of the application, the tables that make certain guarantees about the data, or the team's code conventions (in particular the use of internal terms within naming conventions).
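For the curious, a minimal self-contained version of that second case; Obj, Thingy, and someFunc are placeholders, and the real types were more complex:
import java.util.List;
import java.util.Optional;

class RepeatFilterExample {
    record Obj(Optional<String> field) {}
    record Thingy(List<Obj> objs) {}

    // The second filter repeats work: the caller already passes a pre-filtered list.
    static Thingy someFunc(List<Obj> param) {
        return new Thingy(param.stream().filter(o -> o.field().isPresent()).toList());
    }

    public static void main(String[] args) {
        List<Obj> someList = List.of(new Obj(Optional.of("a")), new Obj(Optional.empty()));
        // First filter on the predicate...
        List<Obj> objs = someList.stream().filter(o -> o.field().isPresent()).toList();
        // ...and someFunc applies the same predicate again, across the call boundary.
        var something = someFunc(objs);
        System.out.println(something.objs().size()); // prints 1
    }
}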
dakshgupta
42 minutes ago
I agree that none perform _super_ well.
I would argue they go far beyond linters now, which was perhaps not true even nine months ago.
To the degree you consider this to be evidence: in the last 7 days, PR authors have replied to a Greptile comment with "great catch", "good catch", etc. 9,078 times.
blibble
3 minutes ago
> To the degree you consider this to be evidence: in the last 7 days, PR authors have replied to a Greptile comment with "great catch", "good catch", etc. 9,078 times.
do you have a bot to do this too?
tadfisher
30 minutes ago
Not trying to sidetrack, but a figure like that is data, not evidence. At the very minimum you need context that allows for interpretation; 9,078 positive author comments would be less impressive if Greptile made 1,000,000 comments in that time period, for example.
onedognight
33 minutes ago
I fully agree. Claude’s review comments have been about 50% useful, which is great. For comparison, I have almost never found a useful TeamScale comment (a classic static analyzer). Even more important, half of Claude’s good finds are orthogonal to those from the human reviewers on our team, i.e. it consistently points out things human reviewers miss, and vice versa.
boredtofears
16 minutes ago
That sounds more like confirmation that Greptile is being included in a lot of agentic coding loops than anything.
written-beyond
25 minutes ago
I mean, how far Rust's own clippy lints went before any LLMs existed was actually insane.
Clippy + Rust's type system would basically ensure my software worked as close as possible to my spec before the first run. LLMs have greatly lowered the bar for bringing clippy-quality linting to every language, but at the cost of determinism.
athrowaway3z
18 minutes ago
> The SOTA isn't capable of using a code diff as a jumping-off point.
Not as a jumping-off point, but I'm having pretty great results on a complicated fork of a big project: `git diff main..fork > main.diff`, then I load in the specs I keep and tell it to review the diff in chunks while updating a ./review.md.
It's solving a problem I created myself by not reviewing some commits well enough, but it's surprisingly effective at picking up interactions spread out over multiple commits that might have slipped through regardless.
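Concretely, the loop is something like this (the chunk size and the prompt wording are just how I'd sketch it, not exact commands):
git diff main..fork > main.diff
# break the diff into pieces that fit comfortably in context (size is arbitrary)
split -l 500 main.diff chunk_
# then, for each chunk, roughly this prompt:
#   "Review chunk_aa against the specs in ./specs; append findings to ./review.md"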
vimda
27 minutes ago
Anecdotally, Claude Bug Bot has actually been super impressive at understanding non-trivial changes. Like, today, it noted a race condition in a ~1000 line Go change that `go test -race` didn't pick up (which makes sense: the race detector only flags races the tests actually exercise). There are definitely issues, though. For one, it's non-deterministic, so you end up with half a dozen commits, with each run noting different issues. For another, it tends to be quite in favour of premature optimisation. But overall, well worth it in my experience.