Show HN: Letting LLMs Run a Debugger

144 points, posted 6 days ago
by mohsen1

30 Comments

mohsen1

3 days ago

To people who do this sort of thing:

Does generating synthetic data this way make sense for RL, to train models to be really good at debugging code? Currently LLMs have inhaled all the code in the world, but that data is only the text of the code (maybe plus the changes that fixed bugs, etc.). The amount of insight that can be generated by actually running the code and getting the runtime values, step by step, is almost infinite.
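To make the question concrete, here's a minimal sketch of harvesting step-by-step runtime values in Python with `sys.settrace` (the `capture_trace` helper and the `gcd` target are illustrative, not anyone's actual pipeline):

```python
import sys

def capture_trace(fn, *args):
    """Record (line_number, local_variables) at every executed line of fn."""
    steps = []
    def tracer(frame, event, arg):
        # Only snapshot line events inside the target function's code object.
        if event == "line" and frame.f_code is fn.__code__:
            steps.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer
    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)
    return result, steps

def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

result, trace = capture_trace(gcd, 12, 18)
# Each trace entry pairs a line number with the live variable values there,
# i.e. the step-by-step evolution of a and b.
```

Every execution of any program yields an arbitrarily long stream of such (line, state) pairs, which is the "almost infinite" data source being described.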

Is this sort of data useful for training LLMs?

marxism

3 days ago

Not only that: it measurably improves coding ability too (according to our tests fine-tuning Llama 2).

Back in 2023 we sold a training dataset (~500B tokens, but we could have generated more) with exactly this kind of data. The dataset was a bunch of ~1-3KB text-file examples: a code snippet, then some variable values observed when the program was at line X, then a prompt asking the LLM to predict the output of some more print statements at line Y.
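A toy rendering of one such example might look like the following (the prompt format and `make_example` helper are my guesses at the shape being described, not the actual dataset format):

```python
def make_example(snippet, observed, line_x, to_predict, line_y):
    """Render a code snippet plus observed runtime values at line X,
    asking the model to predict variable values at line Y."""
    shown = "\n".join(f"{k} = {v!r}" for k, v in observed.items())
    return (
        f"{snippet}\n"
        f"# At line {line_x}:\n{shown}\n"
        f"# Predict the values of {', '.join(to_predict)} at line {line_y}:\n"
    )

snippet = "total = 0\nfor n in [1, 2, 3]:\n    total += n"
prompt = make_example(snippet, {"total": 3, "n": 2}, 3, ["total"], 4)
```

The model's completion can then be checked mechanically against the values actually observed at line Y, which is what makes the data cheap to label at scale.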

If you wanted to train on stack traces or control flow graphs we offered that too.

Our MVP used rr to attach to large programs like Excel and Chrome, then used a heuristic to filter for more "interesting" examples. The theory at the time was: sure, LLMs can learn a few information-theoretic bits from many noisy examples, but why not spend a little intelligence on our end and knock out a couple of low-hanging "noise" sources? We used a Prolog engine to find traces where all the information required to predict the final memory address was present in the initial program description. This turned out not to matter too much, because the LLM would learn from non-deterministic examples too (shrug). Eventually we ended up with a monkey-patched Chromium browser wandering around the internet to collect JavaScript examples.

We sold datasets, and we also offered an on-prem agent where you could generate training examples on demand using the spare CPU cycles on GPU nodes.

It seemed like a beautiful idea at the time but sputtered out because every serious outfit we talked to would ask a bunch of questions and seemed to come to the conclusion that they would rather do it in house.

I would bet dollars to donuts that most of the AI scraping load that people are complaining about has nothing to do with grabbing the text content of their forum and more to do with executing their javascript.

Speaking from experience: we didn't really understand V8 internals, so instead of something smarter (isolate snapshots or other fine-grained manipulation of VM state), our training-data-harvesting bot relied on continuous navigation, reloading web pages over and over to drive more execution and take more samples.

Edit: email in my profile if anyone wants to talk about this. I'm feeling a wave of nostalgia thinking about this project again. Despite commercial failure, it was arrestingly/dangerously interesting.

silveraxe93

3 days ago

I'd be extremely surprised if AI labs are not doing or planning on doing this already.

The same way that reasoning models are trained on chains of thought, why not do it with program state?

Just have a "separate" scratchpad where the AI keeps the expected state of the program. You can verify whether that state is correct, then use RL to train the AI to always keep it correct.
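A rough sketch of that verification step (the reward shape and the `check_prediction` helper are illustrative assumptions, not a known lab recipe):

```python
def check_prediction(code, predicted):
    """Execute a snippet and score how many predicted final variable
    values match reality; the fraction serves as an RL reward in [0, 1]."""
    actual = {}
    exec(code, {}, actual)  # run the snippet, capturing its final locals
    correct = sum(1 for k, v in predicted.items() if actual.get(k) == v)
    return correct / max(len(predicted), 1)

code = "x = 10\ny = x * 3\nz = y - 5"
# The model's scratchpad claims x=10, y=30, z=20; z is actually 25,
# so only 2 of 3 predictions are right.
reward = check_prediction(code, {"x": 10, "y": 30, "z": 20})
```

The appeal is that the ground truth is computed, not annotated: any runnable snippet becomes a verifiable training signal.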

hunterbrooks

3 days ago

Yes. My business is in the code-review space; synthetic data is very helpful for evals.

llm_trw

3 days ago

Probably. You don't know until someone tries.

melvinroest

3 days ago

This would be amazing with Smalltalk/Pharo or a similar language where debugging is a first-class citizen (I guess it's the same for certain Lisp dialects?)

koito17

2 days ago

In the case of Common Lisp, yes. Other lisps (e.g. Clojure) don't really have interactive debugging, value inspection, live disassembly, or anything like the condition system (which allows programmatic access to the debugger and automatically recovering from certain conditions).

The compiler, debugger, and runtime are always present. When optimizing Common Lisp code, I found it very useful to add types, refactor structs, recompile functions, then disassemble the functions and see if the compiler generated efficient machine code. This is all natively supported by the runtime. Editor tooling does little more than create a nice-looking UI. Doing this in Clojure is not really possible, since it's hard to guess what HotSpot will do with a given sequence of JVM bytecode.

jasonjmcghee

5 days ago

Nice! I recently had the same idea and built it using MCP (in order to be client/LLM agnostic) and VS Code (DAP would be even better, but I haven't tried tackling it).

https://github.com/jasonjmcghee/claude-debugs-for-you

mohsen1

5 days ago

That's a really cool project. Why did you stop developing it? For me, it's a lot of work and I don't have the bandwidth to productionize this.

jasonjmcghee

3 days ago

Wow, this really blew up since I last commented! Congrats! If you're interested in developing this further, I encourage you to check out how I made it language-agnostic instead of Node.js-only; it's really not much additional effort.

jasonjmcghee

4 days ago

I wouldn't say I stopped developing it; there are always features that could be added, but it serves the purpose it's meant to! You can interact with an LLM directly, and it can work just like any other chat, but it can also perform its own debugging / investigations.

flembat

a day ago

Really great idea. I currently work for an AI, compiling and debugging its code; at least that's what it sometimes feels like. Who is the agent here, exactly? The fact that the AI has no understanding at all of what we are doing, and does not apply the information it does know to solve problems, is challenging. If it at least debugged the code, it would be able to see that it is clobbering the same registers it is using, instead of me having to explain that to it. Fortunately I am talking about my hobby projects; I pity people who are doing this for a living now.

emeryberger

2 days ago

Nice UI. We started on a project that does this about 2 years ago. ChatDBG (https://github.com/plasma-umass/ChatDBG), downloaded about 70K times to date. It integrates into debuggers like `lldb`, `gdb`, and `pdb` (the Python debugger). For C/C++, it also leverages a language server, which makes a huge difference. You can also chat with it. We wrote a paper about it, should be published shortly in a major conference near you (https://arxiv.org/abs/2403.16354). One of the coolest things we found is that the LLM can leverage real-world knowledge to diagnose errors; for example, it successfully debugged a problem where the number of bootstrap samples was too low.

dboreham

2 days ago

Hopefully work like this has the side effect of making debuggers work again. In my experience they seldom do these days (except in old-school tech like C, C++, and Go), presumably because younger folks were told in college that debugging wasn't necessary. I don't mean that debuggers just don't run, but that they're sufficiently broken that they're not worth using. Perhaps an LLM that adds print statements to code and reads the output would be more in keeping with the times?

ericb

3 days ago

Very cool concept! There's a lot of potential in reducing the try-debug-fix cycle for LLMs.

On a related note, here's a Ruby gem I wrote that captures variable state from the moment an Exception is raised. It gets you non-interactive text-based debugging for exceptions.

https://rubygems.org/gems/enhanced_errors

K0IN

3 days ago

Hey, this is lovely!

I created an extension to help me debug a while back [0], and I had thought about this (AI integration) for a long time, but didn't have the time to tackle it.

Thank you so much for sharing!

I might need to add this approach to my extension as well!

[0] https://github.com/K0IN/debug-graph

bravura

3 days ago

A time-traveling debugger for Python + LLM would be amazing.

mettamage

3 days ago

Tangent: I now want a 10 hour YouTube video where an LLM gets stuck in some debugging reasoning loop and the loop is just recorded for 10 hours.

Preferably with some lofi music under it.

jasonjmcghee

3 days ago

This isn't difficult to do once you have debugging capabilities exposed to the LLM in VS Code. You just need the proper launch.json and to expose "step back".

crest

3 days ago

I like the first paragraph of the README clearly stating that this is your research project instead of making a lot of grandiose claims.

stuaxo

4 days ago

Nice, I did this manually once with ipdb, just cutting and pasting the text to LLM and having it tell me which variables to inspect and what to press.

bwhiting2356

2 days ago

Thank you for this. Would love to see it integrated into Copilot or Cursor.

codenote

5 days ago

Interesting experiment! This feels like it could really expand the potential applications of LLMs. Exciting to see how AI can assist in debugging with live runtime context!

maeil

3 days ago

Extremely LLM-like writing style, do you translate your comments through one or something?

bbarnett

2 days ago

Do not let it debug itself!!

jbmsf

3 days ago

Honestly, this is the first LLM concept that makes me want to change my workflow. I don't use VS Code, but I'm excited by the idea.

jasonjmcghee

2 days ago

Theoretically there's no reason you need to use VS Code; it's just very easy to make extensions for.