mohsen1
3 days ago
To people who do this sort of thing:
Does generating synthetic data using this make sense for RL of models to be really good at debugging code? Current LLMs have inhaled essentially all the code in the world, but that data is only the text of the code (maybe plus the changes that fixed bugs, etc.); the amount of insight that can be generated by actually running the code and capturing the runtime values, step by step, is almost infinite.
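For concreteness, here's a toy Python sketch of the kind of step-by-step runtime data I mean (sys.settrace is just one way to capture it; the buggy function is made up):

    import sys

    trace = []

    def trace_lines(frame, event, arg):
        # record the local variable values at every executed line
        if event == "line":
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return trace_lines

    def buggy_sum(xs):
        total = 0
        for x in xs:
            total += x * x  # suppose the intent was a plain sum
        return total

    sys.settrace(trace_lines)
    buggy_sum([1, 2, 3])
    sys.settrace(None)

    for lineno, local_vars in trace:
        print(lineno, local_vars)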
Is this sort of data useful for training LLMs?
marxism
3 days ago
Not only that, it measurably improves coding ability too (according to our tests fine-tuning Llama 2).
Back in 2023 we sold a training dataset (~500B tokens, but we could have generated more) with exactly this kind of data. The dataset was a bunch of ~1-3KB text-file examples: a code snippet, then some variable values observed when the program was at line X, then a prompt asking the LLM to predict what some print statements at line Y would output.
If you wanted to train on stack traces or control flow graphs we offered that too.
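To give a flavor, one of those fill-in-the-values samples looked roughly like this (format and numbers invented here for illustration, not pulled from the real dataset):

    # hypothetical shape of one training example
    example = {
        "code": (
            "1: def f(n):\n"
            "2:     acc = 0\n"
            "3:     for i in range(n):\n"
            "4:         acc += i * i\n"
            "5:     return acc\n"
        ),
        "observation": "entering line 4 with i=2: acc=1",
        "question": "with n=4, what does print(acc) output just before line 5?",
        "answer": "14",
    }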
Our MVP used rr to attach to large programs like Excel and Chrome, then used a heuristic to filter samples for more "interesting" examples. The theory at the time was: sure, LLMs can learn a few information-theoretic bits from many noisy examples, but why not spend a little intelligence on our end and knock out a couple of low-hanging "noise" sources? We used a Prolog engine to find traces where all the information required to predict the final memory address was present in the initial program description. This turned out not to matter too much, because the LLM would learn from non-deterministic examples too (shrug). Eventually we ended up with a monkey-patched Chromium browser wandering around the internet to collect JavaScript examples.
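A much cruder stand-in for that filtering idea, just to show the shape of it (the real Prolog analysis was more involved than rerunning things):

    def run_once(src):
        # execute the snippet in a fresh namespace and return whatever it leaves in `result`
        ns = {}
        exec(src, ns)  # only safe for trusted, self-contained snippets
        return ns.get("result")

    def looks_deterministic(src, runs=3):
        # crude noise filter: keep the sample only if repeated runs agree
        return len({repr(run_once(src)) for _ in range(runs)}) == 1

    noisy = "import random\nresult = random.randint(0, 10)"
    print(looks_deterministic(noisy))  # almost certainly False -> filtered out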
We sold datasets, and we also offered an on-prem agent where you could generate training examples on demand using the spare CPU cycles on GPU nodes.
It seemed like a beautiful idea at the time, but it sputtered out because every serious outfit we talked to would ask a bunch of questions and then conclude that they would rather do it in-house.
I would bet dollars to donuts that most of the AI scraping load people are complaining about has less to do with grabbing the text content of their forum and more to do with executing their JavaScript.
Speaking from experience: we didn't really understand V8 internals, so rather than doing something smarter, like isolate snapshotting or other fine-grained manipulation of VM state, our training-data harvesting bot relied on continuous navigation, reloading pages over and over to drive more execution and take more samples.
Edit: email in my profile if anyone wants to talk about this. I'm feeling a wave of nostalgia thinking about this project again. Despite commercial failure, it was arrestingly/dangerously interesting.
silveraxe93
3 days ago
I'd be extremely surprised if AI labs are not doing or planning on doing this already.
The same way that reasoning models are trained on chains of thought, why not do it with program state?
Just have a "separate" scratchpad where the AI keeps the expected state of the program. That state can be verified against actual execution, and RL can be used to train the AI to always keep it correct.
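A toy sketch of what I mean (my own framing, not anything a lab has published): the scratchpad prediction is checked against real execution, and that check is the reward.

    def verify_scratchpad(code, predicted_state):
        # execute the snippet and compare the model's predicted variable values
        # against the real final namespace; the match is the (binary) reward
        ns = {}
        exec(code, ns)
        actual = {k: v for k, v in ns.items() if not k.startswith("__")}
        return 1.0 if predicted_state == actual else 0.0

    snippet = "x = 3\ny = x * x + 1"
    print(verify_scratchpad(snippet, {"x": 3, "y": 10}))  # 1.0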
hunterbrooks
3 days ago
Yes. My business is in the code review space, and synthetic data is very helpful for evals.
llm_trw
3 days ago
Probably. You don't know until someone tries.