hevalon
4 hours ago
Author here. The search algorithm was the easy part. The LLM already encodes domain knowledge from ML papers: it knows that learning-rate warmup helps transformers and that batch size and learning rate are coupled. It converged on the winning GRPO config by iteration 1; grid search needed 8 iterations.
The hard part was per-iteration GPU isolation. A botched run that leaves stale optimizer state or corrupted weights in memory will poison the next iteration. Each iteration needs a fresh CUDA runtime, fresh filesystem, and fresh memory, with no state leaks. That's where most of the engineering went: ephemeral containers with TTL-based cleanup, one A100 per iteration, torn down after metrics are emitted.
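The per-iteration setup looks roughly like this. This is a minimal sketch, not the repo's actual code: the helper names, image tag, and flag choices are my illustration of the pattern (fresh disposable container, one pinned GPU, read-only root filesystem, a TTL so a hung run is reaped).

```python
import subprocess

def run_cmd(iteration: int, ttl_s: int = 3600) -> list[str]:
    """Build a docker invocation for one isolated training iteration.
    Hypothetical image name and script; illustrates the isolation pattern only."""
    return [
        "docker", "run", "--rm",       # --rm: container destroyed on exit, nothing persists
        "--name", f"rl-iter-{iteration}",
        "--gpus", "device=0",          # pin a single GPU to this iteration
        "--read-only",                 # fresh filesystem; no stale checkpoints
        "--tmpfs", "/tmp",             # scratch space lives and dies with the container
        "trainer:latest",
        "timeout", str(ttl_s),         # TTL: kill a hung run so it can't block the queue
        "python", "train.py", "--iter", str(iteration),
    ]

def run_iteration(iteration: int) -> int:
    """Launch one iteration; a nonzero exit fails this iteration only,
    the next one starts from a clean container regardless."""
    return subprocess.run(run_cmd(iteration)).returncode
```

The key property is that cleanup is the default: `--rm` plus tmpfs scratch means a crashed or poisoned run takes its state down with it, rather than relying on the next iteration to scrub leftovers.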
Happy to answer questions. Code: https://github.com/one-covenant/autoresearch-rl