So isolation is the core property. Forking a sandbox gives you multiple exact duplicates of an isolated environment.
When your coding agent has 10 ideas for what to do next, evaluating them fairly means evaluating each one in isolation.
If you're building a website-testing agent and it's halfway through a site, with a form half filled out, a session in progress, etc., and it realizes it wants to test 2 things in isolation, forking is the only way.
We also envision this powering the next generation of dev cycles: "AI agent, go try these 10 things and tell me which works best." The AI forks the environment 10 times, gets 10 exact copies, does the thing in each of them, evaluates the results, then takes the best option.
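In code, that fork-and-compare loop might look like this sketch (`sandbox.fork()`, `try_idea`, and `score` are hypothetical stand-ins for whatever forking API and evaluation you actually have):

```python
def pick_best(sandbox, ideas, try_idea, score):
    """Fork the environment once per idea, try each in its own exact copy,
    and return the idea whose outcome scores highest."""
    best_idea, best_score = None, float("-inf")
    for idea in ideas:
        fork = sandbox.fork()            # exact duplicate, including in-flight state
        outcome = try_idea(fork, idea)   # run the experiment in isolation
        s = score(outcome)
        if s > best_score:
            best_idea, best_score = idea, s
    return best_idea
```

The point is that each idea runs against an identical copy of the current state, so the comparison is apples to apples.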
The other way might be separate testing VMs vs. agent VMs, but that would be slower: to "fork", it would need to re-run the test up to that point. Though it wouldn't need the agent's context.
The forking you provided adds a lot more speed.
That, plus it's not always simple to replicate state. A QA agent in the future could run for hours to trigger an edge case that wouldn't reproduce even if every action to get there were taken again.
That can happen via race conditions, edge states, external service bugs.
Yep, I can see this, especially when the agent is spinning up test servers/smoke tests and you don't want those conflicting. How do we reconcile all the potentially different git hashes, though? Upstream, I guess, etc. (This might have an easy answer; I'm not super proficient with git, so forgive me.)
So we recommend a branch per fork; merge what you like.
You currently have to change the branch on each fork individually, and that's unlikely to change in the short term due to the complexity of git internals, but it's not that hard to do yourself: `git checkout -b fork-{whateverDiscriminator}`.
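Scripted, that per-fork branching might look like the following sketch (`fork_id` is whatever discriminator you choose: an index, a UUID, the idea's name):

```python
import subprocess

def checkout_fork_branch(repo_dir: str, fork_id: str) -> str:
    """Inside one forked sandbox, create and switch to that fork's own
    branch so the forks' commits never collide with each other."""
    branch = f"fork-{fork_id}"
    subprocess.run(["git", "checkout", "-b", branch], cwd=repo_dir, check=True)
    return branch
```

Run once inside each fork right after it's created; merging the winner back upstream afterwards is ordinary git.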
Have you considered git worktree?
Great for simple things, but git worktrees don't help when you also have to fork running processes like Postgres or complex apps.
For Postgres there are pg containers; we use them in pytest fixtures for thousands of unit tests running concurrently. I imagine you could run them for integration-test purposes too. What kind of testing would you run with these that can't be run with pg containers, or isn't covered by conventional testing?
I'll say this is still a quite useful win for browser-control use cases, and also for debugging their crashes.
Agreed. The thing I'd be most interested in is the isolated execution environment you mentioned. Agents running on autopilot are powerful. Agents running unsupervised on a machine with developer permissions and certificates, where anything could influence the agent to act on an attacker's behalf, are terrifying.
I recommend running the agent harness outside of the computer. The mental model I like to use is that the computer is a tool the agent is using, and anything in the computer is untrusted.
I would recommend not giving an agent the full run of any computing environment. Do you handle fine-grained internet access controls and credential injection like OpenShell does?
I used to believe this, but I think the next generation of agents is much more autonomous and just needs a computer.
The work of a developer is open ended, so we use a computer for it. We don't try to box developers into small granular screwdrivers for each small thing.
That's what's coming to all agents: they might want to run some analysis with Python, generate a website/document in TypeScript, and store data in Markdown files or in MongoDB. I expect them to get much more autonomous, and with that to end up just needing computers, like us.
The difference is that I am not always legally liable for what a rogue developer does with their computer - if I had no knowledge of what they were up to and had clear policies they violated then I'm probably fine. But I'm definitely always liable for anything an agent I created does with the computer I gave it.
And while they are getting better, I see them doing some spectacularly stupid shit sometimes that just about no person would ever do. If you tell an agent to do something and it can't do what it thinks you want in the most straightforward way, there's really no way to put a limit on what it might try in order to fulfill its understanding of its assignment.
The problem is the agent, which should be treated as untrusted.
The computer isn’t the problem
Kind of. The chat logs of the agent are trustworthy, as is any telemetry you have on it or coming out of the VM. Its behavior should be treated as probabilistic and therefore untrustworthy.
It’s untrustworthy because its context can be poisoned and then the agent is capable of harm to the extent of whatever the “computer” you give it is capable of.
The mitigation is to keep what it can do to "just the things I want it to do" (e.g. branch protection and the like, whitelisted domains/paths), and to keep all the credentials off its box and inject them inline as needed via a proxy/gateway.
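As a toy sketch of that gateway logic (the hosts and token below are made up; a real setup would run this as a forward proxy outside the agent's VM, not as code the agent can read):

```python
from urllib.parse import urlsplit

# Both tables live on the gateway, outside the agent's box.
ALLOWED_HOSTS = {"api.github.com", "registry.npmjs.org"}   # example whitelist
SECRETS = {"api.github.com": "ghp_exampletoken"}           # never handed to the agent

def gate_request(url: str, headers: dict) -> dict:
    """Reject requests to non-whitelisted hosts; inject the credential for
    allowed ones so the agent never holds the secret itself."""
    host = urlsplit(url).hostname
    if host not in ALLOWED_HOSTS:
        raise PermissionError(f"blocked host: {host}")
    out = dict(headers)
    token = SECRETS.get(host)
    if token:
        out["Authorization"] = f"Bearer {token}"
    return out
```

Even a poisoned agent can then only reach whitelisted hosts, and exfiltrating its own headers leaks nothing, since the credential is attached after the request leaves its box.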
I mean, that’s already something you can do for humans also.