simonw
2 months ago
I think the most interesting thing about this is how it demonstrates that a very particular kind of project is now massively more feasible: library porting projects that can be executed against implementation-independent tests.
The big unlock here is https://github.com/html5lib/html5lib-tests - a collection of 9,000+ HTML5 parser tests in their own implementation-independent file format, e.g. this one: https://github.com/html5lib/html5lib-tests/blob/master/tree-...
The Servo html5ever Rust codebase uses them. Emil's JustHTML Python library used them too. Now my JavaScript version gets to tap into the same collection.
This meant that I could set a coding agent loose to crunch away on porting that Python code to JavaScript and have it keep going until that enormous existing test suite passed.
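For anyone who hasn't looked at them, the tree-construction tests are plain text .dat files split into #data / #errors / #document sections, one test per block. A loader is only a handful of lines; here's a sketch (it only handles the common sections and ignores optional ones like #document-fragment and #script-on/off):

  # Sketch of a loader for html5lib-tests tree-construction .dat files.
  # Handles only the common #data / #errors / #document sections.
  def load_dat(path):
      tests, current, section = [], {}, None
      with open(path, encoding="utf-8") as f:
          for line in f.read().split("\n"):
              if line.startswith("#"):
                  section = line[1:]
                  if section == "data" and current:
                      tests.append(current)  # a new #data starts the next test
                      current = {}
                  current[section] = []
              elif section is not None:
                  current[section].append(line)
      if current:
          tests.append(current)
      return [
          {
              "data": "\n".join(t.get("data", [])),
              "errors": t.get("errors", []),
              "document": "\n".join(t.get("document", [])).rstrip("\n"),
          }
          for t in tests
      ]

With the tests in that shape, the agent's loop is basically: parse "data", serialize the resulting tree in the indented "| " format, diff against "document", repeat until everything passes.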
Sadly conformance test suites like html5lib-tests aren't that common... but they do exist elsewhere. I think it would be interesting to collect as many of those as possible.
avsm
2 months ago
The html5lib conformance tests when combined with the WHATWG specs are even more powerful! I managed to build a typed version of this in OCaml in a few hours ( https://anil.recoil.org/notes/aoah-2025-15 ) yesterday, but I also left an agent building a pure OCaml HTML5 _validator_ last night.
This run has (just in the last hour) combined the html5lib expect tests with https://github.com/validator/validator/tree/main/tests (which are a complex mix of Java RELAX NG stylesheets and code) in order to build a low-dependency pure OCaml HTML5 validator with types and modules.
This feels like formal verification in reverse: we're starting from a scattered set of facts (the expect tests) and iterating towards more structured specifications, using functional languages like OCaml/Haskell as convenient executable pitstops while driving towards proof reconstruction in something like Lean.
leafmeal
2 months ago
This totally makes me think of Martin Kleppmann's recent blog post about how AI will make verified software much easier to use in practice! https://martin.kleppmann.com/2025/12/08/ai-formal-verificati...
yuppiemephisto
2 months ago
I’m doing something similar, porting shellcheck from Haskell -> Lean.
Havoc
2 months ago
Was struggling yesterday with porting something (Python -> Rust). The LLM couldn't figure out what was wrong with the Rust version no matter how I came at it (I even gave it Wireshark traces), and being vibecoded I had no idea either. Eventually I copied the Python source into the Rust project and asked it to compare... immediate success.
Turns out they're quite good at that sort of pattern matching across languages. Makes sense from a latent-space perspective, I guess.
wuschel
2 months ago
Could you elaborate a bit on your example? What do you mean by "that sort of pattern matching" and the "latent space perspective" argument?
Thanks!
gwking
2 months ago
I’ve idly wondered about this sort of thing quite a bit. The next step would seem to be taking a project’s implementation-dependent tests, converting them to an independent format and verifying them against the original project, then conducting the port.
skissane
2 months ago
Give coding agent some software. Ask it to write tests that maximise code coverage (source coverage if you have source code; if not, binary coverage). Consider using concolic fuzzing. Then give another agent the generated test suite, and ask it to write an implementation that passes. Automated software cloning. I wonder what results you might get?
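A rough sketch of the loop I have in mind, assuming coverage.py and pytest on the measurement side; ask_agent() is a stand-in for whatever coding agent you plug in:

  # Sketch of a coverage-driven test-writing loop, not a turnkey tool.
  # ask_agent() is hypothetical: wire it to your coding agent of choice.
  import json
  import subprocess

  def ask_agent(prompt: str) -> None:
      raise NotImplementedError("plug your coding agent in here")

  def measure_coverage():
      subprocess.run(["coverage", "run", "-m", "pytest", "-q"], check=False)
      subprocess.run(["coverage", "json", "-o", "coverage.json"], check=True)
      with open("coverage.json") as f:
          report = json.load(f)
      missing = {
          path: info["missing_lines"]
          for path, info in report["files"].items()
          if info["missing_lines"]
      }
      return report["totals"]["percent_covered"], missing

  percent, missing = measure_coverage()
  while percent < 90:  # arbitrary target
      ask_agent(
          "Write pytest tests that execute these currently uncovered lines:\n"
          + "\n".join(f"{path}: {lines}" for path, lines in missing.items())
      )
      percent, missing = measure_coverage()

The point is just that the agent gets the uncovered-lines list fed back each round instead of guessing.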
gaigalas
2 months ago
> Ask it to write tests that maximise code coverage
That is significantly harder to do than writing an implementation from tests, especially for codebases that previously didn't have any testing infrastructure.
skissane
2 months ago
Give a coding agent a codebase with no tests and tell it to write some, and it will - if you don’t tell it which framework to use, it will just pick one. No denying you’ll get much better results if an experienced developer provides it with some prompting on how to test than if you just let it decide for itself.
joshstrange
2 months ago
This is a hilariously naive take.
If you’ve actually tried this, and actually read the results, you’d know this does not work well. It might write a few decent tests, but get ready for an impressive number of tests and cases with no real coverage.
I did this literally 2 days ago and it churned for a while and spit out hundreds of tests! Great news right? Well, no, they did stupid things like “Create an instance of the class (new MyClass), now make sure it’s the right class type”. It also created multiple tests that created maps then asserted the values existed and matched… matched the maps it created in the test… without ever touching the underlying code it was supposed to be testing.
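To illustrate the shape of it (not the actual code):

  # Illustrative only: the kind of tests that inflate the count
  # without exercising the code under test.
  class UserStore:                 # stand-in for the real class under test
      def lookup(self, key): ...

  def test_user_store_is_user_store():
      store = UserStore()
      assert isinstance(store, UserStore)   # cannot fail

  def test_lookup_map():
      lookup = {"a": 1, "b": 2}             # built inside the test...
      assert lookup["a"] == 1               # ...asserted against itself;
                                            # UserStore.lookup is never called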
I’ve tested this on new codebases, old codebases, and vibe-coded codebases; the results vary slightly, and you absolutely can use LLMs to help with writing tests, no doubt, but “just throw an agent at it” does not work.
lsaferite
2 months ago
This highlights something that I wish were more prevalent: path coverage. I'm not sure which testing suites handle path coverage, but I know XDebug for PHP could manage it back when I was doing PHP work. Simple line coverage doesn't tell you enough of the story, while path coverage should let you be sure you've tested all code paths of a unit. Mix that with input fuzzing and you should be able to develop comprehensive unit tests for critical units in your codebase. Yes, I'm aware that's just one part of a large puzzle.
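To make the distinction concrete (Python rather than PHP, but the same idea):

  # Two tests give 100% line coverage (and 100% branch coverage) here,
  # but only 2 of the 4 possible paths through the function are executed.
  def final_price(price, discounted, tax_exempt):
      if discounted:
          price *= 0.9
      if not tax_exempt:
          price *= 1.2
      return price

  def test_discounted_taxed():       # path: discount yes, tax yes
      assert round(final_price(100, True, False), 2) == 108.0

  def test_full_price_exempt():      # path: discount no, tax no
      assert final_price(100, False, True) == 100

  # The "discount yes, tax no" and "discount no, tax yes" paths never run,
  # which is exactly what path coverage flags and line coverage won't.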
skissane
2 months ago
But, did you actually give the agent access to a tool to measure code coverage?
If it can't measure whether it is succeeding in increasing code coverage, no wonder it doesn't do that great a job in increasing it.
Also, it can help if you have a pair of agents (which could even be just two different instances of the same agent with different prompting) – one to write tests, and one to review them. The test-writing agent writes tests and submits them as a PR; the PR-reviewing agent reads the PR and provides feedback; the test-writing agent updates the tests in response to the feedback; iterate until the PR-reviewing agent is satisfied. This can produce much better tests than just an agent writing tests without any automated review process.
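Sketched as a loop, with write_tests() and review() standing in for the two differently-prompted agent instances (hypothetical names, not any particular framework):

  # Write/review loop sketch; both functions wrap agent calls in practice.
  def write_tests(feedback: str | None = None) -> str:
      raise NotImplementedError  # test-writing agent; returns a patch/PR

  def review(patch: str) -> tuple[bool, str]:
      raise NotImplementedError  # reviewing agent; returns (approved, comments)

  patch = write_tests()
  approved, comments = review(patch)
  while not approved:
      patch = write_tests(feedback=comments)
      approved, comments = review(patch)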
gaigalas
2 months ago
Have you tried? Beyond the first tests, going all the way up to decent coverage.
pbowyer
2 months ago
I think I've asked this before on HN but is there a language-independent test format? There are multiple libraries (think date/time manipulation for a good example) where the tests should be the same across all languages, but every library has developed its own test suite.
Having a standard test input/output format would let test definitions be shared between libraries.
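Even a directory of JSON test vectors plus a thin per-language runner would get most of the way there. A sketch with an invented schema (op / input / expected), using date arithmetic as the example:

  # Language-independent test vectors (JSON) plus a thin Python runner.
  # The op/input/expected schema is invented for illustration.
  import json
  from datetime import date, timedelta

  VECTORS = json.loads("""
  [
    {"op": "add_days", "input": {"date": "2024-02-28", "days": 1},
     "expected": "2024-02-29"},
    {"op": "add_days", "input": {"date": "2023-02-28", "days": 1},
     "expected": "2023-03-01"}
  ]
  """)

  def run(vector):
      d = date.fromisoformat(vector["input"]["date"])
      result = d + timedelta(days=vector["input"]["days"])
      assert result.isoformat() == vector["expected"], vector

  for v in VECTORS:
      run(v)

Every implementation ships the same vectors and only rewrites the dozen-line runner.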
cr125rider
2 months ago
I’ve got to imagine it would be very hard for a suite of end-to-end tests (probably the most common setup is fixture file in, assert against an output fixture file) to nail all of the possible branches and paths. Like the example here, thousands of well-made tests are required.
pplonski86
2 months ago
This is amazing. Porting a library from one language to another is easy for LLMs: they are tireless and know coding syntax very well. What I like about machine learning benchmarks is that agents develop and test many solutions, and this search process is very human-like. Yesterday I was looking into MLE-Bench for benchmarking coding agents on machine learning tasks from Kaggle: https://github.com/openai/mle-bench There are many projects that provide agents whose performance is simply incredible; they can solve several Kaggle competitions in under 24 hours and reach a medal placement. I think this is already above human level. I was reading the ML-Master article, which describes AI4AI, where AI is used to create AI systems: https://arxiv.org/abs/2506.16499
heavyset_go
2 months ago
This is one of the reasons I'm keeping tests to myself for a current project. Usually I release libraries as open source, but I've been rethinking that, as well.
simonw
2 months ago
Oddly enough my conclusion is the opposite: I should invest more of my open source development work in creating language-independent test suites, because they can be used to quickly create all sorts of useful follow-on projects.
heavyset_go
2 months ago
I'm not that generous with my time lol
cortesoft
2 months ago
Isn't the point that you might be one of the people who benefits from one of those follow on projects? That is kind of the whole point of open source.
Why are you making your stuff open source in the first place if you don't want other people to build off of it?
heavyset_go
2 months ago
> Why are you making your stuff open source in the first place if you don't want other people to build off of it?
Because I enjoy the craft. I will enjoy it less if I know I'm being ripped off, likely for profit, hence my deliberate choices of licenses, what gets released and what gets siloed.
I'm happy if someone builds off of my work, as long as it's on my own terms.
bgwalter
2 months ago
Open source has three main purposes, in decreasing order of importance:
1) Ensuring that there is no malicious code and enabling you to build it yourself.
2) Making modifications for yourself (Stallman's printer is the famous example).
3) Using other people's code in your own projects.
Item 3) is wildly over-propagandized as the sole reason for open source. Hard forks have traditionally led to massive flame wars.
We are now being told by corporations and their "AI" shills that we should diligently publish everything for free so the IP thieves can profit more easily. There is no reason to oblige them. Hiding test suites in order to make translations more difficult is a great first step.
inejge
2 months ago
> Hard forks have traditionally led to massive flame wars.
Provided that the project is popular and has a community, especially a contributor community (the two don't have to go together.) Most projects aren't that prominent.
visarga
2 months ago
I think the only non-slop parts of the web are: open source, Wikipedia, arXiv, some game worlds and social network comments in well-behaved/moderated communities. What do they have in common? They all allow building on top, they are social first, people come together for interaction and collaboration.
The rest is the enshittified web, focused on attention grabbing, retention dark patterns and misinformation. They all exist to make a profit off our backs.
A pattern I see is that we moved on from passive consumption and now want interactivity, sociality and reuse. We like to create together.
nicoburns
2 months ago
If you don't trust the AI-generated code yourself, then you won't benefit from it. And in fact all it does is take resources from the project that you work on, the one that's generating all the value in the first place.
There are strong parallels to the image generation models that generate images in the style of Studio Ghibli films. Does that benefit Studio Ghibli? I'd argue not. And if we're not careful, it will undermine the business model that produced the artwork in the first place (which the AI is not currently capable of doing).
aadishv
2 months ago
I wonder if this makes AI models particularly well-suited to ML tasks, or at least ML implementation tasks, where you are given a target architecture and dataset and have to implement and train the given architecture on the given dataset. There are strong signals to the model, such as loss, which are essentially a slightly less restricted version of "tests".
montroser
2 months ago
We've been doing this at work a bunch with great success. The most impressive moment for me was when the model we were training overfit in a particular way, and rather than just claiming victory (as it all too often does), this time Claude went and added a bunch more robust, human-grade examples to our training data and holdout set, and kept iterating until the model effectively learned the actual crux of what we were trying to teach it.
aadishv
a month ago
That's genuinely impressive. Excited to see how the rapid progress will make it more and more autonomous in the future
simonw
2 months ago
I'm certain this is the case. Iterating on ML models can actually be pretty tedious - lots of different parameters to try out, then you have to wait a bunch, then exercise the models, then change parameters and try again.
Coding agents are fantastic at these kinds of loops.
aadishv
a month ago
A rudimentary form of self-improving intelligence :D
tracnar
2 months ago
If you're porting a library, you can use the original implementation as an 'oracle' for your tests. Which means you only need a way to write/generate inputs, then verify the output matches the original implementation.
It doesn't work for everything of course, but it's a nice way to get bug-for-bug compatible rewrites.
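Something like this, where original_parse / ported_parse are placeholders for however the two implementations are exposed (import, FFI, subprocess):

  # Differential testing sketch: random inputs, original implementation
  # as the oracle. The two parse functions are placeholders.
  import random
  import string

  def original_parse(html: str) -> str:
      raise NotImplementedError  # e.g. call the reference library

  def ported_parse(html: str) -> str:
      raise NotImplementedError  # e.g. call the new port

  def random_input(rng, length=40):
      alphabet = string.ascii_letters + "<>/= \"'"
      return "".join(rng.choice(alphabet) for _ in range(length))

  rng = random.Random(0)  # fixed seed so failures are reproducible
  for _ in range(1000):
      case = random_input(rng)
      assert ported_parse(case) == original_parse(case), case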
bzmrgonz
2 months ago
I see it as a learning or training tool for AI. The same way we use mock exams/tests to verify our skill and knowledge absorption and prepare for the real thing or career, this could be one of many obstacles in an obstacle course which a coding AI would have to navigate in order to "graduate".
cies
2 months ago
This is an interesting case. It may be good to feed it to other models and see how they do.
Also: it may be interesting to port it to other languages too and see how they do.
JS and Py are both runtime-typed and very well "spoken" by LLMs. Other languages may require a lot more "work" (data types, etc.) to get the port done.
exclipy
2 months ago
Can you port tsc to Go in a few hours?