trembolram
19 minutes ago
The same guy reimplemented Amiga LZX in Rust. https://bitplane.net/dev/rust/amiga-lzx/
He seems to like doing this kind of stuff.
10 hours ago
Funny thing: a couple of days ago I did the same with uniq and a single skill.md, repo: https://github.com/daedalus/uniq-reconstruction. After that success I tried with rar but failed. Kudos
10 hours ago
I have never attempted something so ambitious with AI, but this feels spot on in terms of experience. As you cede more control to the model, you will find yourself losing control over things like code quality and performance.
9 hours ago
You have to make performance part of the spec. It will then create benchmarks and plan differently. If you omit this, you get what you get.
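As a sketch of what "part of the spec" can mean concretely: a Criterion throughput benchmark with a stand-in compress function (the real entry point would come from the crate under test):

    // Hypothetical benchmark gate: track compression throughput so
    // regressions show up in CI rather than in user reports.
    use criterion::{criterion_group, criterion_main, Criterion, Throughput};

    // Stand-in for the real compressor; swap in the actual entry point.
    fn compress(data: &[u8]) -> Vec<u8> {
        data.to_vec()
    }

    fn bench_compress(c: &mut Criterion) {
        let data = vec![0u8; 1 << 20]; // 1 MiB sample input
        let mut group = c.benchmark_group("compress");
        group.throughput(Throughput::Bytes(data.len() as u64));
        group.bench_function("1MiB_zeros", |b| b.iter(|| compress(&data)));
        group.finish();
    }

    criterion_group!(benches, bench_compress);
    criterion_main!(benches);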
2 hours ago
First, isn't the RAR compression algorithm proprietary somehow?
Second, why compress to RAR if you can compress to 7z?
an hour ago
One thing it has built in that other formats don't is optional redundancy, so it can recover data from a damaged archive.
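To illustrate the idea with a toy (a sketch only; RAR's actual recovery records use Reed-Solomon coding, which tolerates multiple lost blocks, while a single XOR parity block tolerates just one):

    // One parity block XORed across N data blocks: any single missing
    // block can be rebuilt by XORing the survivors with the parity.
    fn parity(blocks: &[Vec<u8>]) -> Vec<u8> {
        let mut p = vec![0u8; blocks[0].len()];
        for b in blocks {
            for (pi, bi) in p.iter_mut().zip(b) {
                *pi ^= bi;
            }
        }
        p
    }

    fn main() {
        let blocks = vec![vec![1u8, 2, 3], vec![4, 5, 6], vec![7, 8, 9]];
        let p = parity(&blocks);
        // Pretend block 1 was damaged; survivors plus parity rebuild it.
        let survivors = vec![blocks[0].clone(), blocks[2].clone(), p];
        assert_eq!(parity(&survivors), blocks[1]);
    }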
10 hours ago
> "For the last 15 months or so my hobby has been shouting at Claude"
How can you shout at Claude when it’s
1) foobaring, bamblabooing and fghrtawing all the time without telling you what’s going on
2) when it finally interacts, it’s asking for a permission you told it 30 seconds ago "yes and do not ever ask me again until heat death of the Universe"
3) and after all of that, it just spits out: "you’re out of tokens, give up your liver or wait until next Trump’s war"
7 hours ago
> foobaring, bamblabooing and fghrtawing all the time without telling you what’s going on
Oh man, now I have to plug my tool[0]... it doesn't hide anything, but by default tries to provide a pleasant interface (ctrl+o to toggle details similar to CC, but less janky?)
Disclaimer: It's way simpler than Claude Code or even pi (on purpose)
3 hours ago
The proper way to work with Claude or Codex is, IMO, to load up the context with a discussion about what you're doing and why. You go back and forth, pushing back on its opinions and shaping the context until the tokens are ready to flow into the right shape. Every angle you miss is an opportunity for them to slop out all over the place, and, until Codex was mature, the longer you ran the task for, the more it'd spread out and lose shape.
Re-shaping the context sometimes involves severe pressures like "wtf is this ugly crap?" or "did I just spot you laying a turd in my codebase again?" and other strong forms of disapproval, mixed with "hmm not sure I like the sound of that"s, to "yea that's much better" to pull it back in the other direction.
The trick is to shape the flow before the tide comes in and you end up like King Canute
11 hours ago
Would it really take 5 years to develop RAR compression and decompression? That seems an extreme overestimate. I don't know the internals of the compressor or decompressor, but that seems really high.
10 hours ago
Yeah, sounds closer to a 5-week thing, if you know what you're doing.
4 hours ago
Well, it is every version of RAR. Documenting the quirks of rar 1.4, 1.5, 2.0, 2.9, 3.0, 4.0, 5.0 and 7.0, multiple compression strategies, PPMd, RARVM, compression levels, encryption, multi volume support, a huge test corpus, round trips for compatibility... The spec docs are linked.
10 hours ago
5 weeks gets you a decompressor for 1 version. If this supports multiple versions of RAR, then writing the decompressors alone for all of them is probably a year of work.
11 hours ago
> It’s sloppy, it’s slow, it’s almost two megabytes in size and somewhat worse than WinRAR on compression.
As mathematicians say, optimization is left as an exercise for the reader. You did the hard part.
9 hours ago
I mean, not really...? A vibecoded mess that runs badly, that's not really the hard part for something like compression/decompression tools.
9 hours ago
What's your easy way of reverse engineering every previous version of this file format, if you don't think that was the hard part?
7 hours ago
Wouldn’t previous human-coded implementations already be in LLMs’ training sets? OP article even mentions a number of previous implementations. And even recent LLM “clean room” designs seem to gloss over how the final LLM implementer still has access to previous implementations’ code.
Still, OP claims to have done the best job to date at creating (via AI) specs, and the non-optimal Rust implementation, so a net gain?
6 hours ago
I don’t believe any of the extant open source rar implementations cover the range of features and versions OP’s does. I think that’s the point - OP’s isn’t the cleanest or fastest implementation, but it is the most broad open source version available.
11 hours ago
> But, it works, and the world now has a free software RAR implementation.
Does it? How are you legally intending to use copyright to license this machine output? How would you know it's not encumbered in any way?
8 hours ago
In all seriousness, why should anyone care?
I always found software IP to be absurd, but this is a particularly absurd situation. We're talking here about a small utility tool implemented from scratch and open sourced, with no apparent intent to make any money from it.
Are you concerned about the "encumbrance" of using "unlicensed" tools to manipulate .doc, or .pdf, or .mp3 files?! Well I'm not, and if anyone ever tried to sue me for improper access to their proprietary formats, I'll show them some old testament impropriety.
7 hours ago
Judges tend to frown on old testament impropriety. And corporations tend to frown on employees who draw the ire of judges
4 hours ago
I generally don't anyway. Since the WTFPL came out I've been licensing under that with a warranty clause (don't blame me).
My main goal here was an experiment to see how far I could push the technique, and learn things along the way. Regardless of whether people dare to use it commercially or not, we have interoperability for the foreseeable future. As an archivist/computing historian I think that's important.
11 hours ago
Really unsure why this is getting downvoted, to my understanding this is a massive, unsettled concern.
It wasn't even a disasm/pseudocode to formal spec flow, and then a separate human implementation. The same human has been in the loop throughout, and large parts of it were generated directly.
It's basically guaranteed tainted.
Edit: I should have skimmed a bit more patiently, there was in fact no "disasm/pseudocode + the human getting tainted" part to this apparently.
10 hours ago
I read the post you're replying to as saying "this is copyright-encumbered and nonfree because it's a derivative work of everything in Claude's and GPT-5.5's training corpus", which is an argument I find fairly tiresome. (Realistically, if courts actually rule that this is the case, this tiny little project will be the least of anyone's concerns.)
"This is copyright-encumbered and nonfree because it's a derivative work of the legacy RAR binaries" is a different argument (and seems like it depends on details of the setup that were somewhat glossed over in the post).
7 hours ago
I also am skeptical of the "LLM output is derivative of everything in the training corpus" argument in general, but in this specific case I think it may have more merit. If the model was trained on unrar source code, and obtained specific information about the RAR format from that code which it then used in the code generation step, then the output is arguably tainted because of that.
6 hours ago
Does the source-available UnRAR do anything that the existing FOSS implementations can't do? IIUC the interesting part of this particular project is that it supports really old versions of the file format that were never publicly documented anywhere.
10 hours ago
The point is, excepting current legal standards which are already very murky, how can _you_ claim copyright, if you don't _know_ it isn't encumbered?
You can get these LLMs to generate copyrighted outputs both intentionally and accidentally. This is a known fact; therefore, if you're not checking the output to see if this has occurred then you're potentially generating legal risks for yourself and anyone who uses your code.
To not only ignore this for your own use case but to then release the code under a proclaimed license seems legally problematic if not ethically concerning.
If you did get sued for infringement I can't imagine that your defense would be that you find the argument tiresome? Honestly, do you think this would never happen, or how would you go about defending your actions here?
9 hours ago
What do you mean by "checking the output"? Is there some kind of check the author says he didn't do that you think he should have? Or is your claim that using an LLM for coding is always copyright infringement? If so, I think the risk that I'll personally be the test case that resolves whatever ambiguities exist in the law is basically zero, and I don't think derailing the thread to be about that topic enlightens anyone.
8 hours ago
> What do you mean by "checking the output"?
At the very least you could see if it's already been open sourced under a different license. If you take GPL code and just slap MIT on it do you not consider that a violation?
> Or is your claim that using an LLM for coding is always copyright infringement?
I'm claiming you cannot really know.
> I'll personally be
It may be someone who uses or redistributes your code in any fashion.
> derailing the thread
I've made two posts. One with an idea and the second clarifying it. This is not "derailing the thread" under any sane definition. This is simply a complicated and relatively unexplored topic that clearly draws a lot of interest and resulting conversation from the crowd here.
I think using this type of bullying rhetoric damages that conversation and harms the reputation of Hacker News in general and I always regret it when I see it.
4 hours ago
I didn't actually read any code. I generated spec documents using Claude, then later on used Codex to generate from the spec docs. Are the specs tainted? If someone else independently develops from my spec, is that also tainted? What if they hear it second hand? It's an interesting legal situation for sure.
5 hours ago
> Really unsure why this is getting downvoted
Because it’s a boring argument that we’re not going to make progress on until it is actually tested in court.
Also, if/when this is tested, the court's options seem to be (a) say yeah this is fine, or (b) cause unending havoc that, if followed through on, would destroy the economy (a precedent that any org whose proprietary code made it into AI training data could sue any org that was using code generated by that model? Do the math on how many suits that is.)
10 hours ago
The human wasn't looking at the copyrighted code and was giving high level steering instructions. If you look at the spec generated it doesn't look like a derivative work of the copyrighted material. The program was generated from the spec. It seems mostly fine from my perspective.
9 hours ago
If I use a decompiler on existing binaries, then some machine translation utility to turn that into a different language, that still feels like a derivative work, even if no human were reviewing the specifics.
8 hours ago
The idea is to make it so that the parts of the output that are derived from the existing binary are not themselves eligible for copyright protection. I.e., factual descriptions of the file format, without any implementation details from the binary.
3 hours ago
> but it works.
Are you sure "it works?"
11 hours ago
How do we know it's actually correct?
4 hours ago
I compressed thousands of files, went through libarchive's and Sembiance's test data at least for the decompressor side. I recompressed the files, and round-tripped them against 7zip, unrar, every later version of winrar. It failed a lot at the start, and codex burned a lot of tokens instrumenting the binaries and dividing and conquering until things settled down and round-trips worked properly.
I can't really say it works in every case as I honestly didn't spend that much time on it. But it works in the majority of cases. There's likely some nasty bugs hiding in there.
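A minimal sketch of what one such round-trip check looks like (the `rars` CLI name and arguments here are stand-ins, and the unrar flags are from memory):

    // Compress with the implementation under test, extract with the
    // reference unrar, and require byte-exact output.
    use std::fs;
    use std::process::Command;

    fn round_trip(input: &str) -> std::io::Result<bool> {
        let original = fs::read(input)?;

        // Archive the file with our tool (hypothetical CLI).
        Command::new("rars").args(["a", "test.rar", input]).status()?;

        // Have the reference implementation print the member to stdout.
        let out = Command::new("unrar")
            .args(["p", "-inul", "test.rar", input])
            .output()?;

        Ok(out.stdout == original) // byte-exact or it doesn't count
    }

    fn main() -> std::io::Result<()> {
        println!("round-trip ok: {}", round_trip("corpus/sample.bin")?);
        Ok(())
    }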
11 hours ago
By using it.
11 hours ago
It works == it's correct?
11 hours ago
Yes? What do you think fuzzing, unit testing, integration testing is for? It's an empirical evaluation of correctness. Literally just try and see.
For actual correctness verification in the strong sense, you'd need to start from a specification written in a formal language so that it's machine checkable, which if I had to guess not even win.rar GmbH has.
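A minimal fuzz target in that spirit, assuming cargo-fuzz and a hypothetical rars::decompress entry point:

    #![no_main]
    use libfuzzer_sys::fuzz_target;

    // Throw arbitrary bytes at the decoder: any Ok or clean Err is fine,
    // but a panic, hang, or memory error counts as a bug.
    fuzz_target!(|data: &[u8]| {
        let _ = rars::decompress(data);
    });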
10 hours ago
You're being needlessly dismissive.
From a philosophical perspective, there's no way to know that any piece of software is truly correct without formal verification.
But in the present, non-philosophical context, it's obvious that what we mean is, colloquially, "how well-tested is this against a variety of edge-case files which the official winrar handles correctly? Is there a test suite, and how robust is it? Plenty of software that claims to be compatible with the rar format, doesn't actually successfully read all rar files."
It's also equally obvious, in the present context, that we would prefer these steps to have been taken by the author of the software before we install it and run it on our own computers and data. The parent commenter wasn't just asking about the software's correctness for the sake of academic curiosity.
8 hours ago
The post mentions the existence of an extensive test suite, which you can peruse for yourself if you're so inclined: https://github.com/bitplane/rars/tree/master/crates/rars-for...
I don't know how all these test cases were generated, but at least some of them seem to have been copied (with attribution) from the test suites of earlier FOSS RAR implementations.
The ideal would be to test it against a representative corpus of real-world legacy RAR files, but I'm not sure where you'd find one.
8 hours ago
pirate bay
10 hours ago
I hope the developers of, say, the brakes in my car don't interpret 'software correctness' the way you do.
Added, later: hey you changed your comment, added a whole paragraph.
10 hours ago
I added the second paragraph about formal verification at the same time you posted, in anticipation that you'd immediately dig your heels into it otherwise, despite me highlighting that the other methods are merely empirical.
I was immediately proven right once I pressed "update". That said, I have now deleted my snarky response that followed. Not in the game of capitalizing off of the human equivalent of a race condition.
I should make a browser addon to delay posting, this is the 2nd time this happens in the past few days.
Edit:
Nevermind, it's already a feature built into the site. Turned it on. I wonder if it applies to edits also...
Nope, doesn't seem to. Oh well, should still help.
10 hours ago
Haha, of course! The three major sources of software failures: off-by-one errors and race conditions.
8 hours ago
race off by one conditions
10 hours ago
I hope the brakes in my car don't need developers
10 hours ago
I think you underestimate the complexity of modern braking systems.
10 hours ago
ABS doesn't just appear organically.
10 hours ago
They used to. Now they have systems, standards, and experience. There are only so many ways you can do brakes on the car.
11 hours ago
This is Rust we're talking about. It doesn't even need to work; as long as it compiles, it's correct.
10 hours ago
use std::fs::File;
use std::io::prelude::*;
fn main() -> std::io::Result<()> {
let mut file = File::create("content.txt")?;
file.write_all(b"3!")?;
Ok(())
}
10 hours ago
; cat content.txt
3!
10 hours ago
> This is Rust we're talking about. It doesn't even need to work; as long as it compiles, it's correct.
No, it doesn't even need to compile. The mere fact that it's in Rust means it's correct.
10 hours ago
Thus all software that can be used is correct?
You know what I meant: How can we have confidence that this implementation of RAR is functionally identical to what it's based on? What would give me the confidence to use it in a critical piece of infrastructure?
9 hours ago
Validating compression systems is usually really straightforward. There are 3 layers - decode known values from compressed files (or encode, same), round trip without any alterations, and fuzzing with arbitrary binaries
Because it's a defined format there can be binary exact comparisons between the input and output files - we already have an oracle in the form of proper RAR format software, so if they are identical, you don't need to look further for that specific case.
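A minimal sketch of the round-trip and fuzzing layers, with flate2 standing in for the codec under test (swap in the real entry points, and run the same inputs through the reference binary for the oracle comparison):

    use flate2::read::GzDecoder;
    use flate2::write::GzEncoder;
    use flate2::Compression;
    use std::io::{Read, Write};

    fn compress(data: &[u8]) -> Vec<u8> {
        let mut enc = GzEncoder::new(Vec::new(), Compression::default());
        enc.write_all(data).unwrap();
        enc.finish().unwrap()
    }

    fn decompress(data: &[u8]) -> std::io::Result<Vec<u8>> {
        let mut out = Vec::new();
        GzDecoder::new(data).read_to_end(&mut out)?;
        Ok(out)
    }

    #[test]
    fn round_trips_without_alteration() {
        // Layer 2: compress then decompress must be the identity.
        let data = b"arbitrary payload".to_vec();
        assert_eq!(decompress(&compress(&data)).unwrap(), data);
    }

    #[test]
    fn rejects_garbage_cleanly() {
        // Layer 3, minimal form: malformed input errors, never panics.
        assert!(decompress(b"not a valid stream").is_err());
    }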
You can see a version of this that I did quite similarly, for postgresql wire format, here: https://github.com/pgdogdev/pgdog/tree/main/integration/sql
It validates that sql with the same setup, teardown, and test results in perfectly exact compatibility between raw postgresql as the control and various configurations of PgDog, with both the text format and binary format, so ultimately a 6-way multivariate test that should always result in binary-exact results.
9 hours ago
Right, that's very different from "using it" and it's also different from "Have an LLM generate code that compiles".
9 hours ago
> Thus all software that can be used is correct?
You also know what I meant, since I spelled it out in more detail a comment later. But even though you're being facetious, yes, that really is the case. If it works it works. That's the bar for the vast, vast majority of software, and has been since forever. Demonstrated practical correctness. If you stumble into a bug, you log it as a defect and then either wait for a fix or fix it yourself depending. That's all that regular people ever have. In the case of this project, this was achieved via fuzz testing.
It's literally no different to e.g. validating the NTFS driver that ships in the Linux kernel, or validating any other (re)implementation of anything. You just do a bunch of empirical testing and hope for the best. It is also why reimplementations always lag behind, which I'm not suggesting is not a real concern (or that defects wouldn't be). It's just not a gotcha.
Hell, I'm 99% sure this is exactly what the actual vendor does too, or at least I sure hope that they do have tests at least. Cause they're sure as shit not using a formally verified compiler toolchain, meaning they definitely don't have a formal proof about whether even the official implementation in itself is correct. Only empirical data at best too.
8 hours ago
> You just do a bunch of empirical testing and hope for the best.
I get that this is often the case, but it does feel like we should be able to do better. At least when humans write this code you can have the expectation that there was real intent behind making sure the semantics of the code are aligned with the specification. At least with current language models, they tend to just brute-force test suite acceptance until everything passes, in a way no human developer has the capacity for. Of course this is often how it works with humans too (i.e. the classic Oracle story), but it does feel wrong.
Can we be sure that this method has produced a correct artefact without years of extensive usage? Probably not, hence my reluctance to rely on something like this, at least initially.
7 hours ago
I do see your concern, even share in it, it's just not really tractable at its core.
There's a lot of chatter lately e.g. about using TLA+ for formal modeling, so that anything downstream can be formally proven. That helps, but then the formal model still needs to be crafted somehow, which means a pass of semantic interpretation.
Going from binary to spec mechanistically via formal proofs would be possible, but only if there was a formal spec for the binary structure and the ISA available. In practice, both are just natural language prose too however, meaning another interpretation pass or two. The ISA specs also keep a lot implementation-defined / undefined afaik, for microarchitecture-level optimization freedom.
Netlists, PDK, and the likes then might be public for some RISC-V designs these days, but to get the actual chip behavior requires EM simulation typically on a scale that is not possible for any chip performant enough to be of interest. And RISC-V is not a very broadly adopted platform for proprietary consumer software.
Having the human do the semantic mapping is expensive and legally fraught. Having an LLM do it is more risk, but way, way cheaper and currently legally grey. And both can and do make mistakes.
This is why I see this so bleakly. That said, I do also think formats like this are delicate enough that even rudimentary empirical testing should provide a surprisingly decent behavioral coverage. There's a reason that "I can't believe anything ever works at all" is such a common sentiment. Practical usage is a surprisingly powerful gate, and fuzzing in particular is basically that on steroids.
I do nevertheless still secretly get the heebie-jeebies from the Linux NTFS implementation though (me bringing that up was no coincidence).
9 hours ago
It could be correct but way too slow in edge cases (unlikely with Rust, but you never know), leaking temporary files, having security holes, etc.
There's much more about correctness of a piece of software than: "produces the same output as the original on x test cases".
I'm not saying it's a bad implementation and, if anything, LLMs are much better at translating/porting existing code (and finding bugs) than at writing things unheard of.
You're basically saying, if I may make a pun: "rust me bro, it's correct".
4 hours ago
Yeah the main things are DoS attacks and path traversal issues. I intentionally guarded against these with resource limits and checks, but I can't guarantee that it's safe. I mean, basically anyone who carefully reads it knows more about it than me - you play the AI slot machine at this scale and who knows what prizes you'll win!
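For the traversal side, the check has a standard shape; a sketch (not the project's actual code):

    use std::path::{Component, Path, PathBuf};

    // Reject archive entry names that would escape the destination:
    // absolute paths and anything containing a ".." component.
    fn safe_join(dest: &Path, entry_name: &str) -> Option<PathBuf> {
        let entry = Path::new(entry_name);
        let escapes = entry.is_absolute()
            || entry
                .components()
                .any(|c| matches!(c, Component::ParentDir | Component::Prefix(_)));
        if escapes { None } else { Some(dest.join(entry)) }
    }

    fn main() {
        let dest = Path::new("/tmp/extract");
        assert!(safe_join(dest, "docs/readme.txt").is_some());
        assert!(safe_join(dest, "../../etc/passwd").is_none()); // blocked
        assert!(safe_join(dest, "/etc/passwd").is_none()); // blocked
    }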
11 hours ago
Kudos, this is a really cool project (even if it might be AI generated). I have starred the repo (3rd starrer here).
One thing I have been curious about: is there any way to stop a rar compression midway and then continue it later?
Like, suppose I have a compression running on a large file; would it be possible with this project to shut down the computer mid-compression and continue after starting it up again?
I would really love it if you can add this functionality!
4 hours ago
Thanks!
I guess you could save the state to a file on SIGINT, flush what's been written and pick it back up again if the state file exists when you restart, and use the CRCs of the files to abort if things have changed. I don't fancy doing that for so many versions of RAR, but it would be a cool feature to add it to an `xz` fork. I like the idea.
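A rough sketch of that loop, just to show the shape (the state format, block loop, and use of the ctrlc crate are all illustrative; a real version would persist the CRCs too):

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;

    const TOTAL_BLOCKS: u64 = 1000;

    fn compress_block(_block: u64) {
        // Stand-in for compressing one block of input.
    }

    fn main() {
        let interrupted = Arc::new(AtomicBool::new(false));
        let flag = interrupted.clone();
        ctrlc::set_handler(move || flag.store(true, Ordering::SeqCst))
            .expect("failed to install SIGINT handler");

        // Resume from a previous checkpoint if the state file exists.
        let start: u64 = std::fs::read_to_string("compress.state")
            .ok()
            .and_then(|s| s.trim().parse().ok())
            .unwrap_or(0);

        for block in start..TOTAL_BLOCKS {
            if interrupted.load(Ordering::SeqCst) {
                // Flush what's been written and persist progress.
                std::fs::write("compress.state", block.to_string()).unwrap();
                return;
            }
            compress_block(block);
        }
        let _ = std::fs::remove_file("compress.state"); // finished cleanly
    }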
a minute ago
Thanks! It would still be interesting to see this added to xz, but given that LLMs were able to create the rars project, I suppose it might not be that difficult to eventually add it to the rar format too. Starting with xz probably makes the most sense if you like the idea, though.
Another idea for the rar format that I would love to hear your opinion on: there are sometimes multiple parts, .part01, .part02, .part03 and so on.
I have found that when you try to unrar it, it requires all the parts to be present.
It would be really beneficial IMO if there was some way to just unrar .part01 without requiring the contents of .part02, .part03, etc.
But from my very limited understanding, you also need some of the contents (I think the last parts) of all the files for the decompression to work.
Would it be possible to arrange things so that you don't need all the parts themselves, just a small piece from the end of each? I am not sure compression algorithms allow for that, but it feels like something that might be possible, albeit difficult, with the rar format.
I would be curious to hear your opinion on it. Thanks for responding, and I would be really interested in seeing the xz fork you mentioned!
11 hours ago
> and it almost earned me an OpenAI ban
Were you flagged for a cybersecurity violation?
11 hours ago
> Well, it turned out that at some time during spec investigation, Claude needed to understand authenticity verification which is a paid feature. With a context full of reverse engineering tools it cracked WinRAR and bypassed product registration, then dutifully documented its crimes in the spec. The docs, when viewed, triggered OpenAI’s alarms and stopped it dead in its tracks. I squashed this out of the git history, and decided not to implement the feature at all.
You can draw your own conclusions as to what this says about the state of agentic development.
10 hours ago
Finally, a sane and enjoyable read about a coding project. Feel like it’s been months since we had one of these that wasn’t filled to the brim with bluesky/mastodon-flavored whining about AI.
Kudos to the author. A fun read, thank you for sharing.
10 hours ago
For everyone out there whining about AI, there's one of you whining about being anti-AI.
Maybe just cut the unprompted whining?
10 hours ago
Would be great, but then it's a saturation game, and the other side doesn't have any compelling reason to hold back the same way. So it's contingent on how fair the platform is, and what nonverbal, out of band options remain.
HN is better than most in this regard thanks to community flagging, but even then there's a lot of it. Ultimately, it'd seem that the ratio you're describing skews a whole lot more towards the anti-ai sentiment side, than towards the anti-anti-ai one (or towards a stalemate). Or rather, that the latter sentiment is not common enough necessarily to thwart such comments. And so you see it reflected verbally instead.
7 hours ago
Good luck with keeping it online. Somebody built `rar-stream` with Rust, and its GitHub is no longer there.
10 hours ago
Rar is proprietary. Good luck.
9 hours ago
https://law.stackexchange.com/a/83552
I suppose the question is whether the author had ever entered into a contract limiting reverse engineering...
8 hours ago
If they read the source code of unrar, or in this case used genai (which obviously included unrar source code in its training set), then yes. You can check the agreement for the unrar source code release.
10 hours ago
Rar means weird in Norwegian and adorable in Swedish. Just an anecdote.
10 hours ago
It means hello in dinosaur
10 hours ago
Those almost sound like antonyms, which is ironic given how closely related the two languages are.
10 hours ago
Rare, in Romanian.