Large Text Compression Benchmark

97 points | posted 10 months ago
by redeux

55 Comments

hyperpape

10 months ago

It's worth noting that the benchmark has not been updated as frequently for the past several years, and some versions of compressors are quite far behind the current implementations (http://www.mattmahoney.net/dc/text.html#history).

In the one instance I double-checked (zstd), I don't recall it making a massive difference, but it did make a difference (IIRC, the current version produced slightly smaller output than what was listed in the benchmark).

pella

10 months ago

Agreed:

- in the benchmark: "zstd 0.6.0" (Apr 13, 2016)

- vs. the latest zstd v1.5.6 (Mar 30, 2024, https://github.com/facebook/zstd/releases)

shaicoleman

10 months ago

- zstd v0.6.0 -22 --ultra: 215,674,670 bytes

- zstd v1.5.6 -22 --ultra: 213,893,168 bytes (~0.826% smaller)
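
For anyone who wants to check the arithmetic, a one-liner using the sizes quoted above:

    # Quick sanity check of the quoted delta (sizes copied from the figures above).
    old, new = 215_674_670, 213_893_168
    print(f"{(old - new) / old:.3%} smaller")  # -> 0.826% smaller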

adgjlsfhk1

10 months ago

How does the speed compare?

londons_explore

10 months ago

It's only really worth updating if the results are meaningfully different, though.

Since zstd's format is fixed, I doubt there will be massive changes in compression ratio.

namibj

10 months ago

Ehhh, it's the decompressor that's specified, and there's considerable wiggle room, à la trellis-coded modulation in radio line coding, due to the interplay of the dictionary coder with the entropy coder. Basically, there are many moments where the dictionary coder has to make non-obvious decisions about how to code a sequence, and what matters is that the whole content codes to the fewest bits overall, not that each choice is the shortest when fed to the entropy coder.

For simple codings like the TLV bitstream in QR codes (the L field's bit length depends on the T and on the QR code size, i.e. the maximum possible bitstream length)[0], you can afford to solve for all possibilities via e.g. dynamic programming, with some mild heuristics to terminate search branches that can't possibly code shorter due to the TL overhead.

But with an entropy coder like zstd's FSE/tANS, that's not remotely as trivial. Making a choice changes the symbol/byte histogram for the entropy coder's input sequence, which, if it's still different after tANS quantization, can change the length of the coded bitstream. The problem is the non-local effect that any individual coding decision has on every symbol's coded size.

BTW, friendly reminder that all-caps URLs will code shorter into QR codes, so the code will have larger pixels or be smaller.

[0]: There are 5 main modes (i.e., T values): ([0-9])+ coded by stuffing 3 digits into 10 bits and trailing 2/1 digits into 7/4 bits, respectively; ([0-9]|[A-Z]|[:space:]|[\$\%\*\+\-\/\.\:])+ coded by stuffing 2 characters into 11 bits; ISO-8859-1 coded by stuffing one character into 8 bits; Kanji by stuffing each character into 13 bits; and an ECI mode for (IIRC) generic Unicode code points. I think the other 3 code points in T are used for the legacy-barcode FNC1 and FNC2, as well as an end-of-content marker.
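
A minimal sketch of just the numeric-mode packing described in the footnote (the mode indicator, L field and terminator of a real QR encoder are left out):

    def qr_numeric_payload_bits(digits: str) -> str:
        # Pack a decimal string the way QR numeric mode does: each group of 3
        # digits becomes 10 bits; a trailing group of 2 digits becomes 7 bits,
        # and a single trailing digit becomes 4 bits.
        bits = []
        for i in range(0, len(digits), 3):
            group = digits[i:i + 3]
            width = {3: 10, 2: 7, 1: 4}[len(group)]
            bits.append(format(int(group), f"0{width}b"))
        return "".join(bits)

    print(qr_numeric_payload_bits("01234567"))  # 8 digits -> 10 + 10 + 7 = 27 bits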

adgjlsfhk1

10 months ago

I think the compression ratio will be a bit better, but the compression and decompression time will be a bunch better.

nick238

10 months ago

Double your compression ratio for the low, low price of 100,000x slower decompression (zstd: 215 MB at 2.2 ns/byte vs. nncp: 106 MB at 230 µs/byte)!

The neural network architectures are technically impressive, but unless there's some standard compression dictionary that works for everything (so the training/compression costs amortize down to nil), and silicon architecture changes dramatically toward compute-in-memory, I don't know if this would ever take off. Lossy compression would probably provide huge advantages, but then you need to be domain-specific and can't slap it on anything.

abecedarius

10 months ago

The point of the contest was not compression tech for communication and storage. It was a belief that minimizing log loss on large corpora (i.e. message length) was key to AI progress. That should remind you of some pretty hot more-recent developments.

Here was someone else with a similar pov in the 2000s: https://danburfoot.net/research.html

optimalsolver

10 months ago

The issue with this paradigm (see also: Hutter Prize) was its insistence on lossless compression.

Intelligence is just as much about knowing what to throw away as what to keep. Any nontrivial cognitive system operating in a nontrivial physical environment will necessarily have a lossy model of the world it's embedded in.

Also of interest, compressionism, a theory of mind based on data compression:

https://ceur-ws.org/Vol-1419/paper0045.pdf

sdenton4

10 months ago

You can turn any lossy compression scheme into a lossless scheme by encoding the error... The lossy scheme should be aiming for low error, making the error encoding progressively smaller as the lossy scheme improves.
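
A toy sketch of that construction (the quantizer and step size are arbitrary stand-ins for "some lossy scheme", and zlib stands in for "some entropy coder for the residual"):

    import zlib
    import numpy as np

    STEP = 16  # arbitrary coarseness for the toy "lossy" stage

    def lossy_encode(x):
        return np.round(x / STEP).astype(np.int16)

    def lossy_decode(q):
        return q.astype(np.int16) * STEP

    x = np.random.default_rng(0).integers(-1000, 1000, size=4096).astype(np.int16)
    q = lossy_encode(x)
    residual = x - lossy_decode(q)  # small values; the better the lossy stage, the cheaper this gets
    packed = (zlib.compress(q.tobytes()), zlib.compress(residual.tobytes()))

    # Lossless reconstruction = lossy output + coded error.
    x_hat = lossy_decode(np.frombuffer(zlib.decompress(packed[0]), dtype=np.int16)) \
            + np.frombuffer(zlib.decompress(packed[1]), dtype=np.int16)
    assert np.array_equal(x_hat, x)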

What's harder to deal with from a measurement perspective is semantic equivalence: calling some kinds of errors zero-cost without having a great way to characterize what, exactly, is being lost. But it's kinda what you want for really extreme compression: the content is equivalent at a high level, but may be a very different byte stream.

igorkraw

10 months ago

Group theory is the abstraction you want

jebarker

10 months ago

Can you explain? I'm struggling to see how group theory is relevant here

igorkraw

9 months ago

This part

>What's harder to deal with from a measurement perspective is semantic equivalence: calling some kinds of errors zero-cost without having a great way to characterize what, exactly, is being lost. But it's kinda what you want for really extreme compression: the content is equivalent at a high level, but may be a very different byte stream.

Is basically saying

> What's harder is defining a reconstruction process in terms of a "semantic group", i.e. an output encoding and associated group actions under which the loss is invariant, with the group actions expressing the concept of two non-identical outputs being "equivalent for the purposes of downstream processing".

Taco Cohen is one of the pioneers of this line of research, and invariant, equivariant, and approximately invariant/equivariant architectures are a big thing in scientific and small-data ML.

sdenton4

9 months ago

"an output encoding and associated group actions under which the loss is invariant"

What loss is that, exactly?

One of the difficulties in speech processing is that we generally don't have a great model for speech quality or equivalence. The human hearing system is a finicky and particular beast, and furthermore varies from beast to beast, depending both on physiology and culture. Good measures (e.g., ViSQOL, a learned speech quality metric) tend to be helpful for measuring progress when iterating on a single system, but can give strange results when comparing different systems.

So it's easy to imagine (say) pushing the generated speech into some representation space and measuring nearness in that space (either absolute or modulo a group action), but that begs the question of whether nearness in that space really represents semantic equivalence, and of how to go about constructing it in the first place.

Let alone why one would bother allowing some group symmetries into the representation when we plan to define a loss invariant under those symmetries... Throwing group theory at speech representations feels like a solution in search of a problem, as someone who has worked a lot with group theory and speech compression.

igorkraw

9 months ago

Group theory is just a way to think about what properties a function needs to have under specific actions on the data. If you want to train a speech network that should be peak-amplitude invariant (to pick a random example), you can normalise the amplitudes, modify your loss to be invariant to it, or modify the network output to be invariant to it. These have different numerical tradeoffs (e.g., one of the reasons people use equivariant architectures with a final invariant aggregator is that it allows each layer to use and propagate more information, and one of the reasons graph neural networks are a thing is that we don't always have a canonical labeling).
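
A toy numpy illustration of that amplitude example (my own sketch, not tied to any particular paper): the same invariance can be obtained by canonicalizing the input or by building it into the loss.

    import numpy as np

    def normalize(x):
        # Canonicalize away peak amplitude: x and a*x (a > 0) map to the same signal.
        return x / np.max(np.abs(x))

    def amplitude_invariant_loss(x, y):
        # Equivalently, bake the invariance into the loss itself.
        return np.mean((normalize(x) - normalize(y)) ** 2)

    x = np.sin(np.linspace(0, 4 * np.pi, 256))
    assert np.isclose(amplitude_invariant_loss(x, 3.7 * x), 0.0)  # a scaled copy counts as identical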

All the stuff you mentioned is true, but thinking about it in an abstract sense, that means there's a set of universal symmetries and a set of highly context-dependent symmetries, and group theory is AFAIK our best method of thinking about them rigorously -- as I say in the child comment, not the end point, but our current best starting point (in my opinion).

jebarker

9 months ago

Thanks, I need to read more. I get how a semantic group could occur but it doesn't seem obvious to me that groups and group actions are necessarily the correct abstraction for representing semantic similarity.

igorkraw

9 months ago

It's not the right one on its own, but it's the closest we have right now for talking about things like "high-level equivalence" rigorously. For example, one way to talk about logic is that it's invariant under double negation -- so any compression algorithm that does "semantically lossless compression" can exploit that property. Deep learning then becomes an algorithm for finding features in the data which remain invariant under irrelevant transformations, and representations useful for semantic similarity could then be formulated as a very big and complex composite group, potentially conditional... There's probably a new theory of hierarchical similarity and irreducible collisions/errors required, but as of now, group theory, approximate equivariance, and category theory are the bleeding edge of thinking about this rigorously while still having SOTA benchmarks / real-world applicability, to my knowledge.

Happy to discuss more via email as well

sfink

10 months ago

I tend to agree, but is that really a fatal flaw? A lossy compression scheme that gets it 90% right only has to encode the delta for the remaining 10%. Which is a big cost, sure, but the alternative is evaluating how important the thrown-away stuff is, and that evaluation is rife with subjective value judgements. There are no right answers, only differing flavors of wrong ones. It's the question of what is important to generalize over, and nobody is ever going to agree on that.

abecedarius

10 months ago

Your issue applies equally to GPT-3 (and AFAIK its successors): the log-likelihood loss the GPT base model was trained on is exactly the "lossless compression" length of the training set (up to trivial arithmetic-coding overhead).
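
Back-of-the-envelope version of that equivalence (the 2.3 nats/token and 10^9 tokens below are made-up numbers, not figures from the thread): mean cross-entropy in nats, times token count, divided by ln 2, is the implied lossless size in bits.

    import math

    def compressed_size_bytes(mean_loss_nats_per_token: float, num_tokens: int) -> float:
        # An arithmetic coder driven by the model needs loss/ln(2) bits per token,
        # plus a constant couple of bits of overhead for the whole stream.
        return num_tokens * mean_loss_nats_per_token / math.log(2) / 8

    print(f"{compressed_size_bytes(2.3, 10**9) / 1e9:.2f} GB")  # ~0.41 GB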

The question I'd raise is why the route that got traction used more ad-hoc regularization schemes instead of the (decompressor + compressed) length from this contest and the book I linked.

vintermann

10 months ago

> Intelligence is just as much about knowing what to throw away as what to keep.

Sure, but you can see it as two steps. Decide what doesn't matter at all, and throw it away. Then compress the rest losslessly (according to how surprising it is, basically the only way)

I think a theory of mind based on data compression is backwards, for this reason. When you have "data", you've already decided what's important. Every time you combine "A" and "B" into a set, you have already decided:

1. That they are similar in some way you care about

2. That they're distinct in some way you care about (otherwise, it would be the same element!)

... and you do this every time you even add two numbers, or do anything remotely more interesting with "data". There is no "data" without making a-scientific statements about what's important, what's interesting, what's meaningful, what matters etc.

theendisney

10 months ago

You could throw away 95% of enwiki without losing anything of value. That sounds like a joke, but it would make the result worth reading, which makes it sound like an exercise worth doing.

Xcelerate

10 months ago

> unless there's some standard compression dictionary that works for everything

There is, at least for anything worth compressing. It’s called the halting sequence. And while it exists, it’s also uncomputable unfortunately haha.

Vecr

10 months ago

Apparently there are ways of getting the first few binary digits, but it's useless at that length.

Xcelerate

10 months ago

Not useless. That’s most of mathematics right there.

HappMacDonald

10 months ago

What precisely are you lot referring to? I'm familiar with the halting problem, and glancingly familiar with Ω / Chaitin's constant(s). But I'm not familiar with any kind of uncomputable sequence that we know how to get the first few binary digits of, that has profound things to say about math, and that is somehow related to the halting problem. And trying to Google it doesn't help, as I just get search-result noise about the halting problem itself with no reference to any special kind of sequence.

Vecr

10 months ago

It's Ω / Chaitin's constant for a particular encoding. It's uncomputable as a whole, but maybe we will get a couple more bits.

HappMacDonald

10 months ago

Alright then, how does that help anyone with text compression, and/or how would computing its digits reveal any deep secrets of mathematics?

Xcelerate

10 months ago

See my other reply. It’s not Chaitin’s constant but the mutual information between ZFC and the halting sequence, which is precisely I(ZFC : H) bits.

If you had H, you could optimally compress any non-random string of data of length n in O(n) time. The rough sketch of how you would do this is to create brute-force search programs and then, instead of actually running them, just look up whether they halt via the halting sequence. By chaining together a few of these μ-operator functions, you can build up the whole compressed string one bit at a time.

Since we only have a few bits of H, we can’t compress at that level of magic, but what we do have still represents everything computable that ZFC can describe, which for the most part includes every algorithm we use in daily life.

Xcelerate

10 months ago

> What precisely are you lot referring to? [...] glancingly familiar with Ω / Chaitin's constant

Huh? I'm not referring to Chaitin's constant; the first few digits of that aren't going to help with much. I'm referring specifically to the halting sequence H as I mentioned in my post. See Vitányi's "Kolmogorov's Structure Functions and Model Selection" or any of Levin's papers on algorithmic complexity. (Vitányi uses a definition for H where a program's input is the empty tape rather than itself, but this is mainly a difference in convention).

You can take a formal logical theory F (e.g., Q, PA, ZFC, ...) and since the proofs are recursively enumerable, iterate over all proofs equivalent to the statement "p_i halts" or "p_i never halts" in the language of F for program p_i (as indexed lexicographically on a reference universal Turing machine).

As an example, for any halting program p, an unbounded search through PA will eventually find a proof that p halts (assuming soundness of PA). But there are certain non-halting programs for which PA cannot prove the non-halting behavior whereas ZFC can. So in terms of mutual algorithmic information I(a : b) = K(a) + K(b) - K(a, b), PA's inability to prove the non-halting behavior of p implies I(p : PA) = 0 whereas I(p : ZFC) > 0 since ZFC can prove this. More specifically K(p) - K(p | ZFC) > 0, given that the description of ZFC is minimal.

The above implies that I(ZFC : H) > I(PA : H), where H is the halting sequence, so we can say that ZFC is "more powerful" than PA in its ability to prove non-halting behavior of Turing machines. But a minimal description of ZFC and PA is still a finite amount of information, and in fact Chaitin's Incompleteness theorem states that for any formal logical system F, there is a constant C_F beyond which F cannot prove any string has higher Kolmogorov complexity than C_F. I suspect C_F is pretty close to K(F), but I haven't seen a proof of that fact anywhere. This is essentially another way to state Gödel's first incompleteness theorem, with additional detail about the limit of what formal system F can or cannot do.

So when I was referring to the few binary digits that correspond to "most of mathematics", I'm talking about the I(ZFC : H) bits of mutual algorithmic information between ZFC and H. We know there is a 745 state binary Turing machine that encompasses all of ZFC’s behavior (see Scott Aaronson's posts about this), but many people think the minimal size of a machine that computes all that ZFC can prove about the behavior of Turing machines is far smaller than that.

ezekiel68

10 months ago

This is a good point, and yet -- as high as position 24 (out of 208) in the list, we have "bsc-m03 0.4.0" at only ~160 ns per byte compressed (compared with most entries above it being two or three orders of magnitude slower). Below position 24, there's one in the 20s of ns per byte. It looks like the cost was in the size of the compression program. Not sure if there was any "cheating" going on, hunting for benchmark scores with these two famous test bodies of text.

By comparison, the top contender clocks in at ~240,000 ns per byte (making your point).

wqweto

10 months ago

We use bsc and nanozip for archiving DB backups (usually 1-100GB files) and have not found anything remotely close to these in terms of CPU usage vs. compression ratio.

    c:> bsc.exe e "%~1" "%~1.bsc" -b1000 -m4e1t -M4H20
This uses 3GB of RAM and compresses at about 300MB/s, but is slow on decompression, while

    c:> nz.exe a -cd -p2 -m1024m "%~1.nz" "%~1"
uses about 1GB of RAM and compresses at about 250MB/s with a worse compression ratio, but is much faster than bsc on decompression.

Both of these smash zstd and 7-zip on compression and speed -- something like 2x better and 10x faster.

lxgr

10 months ago

One interesting application could be instant messaging over extremely bandwidth constrained paths. I wouldn’t be surprised if Apple were doing something like this for their satellite-based iMessage implementation.

Of course it could also just be a very large “classical” shared dictionary (zstd and brotli can work in that mode, for example).
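
A sketch of zstd's dictionary mode via the python-zstandard bindings (the "shared" bytes here are a made-up stand-in; a real deployment would train a dictionary on a message corpus with zstandard.train_dictionary and ship it to both ends in advance):

    import zstandard  # the 'zstandard' package

    # Raw "content" dictionary that both sides are assumed to already hold.
    shared = zstandard.ZstdCompressionDict(
        b"omw running late be there in min meet at the usual place see you at ")

    cctx = zstandard.ZstdCompressor(level=19, dict_data=shared)
    dctx = zstandard.ZstdDecompressor(dict_data=shared)

    msg = b"running late, be there in 10 min"
    blob = cctx.compress(msg)
    assert dctx.decompress(blob) == msg
    print(len(msg), "->", len(blob), "bytes")  # the dictionary helps most on tiny messages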

londons_explore

10 months ago

Unfortunately, while you want to have encryption on your messages, the required authentication signature eclipses the message size.

Without signing the messages (and just using a stream cipher), you could do it, but an adversary can send garbage pretending to be you, which I don't think is acceptable in the modern world.

Also, the message headers (message length, from, to) start to dominate, and compressing them is hard (usually depending on various hardware tricks like differing CDMA keys).

lxgr

10 months ago

For such low-bandwidth applications, you'll usually want to do the key exchange while you have a faster connection, as well as build a dictionary of expected recipients etc.

Once you have pre-exchanged symmetric keys and IVs etc., encryption can be done with zero overhead, and you can choose your own trade-off between authentication security and message size. Even 4 bytes go a long way (an attacker would need to send on the order of 2^31 messages over a very bandwidth-constrained channel to impersonate you), and 8 bytes make it safe enough for practically all applications.

That way, you can keep the authentication overhead very small (on the order of a few bytes), similar to how it's done for e.g. SRTP.
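
A sketch of the truncated-tag idea using a plain HMAC (the key and message below are placeholders; nonces/replay protection, which you'd still want, are ignored here):

    import hashlib
    import hmac

    def tag(key: bytes, message: bytes, tag_len: int = 4) -> bytes:
        # Truncate a full HMAC-SHA256 tag: a forger has to guess 8*tag_len bits
        # per attempt, and each attempt costs a message over the slow link.
        return hmac.new(key, message, hashlib.sha256).digest()[:tag_len]

    key = b"pre-exchanged 32-byte symmetric key."  # placeholder
    msg = b"omw, eta 18:40"
    wire = msg + tag(key, msg)                     # only 4 bytes of overhead
    rx_msg, rx_tag = wire[:-4], wire[-4:]
    assert hmac.compare_digest(rx_tag, tag(key, rx_msg))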

geysersam

10 months ago

Depends on the application! For long term storage of rarely accessed text it might make sense

jeffbee

10 months ago

It does seem as though the training cost of these models ought to be included in their times, because the compressors are useless for any purpose other than compressing English Wikipedia.

gliptic

10 months ago

The training cost is included because these NN compressors are running the training while they are compressing. They are general compressors, although the specific variants submitted here may be tuned a little to Wikipedia (e.g. the pre-processors).

gwern

10 months ago

In these cases, the models are so small, and have to run on such limited CPU, that the training time is going to be fairly negligible. The top-listed program, nncp, has just 0.06B parameters! https://bellard.org/nncp/nncp_v2.pdf#page=4

londons_explore

10 months ago

Imagine if we didn't have dedicated hardware for, say, caching -- every CPU instruction or bit of data had to be read from SSD on every use.

We'd see a 100,000x slowdown I expect.

But dedicated hardware (RAM, various cache levels) solved this.

We could do the same for neural-net compression if we needed to. But the question is: do enough people care about a 2x data-size reduction?

wmf

10 months ago

Dedicated hardware isn't magic because there's an intrinsic cost that cannot be reduced. In the case of neural nets you have to store the weights somewhere and you have to perform the multiplications.

stjo

9 months ago

The current best solution is written by Fabrice Bellard, probably more famous for his other works like FFmpeg, QEMU and TCC: https://bellard.org/. Check him out -- pretty much everything coming from him is pure gold.

TacticalCoder

9 months ago

When people say there are no 10x programmers, I name Fabrice Bellard. Him being a 100x programmer invalidates the logic of those trying to argue that 10x programmers don't exist.

sfink

10 months ago

I'd kind of like to see some sample LLM entries in the list (as in, at the bottom of the list). Not because they'd ever be competitive, but because it would underscore how they burn massive amounts of space for the model with little concern for how much of it is "necessary" for predicting text.

(I'm not sure how you would optimally use an LLM for a lossless compression task. I guess you could convert the input to a stream of single-bit "yes, it guessed right" tokens and multi-bit "no, the actual token is X" tokens, and then compress that stream with a more traditional entropy encoder? That way the LLM decompressor would always be working off a valid prefix. But there are probably more clever ways.)

Sesse__

10 months ago

> I'm not sure how you would optimally use an LLM for a lossless compression task.

You have the model output probabilities for each token, and then use arithmetic coding driven by that model. (This uses optimally few bits per token, under the assumption that your model is right.) This is how all the winning models work, AFAIK, except they frequently work per-bit instead of per-token.
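
A minimal sketch of the "optimally few bits" part (the prob_model interface is hypothetical, not any specific library): the total an arithmetic coder needs is essentially the sum of -log2 P(token | prefix), i.e. exactly the model's log loss on the input.

    import math

    def ideal_code_length_bits(tokens, prob_model):
        # prob_model(prefix) is assumed to return a dict of next-token
        # probabilities summing to 1. A real arithmetic coder lands within a
        # couple of bits of this total for the whole stream.
        return sum(-math.log2(prob_model(tokens[:i])[tok])
                   for i, tok in enumerate(tokens))

    # Toy model: uniform over a 256-symbol alphabet -> exactly 8 bits per token.
    uniform = lambda prefix: {b: 1 / 256 for b in range(256)}
    print(ideal_code_length_bits(list(b"hello"), uniform))  # 40.0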

sfink

10 months ago

Oh duh, that makes a lot more sense. Thanks!

jonas21

10 months ago

The top entry is based on a Transformer architecture. It's basically an LLM, but trained on the input instead.

pmayrgundter

10 months ago

The very notable thing here is that the best method uses a Transformer, and no other entry does

londons_explore

10 months ago

Transformers run well on GPUs or other hardware accelerators. This benchmark doesn't allow GPUs.

That makes it more of a "can I use unsuitable hardware to get the job done fast and accurately enough" challenge, rather than a pure math puzzle of how to encode data in fewer bytes.

I suspect that's why there is only one Transformer entry, and to me it raises the question of whether the rules should be updated to allow GPUs now that they are fairly commonplace.

HybridCurve

10 months ago

I think it might be a moot point, since the transformer runtimes scale very poorly and the algorithm has a symmetric runtime (decompression costs as much as compression).

wmf

10 months ago

Text compression and text generation are both based on modeling so the best approach to text modeling works for both.

sfink

10 months ago

Not if part of your definition of "best" includes codebook/model size. The best text generator will use a massive model. That won't help compression beyond a certain point, since the cost of lookup indexes grows with the size of the model. "Monkey #384714...872 typewriter output" doesn't help when the monkey number is longer than the input.

pama

10 months ago

It would be nice to also have a competition of this type where, within reasonable limits, the size of the compressor does not matter and the material to be compressed is hidden and varied over time. For example: up to 10GB compressor size, and the dataset is a different random chunk of FineWeb every week.

stephantul

10 months ago

That’s an interesting idea. I wonder whether it would actually lead to radically different solutions, but with bigger codebooks.