Large Text Compression Benchmark

31 points by redeux 2 days ago | 15 comments

hyperpape

2 hours ago

It's worth noting that the benchmark has been updated less frequently over the past several years, and some of the listed compressor versions are quite far behind the current implementations (http://www.mattmahoney.net/dc/text.html#history).

For the one instance I double-checked (zstd), I don't recall it making a massive difference, but it did make a difference (iirc, the current version compressed slightly smaller than what was listed in the benchmark).

pella

2 hours ago

Agreed:

- in the benchmark: zstd 0.6.0 (Apr 13, 2016)

- vs. the latest: zstd v1.5.6 (Mar 30, 2024, https://github.com/facebook/zstd/releases)
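
For anyone who wants to re-check, a minimal sketch using the python-zstandard binding (the level and the local enwik9 path are assumptions, not the benchmark's exact settings):

    import zstandard as zstd

    # Compress enwik9 with the bundled zstd at the max standard level.
    # Running this under an old and a new python-zstandard build shows
    # how much the compressed size has moved between library versions.
    print("zstd version:", zstd.ZSTD_VERSION)

    data = open("enwik9", "rb").read()      # the 10^9-byte benchmark file
    compressed = zstd.ZstdCompressor(level=22).compress(data)
    print(len(compressed), "bytes,", f"{len(compressed) / len(data):.4f} ratio")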

londons_explore

2 minutes ago

Only really worth updating if the results are meaningfully different though.

Since zstd's format is fixed, I doubt there will be massive changes in compression ratio.

nick238

an hour ago

Double your compression ratio for the low, low price of ~100,000x slower decompression (zstd: 215 MB, 2.2 ns/byte vs. nncp: 106 MB, 230 µs/byte)!
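
Back-of-the-envelope, using the numbers above:

    # Rough check of the ~100,000x figure.
    zstd_ns = 2.2        # zstd decompression, ns per byte
    nncp_ns = 230e3      # nncp, 230 us per byte = 230,000 ns per byte

    print(nncp_ns / zstd_ns)                 # ~105,000x

    # Over the 10^9-byte enwik9 input, that works out to roughly:
    print(1e9 * zstd_ns / 1e9, "seconds for zstd")        # ~2.2 s
    print(1e9 * nncp_ns / 1e9 / 3600, "hours for nncp")   # ~64 h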

The neural network architectures are technically impressive, but unless there's some standard compression dictionary that works for everything (so the training/compression costs amortize down to nil), and silicon architecture changes dramatically toward compute-in-memory, I don't know if this will ever take off. Lossy compression would probably provide huge advantages, but then you need to be domain-specific and can't just slap it on anything.

abecedarius

35 minutes ago

The point of the contest was not compression tech for communication and storage. It was a belief that minimizing log loss on large corpora (i.e. message length) was key to AI progress. That should remind you of some pretty hot, more recent developments.

Here was someone else with a similar pov in the 2000s: https://danburfoot.net/research.html
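
The log-loss/message-length identity is just Shannon coding: an arithmetic coder driven by a model that assigns probability p to the actual next symbol spends about -log2(p) bits on it, so the total compressed size is the model's cross-entropy on the text. A minimal sketch (the uniform model here is a placeholder for any predictor):

    import math

    def ideal_code_length_bits(text, model):
        """Sum of -log2 P(symbol | context) -- what an arithmetic
        coder would spend, up to rounding, using `model`."""
        bits = 0.0
        for i, ch in enumerate(text):
            p = model(text[:i], ch)   # P(next == ch | prefix)
            bits += -math.log2(p)
        return bits

    # Placeholder model: uniform over 256 byte values -> 8 bits/byte,
    # i.e. no compression. A better predictor means fewer bits.
    uniform = lambda prefix, ch: 1 / 256
    print(ideal_code_length_bits("hello world", uniform) / 8, "bytes")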

Xcelerate

an hour ago

> unless there's some standard compression dictionary that works for everything

There is, at least for anything worth compressing: it’s called the halting sequence. And while it exists, it’s unfortunately also uncomputable, haha.

lxgr

an hour ago

One interesting application could be instant messaging over extremely bandwidth constrained paths. I wouldn’t be surprised if Apple were doing something like this for their satellite-based iMessage implementation.

Of course it could also just be a very large “classical” shared dictionary (zstd and brotli can work in that mode, for example).
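
For reference, that shared-dictionary mode is exposed directly in the python-zstandard binding; a minimal sketch, assuming both endpoints ship the same dictionary (the sample text is made up, and a real deployment would likely use a dictionary trained on representative traffic via zstd.train_dictionary):

    import zstandard as zstd

    # Both endpoints embed the same dictionary; only small frames cross
    # the link. Here it's raw representative text rather than a trained
    # dictionary, which zstd supports via DICT_TYPE_RAWCONTENT.
    shared = zstd.ZstdCompressionDict(
        b"omw running late sorry see you soon ok sounds good where are you",
        dict_type=zstd.DICT_TYPE_RAWCONTENT,
    )

    msg = b"running late, sorry! see you soon"
    frame = zstd.ZstdCompressor(dict_data=shared).compress(msg)
    assert zstd.ZstdDecompressor(dict_data=shared).decompress(frame) == msg
    print(len(msg), "->", len(frame), "bytes")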

jeffbee

an hour ago

It does seem as though the training cost of these models ought to be included in their times, because the compressors are useless for any purpose other than compressing English Wikipedia.

gwern

36 minutes ago

In these cases, the models are so small, and have to run on such limited CPU resources, that the training time is going to be fairly negligible. The top-listed program, nncp, is just 0.06B parameters! https://bellard.org/nncp/nncp_v2.pdf#page=4

gliptic

34 minutes ago

The training cost is included because these NN compressors are running the training while they are compressing. They are general compressors, although the specific variants submitted here may be tuned a little to Wikipedia (e.g. the pre-processors).
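
Concretely, the loop looks something like the sketch below (schematic, with hypothetical model and coder objects, not nncp's actual code): the model predicts for the arithmetic coder, then trains on the symbol it just saw, and the decoder starts from the same initial weights and replays identical updates, so no weights are ever transmitted.

    def compress(symbols, model, coder):
        """Online NN compression: 'training' happens during the single
        pass over the data, which is why it is already counted in the
        benchmark's compression time."""
        context = []
        for s in symbols:
            probs = model.predict(context)   # P(next symbol | context)
            coder.encode(s, probs)           # costs ~ -log2(probs[s]) bits
            model.train_step(context, s)     # one gradient step on s
            context.append(s)
        return coder.finish()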

pama

2 hours ago

It would be nice to also have a competition of this type where, within reasonable limits, the size of the compressor does not matter and the material to be compressed is hidden and varied over time. For example: up to 10 GB compressor size, with the dataset being a different random chunk of FineWeb every week.

stephantul

28 minutes ago

That’s an interesting idea. I wonder whether it would actually lead to radically different solutions, or just similar ones with bigger codebooks.

pmayrgundter

2 hours ago

The very notable thing here is that the best method uses a Transformer, and no other entry does.

wmf

an hour ago

Text compression and text generation are both based on modeling, so the best approach to text modeling works for both.