hyperpape
10 months ago
It's worth noting that the benchmark has been updated less frequently over the past several years, and some listed versions of compressors are quite far behind the current implementations (http://www.mattmahoney.net/dc/text.html#history).
In the one instance I double-checked (zstd), I don't recall the update making a massive difference, but it did make one (IIRC, the current version compressed slightly smaller than what was listed in the benchmark).
pella
10 months ago
Agreed:
- in the benchmark: zstd 0.6.0 (Apr 13, 2016)
- vs. the latest: zstd v1.5.6 (Mar 30, 2024, https://github.com/facebook/zstd/releases)
shaicoleman
10 months ago
- zstd v0.6.0 -22 --ultra: 215,674,670 bytes
- zstd v1.5.6 -22 --ultra: 213,893,168 bytes (~0.826% smaller)
adgjlsfhk1
10 months ago
How does the speed compare?
londons_explore
10 months ago
It's only really worth updating if the results are meaningfully different, though.
Since zstd's format is fixed, I doubt there will be massive changes in compression ratio.
namibj
10 months ago
Ehhh, it's the decompressor that's specified, and there's considerable wiggle room, à la trellis-coded modulation in radio line coding, due to the interplay of the dictionary coder with the entropy coder. Basically, there are many moments where the dictionary coder has to make non-obvious decisions about how to code a sequence, and what matters is that the content as a whole has the least entropy, not the least length when fed to the entropy coder.
For simple codings like the TLV bitstream in QR codes (the bit length of the L field depends on the T value and on the QR code size, i.e., on the maximum possible bitstream length)[0], you can afford to solve for all possibilities via e.g. dynamic programming, with some mild heuristics to terminate search branches that can't possibly code shorter given the TL overhead (a toy sketch of such a DP follows the footnote below).
But with an entropy coder like zstd's FSE/tANS, that's not remotely as trivial. Making a choice changes the symbol/byte histogram of the entropy coder's input sequence, which, if it still differs after tANS quantization, can change the length of the coded bitstream. The problem is the non-local effect that any individual coding decision has on every symbol's coded size.
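To make the non-locality concrete, here's a contrived toy sketch (not zstd's actual parser or token format; ideal Shannon entropy stands in for the tANS-coded length). The two parses are assumed to cover the same input; the one with more tokens entropy-codes smaller because its match-length histogram is more concentrated:

  # Toy illustration: parse choice changes the token histogram, and the
  # histogram (not the token count) drives the entropy-coded size.
  from collections import Counter
  from math import log2

  def ideal_bits(tokens):
      """Ideal entropy-coded size of a token sequence, in bits."""
      n = len(tokens)
      return sum(-c * log2(c / n) for c in Counter(tokens).values())

  # 'L' = literal, ('m', k) = match of length k. Parse A takes the longest
  # matches; parse B (hypothetically covering the same input) splits the
  # length-8 and length-12 matches into length-4 pieces, so nearly every
  # match token is identical.
  parse_a = ["L", ("m", 8), "L", ("m", 4), ("m", 12), "L", ("m", 4)]
  parse_b = ["L", ("m", 4), ("m", 4), "L", ("m", 4),
             ("m", 4), ("m", 4), "L", ("m", 4)]

  print(f"A: {len(parse_a)} tokens, {ideal_bits(parse_a):.2f} bits")  # 7 tokens, ~12.9 bits
  print(f"B: {len(parse_b)} tokens, {ideal_bits(parse_b):.2f} bits")  # 9 tokens, ~8.3 bits

A real optimal parser has to estimate exactly this feedback loop, since every local decision shifts the histogram that every other token is coded against.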
BTW, friendly reminder that all-caps URLs code shorter into QR codes (they qualify for alphanumeric mode's 11 bits per 2 characters instead of byte mode's 8 bits per character), so the code will have larger pixels or be smaller. The sketch below puts numbers on this.
[0]: There are 5 main modes (i.e., T values): ([0-9])+, coded by stuffing 3 digits into 10 bits, with trailing 2/1 digits going into 7/4 bits, respectively; ([0-9]|[A-Z]|[:space:]|[\$\%\*\+\-\/\.\:])+, coded by stuffing 2 characters into 11 bits; ISO-8859-1, coded by stuffing one character into 8 bits; Kanji, stuffing each character into 13 bits; and an ECI mode for, IIRC, generic Unicode codepoints. I think the other 3 code points in T are used for the legacy barcode FNC1 and FNC2, as well as an end-of-content marker.
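Here's a toy sketch of that mode-selection DP (simplified and assumption-laden: numeric/alphanumeric/byte modes only, version 1-9 header sizes, per-character fractional costs in 1/6-bit units, and no round-up for odd-length tails, so it slightly underestimates real bit counts). It also puts numbers on the all-caps tip above:

  # Toy per-character DP for QR mode selection (not a real encoder).
  # Costs are in sixths of a bit so numeric (10/3 bits/char) and
  # alphanumeric (11/2 bits/char) stay integral.
  ALNUM = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:")
  MODES = ("numeric", "alphanumeric", "byte")
  CHAR_COST = {"numeric": 20, "alphanumeric": 33, "byte": 48}  # 1/6-bit units
  HEADER_BITS = {"numeric": 4 + 10, "alphanumeric": 4 + 9, "byte": 4 + 8}  # T + L fields

  def usable(ch, mode):
      if mode == "numeric":
          return ch.isdigit()
      if mode == "alphanumeric":
          return ch in ALNUM
      return True  # byte mode takes anything (ISO-8859-1 assumed)

  def min_bits(text):
      """Approximate minimal bitstream length over all mode segmentations."""
      INF = float("inf")
      # cost[m]: cheapest cost (1/6-bit units) of the prefix so far, ending in mode m
      cost = {m: HEADER_BITS[m] * 6 for m in MODES}
      for ch in text:
          new = {}
          for m in MODES:
              if not usable(ch, m):
                  new[m] = INF
                  continue
              # either continue the current segment, or switch modes and
              # pay a fresh T+L header
              stay = cost[m]
              switch = min(cost[o] for o in MODES if o != m) + HEADER_BITS[m] * 6
              new[m] = min(stay, switch) + CHAR_COST[m]
          cost = new
      return min(cost.values()) / 6

  print(min_bits("HTTPS://EXAMPLE.COM"))  # ~117.5 bits: stays in alphanumeric mode
  print(min_bits("https://example.com"))  # 164.0 bits: lowercase forces byte mode

Keeping only the cheapest cost per mode at each position is what lets the DP discard branches that can't possibly code shorter; a real encoder also has to respect the per-version L-field widths and the group round-ups.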
adgjlsfhk1
10 months ago
I think the compression ratio will be a bit better, but compression and decompression speed will be a lot better.