divbzero
7 months ago
This seems like a lot of added complexity for limited gain. Are there cases where gzip and br at their highest compression levels aren’t good enough?
pmarreck
7 months ago
Every piece of information or file that is compressed sends a dictionary along with it. In the case of, say, many HTML or CSS files, this dictionary data is likely nearly completely redundant.
There's almost no added complexity since zstd already handles separate compression dictionaries quite well.
pornel
7 months ago
The standard compressed formats don't literally contain a dictionary. The decompressed data becomes its own dictionary while it's being decompressed. This makes the first occurrence of any pattern less efficiently compressed (though it's usually still compressed somewhat, thanks to entropy coding), and then it becomes cheap to repeat.
Brotli has a default dictionary with bits of HTML and scripts. This is built into the decompressor, not sent with the files.
The decompression dictionaries aren't magic. They're basically a prefix for decompressed files, so that a first occurrence of some pattern can be referenced from the dictionary instead of built from scratch. This helps only with the first occurrences of data near the start of the file, and for all the later repetitions the dictionary becomes irrelevant.
The dictionary needs to be downloaded too, and you're not going to have dictionaries all the way down, so you pay the cost of decompressing the data without a dictionary whether it's a dictionary + dictionary-using-file, or just the full file itself.
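To make the "prefix for decompressed files" point concrete, here's a minimal sketch using Python's stdlib `zlib`, which supports preset dictionaries via the `zdict` parameter (the HTML strings are made up for illustration):

```python
import zlib

# A preset dictionary acts like a decompressed prefix that back-references
# can point into. Here the dictionary matches the start of the payload.
dictionary = b"<!doctype html><html><head><meta charset='utf-8'><title>"
payload = b"<!doctype html><html><head><meta charset='utf-8'><title>Hi</title></head><body></body></html>"

# Without a dictionary, the first occurrence of every pattern is paid in full.
plain = zlib.compress(payload, 9)

# With a dictionary, that first occurrence becomes a cheap back-reference.
c = zlib.compressobj(level=9, zdict=dictionary)
with_dict = c.compress(payload) + c.flush()

# The decompressor must supply the same dictionary.
d = zlib.decompressobj(zdict=dictionary)
restored = d.decompress(with_dict)

assert restored == payload
assert len(with_dict) < len(plain)
```

Note that the savings only apply to the first occurrences: once a pattern has appeared in the output, later repetitions reference the output itself, and the dictionary no longer matters.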
yorwba
7 months ago
> The dictionary needs to be downloaded too
Which is why the idea is to use a previous version of the same file, which you already have cached from a prior visit to the site. You pay the cost of decompressing without a dictionary, but only on the first visit. Basically it's a way to restore the benefits of caching for files that change often, but only a little bit each time.
zvr
7 months ago
Of course, the Brotli default (built-in) dictionary is infamous for containing strings like "Holy Roman Emperor", "Confederate States", "Dominican Republic", etc., due to the way it was created. One can see the whole dictionary in https://gist.github.com/duskwuff/8a75e1b5e5a06d768336c8c7c37....
A dictionary built from the actual content to be compressed will end up looking very different.
pmarreck
7 months ago
> The dictionary needs to be downloaded too, and you're not going to have dictionaries all the way down
We already have a way to manage this: Standardizing and versioning dictionaries for various media types (also with a checksum), and then just caching them locally forever, since they should be immutable by design.
To prevent an overgrowth of dictionaries with small differences, we could require each one to be an RFC.
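A rough sketch of what "checksum-keyed, immutable, cached forever" could look like as a content-addressed store (the layout and function names here are my own assumptions, not part of any standard):

```python
import hashlib
import pathlib
import tempfile

# Hypothetical content-addressed dictionary cache: entries are keyed by their
# SHA-256 digest, so they are immutable by construction and safe to keep forever.
CACHE_DIR = pathlib.Path(tempfile.gettempdir()) / "dict-cache"

def store_dictionary(data: bytes) -> str:
    """Store a dictionary under its checksum; never overwritten once present."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(data).hexdigest()
    path = CACHE_DIR / digest
    if not path.exists():
        path.write_bytes(data)
    return digest

def load_dictionary(digest: str) -> bytes:
    """Fetch a dictionary by checksum, verifying integrity before use."""
    data = (CACHE_DIR / digest).read_bytes()
    assert hashlib.sha256(data).hexdigest() == digest
    return data

digest = store_dictionary(b"<html><head><meta charset='utf-8'>")
assert load_dictionary(digest) == b"<html><head><meta charset='utf-8'>"
```

Because the key is derived from the content, two sites referencing the same standardized dictionary would share a single cached copy.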
ks2048
7 months ago
Some examples here: https://github.com/WICG/compression-dictionary-transport/blo...
show significant gains from using a dictionary over compressing without one.
It seems like instead of sites reducing bloat, they will just shift the bloat to your hard drive. Some of the examples mentioned a 1 MB dictionary, which doesn't seem big on its own but could add up if everyone is doing this.
sltkr
7 months ago
That demonstrates how useless this is. It only shaves off kilobytes on extremely bloated sites that waste megabytes of data.
For example, take the CNN example:
> The JavaScript was 98% smaller using the previous version as a dictionary for the new version than if the new version was downloaded with brotli alone. Specifically, the 278kb JavaScript was 90kb with brotli alone and 2kb when using brotli and the previous version as a dictionary.
Oh wow! 98% savings! That's amazing! Except in absolute terms the difference between 90 KB and 2 KB is only 88 KB. Meanwhile, cnn.com pulls in 63.7 MB of data just on the first page load. So in reality, that 88 KB saved was less than 0.14% of the total data, which is negligible.
yorwba
7 months ago
What makes you think this would stop working if applied to 63.7 MB of JavaScript instead of just one file?
bsmth
7 months ago
If you're shipping a JS bundle, for instance, that gets small, frequent updates, this should be a good use case. There's a test site accompanying the explainer that looks useful for estimates: https://use-as-dictionary.com/generate/
wat10000
7 months ago
In some applications, there’s no “good enough,” even small gains help and can be significant when multiplied across a large system. It’s like the software version of American Airlines saving $40,000/year by removing one olive from their salads.