arjie
11 hours ago
The fact that ZIP files include the catalog/directory at the end is such nostalgia fever. Back in the day it meant that if you naïvely downloaded the file, a partial download would be totally useless. Fortunately, in the early 2000s, we got HTTP's Range and a bunch of zip-aware downloaders that would fetch the catalog first so that you could preview a zip you were downloading and even extract part of a file! Good times. Well, not as good as now, but amusing to think of today.
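As a rough sketch of what such a zip-aware downloader has to do first: fetch only the tail of the file (for example with a request header like `Range: bytes=-65557`) and look for the end-of-central-directory record in it. The function name `find_central_directory` is made up for illustration, and the sketch assumes a plain, non-ZIP64, single-disk archive.

```python
import struct

EOCD_SIG = b"PK\x05\x06"   # end-of-central-directory signature
EOCD_FMT = "<4sHHHHIIH"    # fixed 22-byte part of the record

def find_central_directory(tail: bytes):
    """Return (offset, size, entry_count) of the central directory, or None.

    `tail` is the last chunk of the archive. 65557 bytes is enough to cover
    the 22-byte record plus the largest possible archive comment (65535).
    Assumption: no ZIP64 records, and the signature does not also occur
    inside the archive comment.
    """
    pos = tail.rfind(EOCD_SIG)
    if pos < 0 or pos + 22 > len(tail):
        return None
    (_sig, _disk, _cd_disk, _entries_here, entries_total,
     cd_size, cd_offset, _comment_len) = struct.unpack_from(EOCD_FMT, tail, pos)
    return cd_offset, cd_size, entries_total
```

With the offset and size in hand, a second Range request for exactly those bytes would return the whole catalog, which is enough to list the contents before downloading any file data.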
st_goliath
9 hours ago
> ... a partial download would be totally useless ...
No, not totally. The directory at the end of the archive points backwards to local headers, which in turn include all the necessary information, e.g. the compressed size inside the archive, the compression method, the filename and even a checksum.
If the archive isn't some recursive/polyglot nonsense as in the article, it's essentially just a tightly packed list of compressed blobs, each with a neat local header in front (that even includes a magic number!); the directory at the end is really just for quick access.
If your extraction program supports it (or you are sufficiently motivated to cobble together a small C program with zlib....), you can salvage what you have by linearly scanning and extracting the archive, somewhat like a fancy tarball.
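A minimal sketch of that linear scan, in Python rather than C with zlib. The function name `salvage` is hypothetical, and the sketch assumes every entry records its sizes in the local header (no streaming data descriptors, no ZIP64, no encryption) and is either stored or deflated.

```python
import struct
import zlib

LOCAL_SIG = b"PK\x03\x04"      # local file header magic number
LOCAL_FMT = "<4sHHHHHIIIHH"    # fixed 30-byte local file header

def salvage(data: bytes):
    """Yield (name, content) for each complete entry, scanning from the start."""
    pos = 0
    while pos + 30 <= len(data) and data[pos:pos + 4] == LOCAL_SIG:
        (_sig, _version, flags, method, _time, _date, crc,
         csize, _usize, name_len, extra_len) = struct.unpack_from(LOCAL_FMT, data, pos)
        if flags & 0x08:
            break  # sizes are in a trailing data descriptor; not handled here
        start = pos + 30 + name_len + extra_len
        end = start + csize
        if end > len(data):
            break  # this entry was cut off by the truncation
        name = data[pos + 30:pos + 30 + name_len].decode("utf-8", "replace")
        raw = data[start:end]
        if method == 8:        # deflate, stored as a raw stream (no zlib header)
            content = zlib.decompress(raw, -15)
        elif method == 0:      # stored without compression
            content = raw
        else:
            break              # some other method; give up in this sketch
        if zlib.crc32(content) == crc:
            yield name, content  # entries failing the checksum are skipped
        pos = end
```

The loop stops at the first thing that is not a local header, which in an intact archive is the central directory and in a truncated one is wherever the download ended.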
brabel
37 minutes ago
> the directory at the end is really just for quick access.
No, its purpose was to allow archives spanning multiple floppy disks. You would insert the last disk, then the other ones, one by one…
tonyedgecombe
4 hours ago
XPS (Microsoft's alternative to PDF) supported this. XPS files were ZIP files under the hood and were handled directly by some printers. The problem was the printer never had enough memory to hold a large file so you had to structure the document in a way it could be read a page at a time from the start.
nwallin
8 hours ago
At work, our daily build (actually 4x per day) is a handful of zip files totaling some 7GB. The script to get the build would copy the archives over the network, then decompress them into your install directory.
This worked great on campus, but when everyone went remote during COVID it didn't anymore. It went from three minutes to like twenty minutes.
However, most files change only rarely. I don't need all the files, just the ones which are different. So I wrote a scanner thing which compares each zip entry's file size and checksum to the size and checksum of the local file. If they're the same, we skip it; otherwise, we decompress it out of the zip file. This cut the time to get the daily build from 20 minutes to 4 minutes.
Obviously this isn't resilient to an attacker, crc32 is not secure, but as an internal tool it's awesome.
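Something along these lines, as a sketch using Python's standard `zipfile` module rather than the actual internal tool. The names `crc32_of_file` and `extract_changed` are invented for the example, and it assumes the archive is opened from a local or mounted path.

```python
import os
import zipfile
import zlib

def crc32_of_file(path, chunk_size=1 << 20):
    """CRC-32 of a file on disk, read in chunks to keep memory use flat."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc

def extract_changed(zip_path, dest_dir):
    """Extract only entries whose size or CRC-32 differs from the local copy.

    Returns the list of entry names that were (re)extracted.
    """
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            local = os.path.join(dest_dir, info.filename)
            if (os.path.isfile(local)
                    and os.path.getsize(local) == info.file_size
                    and crc32_of_file(local) == info.CRC):
                continue  # same size and checksum: treat as unchanged
            zf.extract(info, dest_dir)
            extracted.append(info.filename)
    return extracted
```

The size comparison comes first so that most changed files are caught without reading them; only same-size files pay for a full checksum pass.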
Karliss
9 hours ago
A partial zip shouldn't be totally useless, and a good unzip tool should be able to repair such partial downloads. In addition to the catalog at the end, zip files also have local headers before each file entry. So unless you are dealing with a maliciously crafted zip file, or a zip file combined with something else, parsing it from the start should produce an identical result. Some zip parsers even default to sequential parsing.
This redundant information has led to multiple vulnerabilities over the years, since a maliciously crafted zip file with conflicting headers can have two different interpretations when processed by two different parsers.
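To illustrate the kind of disagreement involved, here is a small sketch that reads the catalog with Python's `zipfile` and then checks each entry's local header for a different filename. The function name `conflicting_names` is hypothetical, and the check covers only names, not sizes or other fields.

```python
import io
import struct
import zipfile

def conflicting_names(data: bytes):
    """Return (central_name, local_name) pairs where the two headers disagree."""
    conflicts = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            pos = info.header_offset
            header = data[pos:pos + 30]
            if len(header) < 30 or header[:4] != b"PK\x03\x04":
                # The catalog points at something that is not a local header.
                conflicts.append((info.orig_filename, None))
                continue
            flags, = struct.unpack_from("<H", header, 6)      # general purpose flags
            name_len, = struct.unpack_from("<H", header, 26)  # filename length
            raw = data[pos + 30:pos + 30 + name_len]
            # Bit 11 of the flags marks UTF-8 names; otherwise assume CP437.
            encoding = "utf-8" if flags & 0x800 else "cp437"
            local_name = raw.decode(encoding, "replace")
            if local_name != info.orig_filename:
                conflicts.append((info.orig_filename, local_name))
    return conflicts
```

A parser that trusts the catalog and one that scans sequentially would report different filenames for any entry this returns.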
EvanAnderson
8 hours ago
Partial downloads weren't useless, though, as other commenters have said.
The PKZIP tools came with PKZIPFIX.EXE, which would scan the file from the beginning and rebuild a missing central directory. You could extract any files up to the truncated file where your download stopped.
halapro
7 hours ago
I hate that the most common video container on the web does this too. Most non-"stream-ready" mp4 files lack even the basic information such as height/width until the file has completed loading.[1]
[1]: https://forum.videohelp.com/threads/393096-Fixing-Partially-Download-MP4-Files
cat_plus_plus
8 hours ago
Well, what do you want it to do? It doesn't know the full directory with offsets until it's done compressing, and a dispersed directory would have a lousy access pattern for quick listing. And you know, if you are compressing you probably want the smallest file, so duplicate directories are not ideal.