> Most Parquet files are bulk exports from various data analysis pipelines or databases, often appearing as full snapshots rather than incremental updates
I'm not really familiar with how datasets are managed by them, but all of the table formats (Iceberg, Delta, and Hudi) support appending and some form of "merge-on-read" deletes that could help with this use case. Instead of fully replacing datasets on each dump, more granular operations could be performed. The catch is that this requires changing pipelines and some extra knowledge about the datasets themselves.
A fun idea might be taking a table format like Iceberg and, instead of using Parquet to store the data, just storing the column data with the metadata defined externally somewhere else. On each new snapshot, a set of transformations (sorting, splitting blocks, etc.) could be applied that minimizes the potential byte diff against the previous snapshot.
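To make the "minimize the byte diff" part concrete, here's a toy sketch (pure Python with made-up fixed-width records, not a real columnar format): if the pipeline dumps rows in arbitrary order, nearly every chunk of the serialized snapshot changes even when only one row did, whereas sorting by a stable key first keeps unchanged rows in identical byte runs.

```python
import random

random.seed(1)
rows_v1 = [(i, random.randrange(10**9)) for i in range(10_000)]
rows_v2 = rows_v1[:]
rows_v2[5000] = (5000, 123456789)  # one row updated between snapshots
random.shuffle(rows_v2)            # pipeline emits rows in arbitrary order

def serialize(rows, sort):
    ordered = sorted(rows) if sort else rows
    # fixed-width records so chunk boundaries align across snapshots
    return b"".join(f"{k:08d},{v:012d}\n".encode() for k, v in ordered)

def chunk_set(blob, size=220):  # 10 records per chunk
    return {blob[i:i + size] for i in range(0, len(blob), size)}

v1 = chunk_set(serialize(rows_v1, sort=True))
shared_sorted = len(v1 & chunk_set(serialize(rows_v2, sort=True)))
shared_raw = len(v1 & chunk_set(serialize(rows_v2, sort=False)))
print(shared_sorted, shared_raw)  # sorted snapshots share almost all chunks
```

With sorting, only the one chunk containing the updated row differs; without it, the shuffled byte stream shares almost nothing with the previous snapshot, even though the logical data is nearly identical.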
I'm surprised that Parquet didn't maintain the Arrow practice of using mmap-able relative offsets for everything, although its offsets could be considered relative to the beginning of the file.
I believe Parquet predates Arrow. That's probably why.
Wouldn't it be easier to extend delta-rs to support deduplication?
Can you elaborate? As I understand it, Delta Lake provides transactions on top of existing data and effectively stores "diffs" because it knows what each transaction did. But when you only have regular snapshots, it's much harder to figure out the effective diff, and that is where deduplication comes in. (Quite like how git actually stores a snapshot of every file version, but very aggressively compressed.)
Love this post and the visuals! Great work
How does this compare to rsync/rdiff?
Great question! Rsync also uses a rolling-hash / content-defined chunking approach to deduplicate and reduce communication, so it behaves very similarly.
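For the curious, the rolling-hash idea can be sketched in a few lines of Python. This is an illustrative toy, not what rsync or any production CDC system uses (the hash function and size parameters are made up): a cut is declared wherever the low bits of a rolling hash hit zero, so boundaries depend only on nearby content and re-synchronize shortly after an insertion.

```python
import hashlib
import random

def chunk_boundaries(data, mask_bits=12, min_size=1024, max_size=16384):
    """Toy content-defined chunking: cut where the low bits of a rolling
    hash are zero. A byte's contribution is shifted out of the 32-bit hash
    after 32 steps, so cut points re-sync soon after an insertion."""
    mask = (1 << mask_bits) - 1
    ends, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF  # old bytes shift out over time
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            ends.append(i + 1)
            start, h = i + 1, 0
    if start < len(data):
        ends.append(len(data))
    return ends

def chunks(data):
    out, start = [], 0
    for end in chunk_boundaries(data):
        out.append(data[start:end])
        start = end
    return out

random.seed(0)
base = bytes(random.randrange(256) for _ in range(200_000))
edited = base[:100_000] + b"PATCH" + base[100_000:]  # insert 5 bytes mid-file

a = {hashlib.sha256(c).digest() for c in chunks(base)}
b = {hashlib.sha256(c).digest() for c in chunks(edited)}
print(f"{len(a & b)} of {len(b)} chunks unchanged")
```

With fixed-size chunks, those 5 inserted bytes would shift every boundary after the edit and invalidate half the file; content-defined cuts only lose the chunk or two around the edit.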
One more: do you prefer the CDC technique over using the row groups as chunks (i.e. using knowledge of the file structure)? Is it worth it to build a Parquet-specific diff?
I think both are necessary. The CDC technique is file-format independent; the row group method makes Parquet itself robust to it.
I just don't understand how these guys can literally give terabytes of free storage and free data transfer to everyone. I did some rough cost calculations for my own storage and transfers, and if they used something like S3 it would have cost them thousands of dollars. And I don't pay them anything.
> As Hugging Face hosts nearly 11PB of datasets with Parquet files alone accounting for over 2.2PB of that storage
11PB on S3 would cost ~$250k per month / $3m per year.
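Back-of-the-envelope version of that figure, assuming S3 Standard's first-tier list price of roughly $0.023 per GB-month and ignoring request and egress costs:

```python
petabytes = 11                      # from the article
gigabytes = petabytes * 1_000_000   # decimal units, as S3 pricing uses
price_per_gb_month = 0.023          # assumed S3 Standard first-tier list price
monthly = gigabytes * price_per_gb_month
print(f"${monthly:,.0f}/month, ${monthly * 12:,.0f}/year")
```

That lands right around the $250k/month, $3M/year estimate above; the real bill would differ with tiering, egress, and negotiated discounts.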
HuggingFace has raised almost $400M.
Not saying it's nothing, but probably not a big deal to them (e.g. ~10 of their 400+ staff cost more).
HuggingFace really is such an amazing resource for the ML community. Not just for storing datasets, but being able to stand up a demo of my models using Spaces for anyone to use? It's hard to overstate how useful that is.
We are here to help lower that :-) . Since we can push dedupe to the edge, we can save on bandwidth as well, and hopefully make uploads and downloads faster for everyone.