Improving Parquet Dedupe on Hugging Face Hub

43 points, posted 8 hours ago
by ylow

14 Comments

ignoreusernames

6 hours ago

> Most Parquet files are bulk exports from various data analysis pipelines or databases, often appearing as full snapshots rather than incremental updates

I'm not really familiar with how they manage datasets, but all of the table formats (Iceberg, Delta and Hudi) support appending and some form of "merge-on-read" deletes that could help with this use case. Instead of fully replacing datasets on each dump, more granular operations could be done. The issue is that this requires changing pipelines and some extra knowledge about the dataset itself. A fun idea might be to take a table format like Iceberg and, instead of using Parquet to store the data, just store the column data with the metadata defined externally somewhere else. On each new snapshot, a set of transformations (sorting, splitting blocks, etc.) could be applied to minimize the potential byte diff against the previous snapshot.
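As a rough sketch of what "more granular operations" could look like (assuming the `deltalake` Python bindings for delta-rs; the table location, schema, and data below are hypothetical), the pipeline would append or merge changed rows rather than re-dumping the whole table:

```python
# Hedged sketch: incremental appends with a table format instead of full
# snapshot dumps. Table location, schema, and data are hypothetical.
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

table_uri = "s3://my-bucket/events"   # hypothetical location

# New rows produced by today's pipeline run.
new_rows = pa.table({
    "event_id": pa.array([1001, 1002], pa.int64()),
    "payload": pa.array(["a", "b"]),
})

# A stable sort order ("normalize before writing") keeps successive snapshots
# byte-similar, which also helps chunk-level dedupe downstream.
new_rows = new_rows.sort_by([("event_id", "ascending")])

# Append only the new rows; existing files are left untouched, so there is
# nothing for the storage layer to re-deduplicate.
write_deltalake(table_uri, new_rows, mode="append")

# Readers still see one logical table.
dt = DeltaTable(table_uri)
print(dt.version(), len(dt.files()))
```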

kwillets

7 hours ago

I'm surprised that Parquet didn't maintain the Arrow practice of using mmap-able relative offsets for everything, although these could be called relative to the beginning of the file.
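For context, a small illustration of why those absolute offsets matter for dedupe (the file path is hypothetical): the Parquet footer records byte positions measured from the start of the file for every column chunk, so shifting any earlier data rewrites all of the later offsets even when the column data itself is unchanged.

```python
# Peek at the absolute offsets recorded in a Parquet footer (path hypothetical).
import pyarrow.parquet as pq

md = pq.ParquetFile("snapshot.parquet").metadata
for rg in range(md.num_row_groups):
    for col in range(md.num_columns):
        cc = md.row_group(rg).column(col)
        # data_page_offset is a byte position from the start of the file,
        # not relative to its row group.
        print(rg, cc.path_in_schema, cc.data_page_offset, cc.total_compressed_size)
```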

ylow

7 hours ago

I believe Parquet predates Arrow. That's probably why.

jmakov

7 hours ago

Wouldn't it be easier to extend delta-rs to support deduplication?

ylow

7 hours ago

Can you elaborate? As I understand it, Delta Lake provides transactions on top of existing data and effectively stores "diffs" because it knows what the transaction did. But when you have regular snapshots, it's much harder to figure out the effective diff, and that is where deduplication comes in. (Quite like how git actually stores snapshots of every file version, but very aggressively compressed.)
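To make the contrast concrete, a small hedged illustration (assuming the `deltalake` Python bindings for delta-rs; the table location is hypothetical): Delta Lake's transaction log says what each commit did, while a plain snapshot dump leaves the storage layer to discover the overlap on its own.

```python
# Delta Lake records each operation explicitly in its transaction log.
from deltalake import DeltaTable

dt = DeltaTable("s3://my-bucket/events")   # hypothetical table
for commit in dt.history():
    # Each entry says what the writer did (WRITE, MERGE, DELETE, ...), so the
    # "diff" between versions is known rather than inferred.
    print(commit.get("operation"), commit.get("timestamp"))

# A plain Parquet snapshot dump carries no such log; the only way to avoid
# re-storing unchanged bytes is to rediscover them, which is what
# content-defined deduplication does.
```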

skadamat

7 hours ago

Love this post and the visuals! Great work

kwillets

7 hours ago

How does this compare to rsync/rdiff?

ylow

7 hours ago

Great question! Rsync also uses a rolling hash / content-defined chunking approach to deduplicate and reduce communication, so it will behave very similarly.
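For anyone curious what that looks like, here is a minimal content-defined chunking sketch (the window size, mask, and hash below are illustrative choices, not what rsync or the Hub actually use): a rolling hash slides over the bytes and a boundary is cut wherever its low bits match a fixed pattern, so chunk boundaries follow content rather than file offsets and an insertion only disturbs the chunks around it.

```python
# Minimal content-defined chunking sketch; parameters are illustrative only.
import hashlib
import random

WINDOW = 48            # rolling-hash window in bytes
MASK = (1 << 13) - 1   # expect a boundary roughly every 8 KiB
PRIME, MOD = 31, 1 << 64

def chunk(data: bytes):
    top = pow(PRIME, WINDOW - 1, MOD)
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD   # drop the oldest byte
        h = (h * PRIME + b) % MOD                    # add the newest byte
        if i >= WINDOW and (h & MASK) == MASK and i + 1 - start >= WINDOW:
            chunks.append(data[start:i + 1])         # the content decided this cut
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Insert a few bytes into the middle of a file: only the chunks near the
# insertion change; everything else dedupes by hash.
random.seed(0)
old = bytes(random.randrange(256) for _ in range(200_000))
new = old[:100_000] + b"a small insertion" + old[100_000:]
stored = {hashlib.sha256(c).hexdigest() for c in chunk(old)}
new_chunks = chunk(new)
reused = sum(hashlib.sha256(c).hexdigest() in stored for c in new_chunks)
print(f"{reused} of {len(new_chunks)} chunks of the new file are already stored")
```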

kwillets

6 hours ago

One more: do you prefer the CDC technique over using the row groups as chunks (i.e. using knowledge of the file structure)? Is it worth it to build a Parquet-specific diff?

ylow

6 hours ago

I think both are necessary. The CDC technique is file-format independent; the row-group method makes Parquet robust to it.
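A rough sketch of the row-group side (snapshot file names are hypothetical, and this hashes decoded row-group contents rather than raw bytes, so it is illustrative rather than exact): row groups that are identical across two snapshots only need to be stored once.

```python
# Illustrative row-group-level overlap check between two Parquet snapshots.
import hashlib
import pyarrow as pa
import pyarrow.parquet as pq

def row_group_hashes(path):
    """Hash the decoded contents of every row group in a Parquet file."""
    pf = pq.ParquetFile(path)
    digests = []
    for i in range(pf.metadata.num_row_groups):
        table = pf.read_row_group(i)
        sink = pa.BufferOutputStream()
        with pa.ipc.new_stream(sink, table.schema) as writer:
            writer.write_table(table)   # deterministic IPC serialization
        digests.append(hashlib.sha256(sink.getvalue().to_pybytes()).hexdigest())
    return digests

old = set(row_group_hashes("snapshot_v1.parquet"))   # hypothetical files
new = row_group_hashes("snapshot_v2.parquet")
shared = sum(h in old for h in new)
print(f"{shared} of {len(new)} row groups unchanged between snapshots")
```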

YetAnotherNick

7 hours ago

I just don't understand how these guys can give away terabytes of free storage and free data transfer to everyone. I did a rough calculation of the cost of my own storage and transfers, and if they used something like S3 it would have cost them thousands of dollars. And I don't pay them anything.

mritchie712

6 hours ago

> As Hugging Face hosts nearly 11PB of datasets with Parquet files alone accounting for over 2.2PB of that storage

11PB on S3 would cost ~$250k per month / $3m per year.
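Back-of-the-envelope, assuming roughly S3 Standard list pricing of ~$0.023/GB-month (actual negotiated rates would differ):

```python
# Rough S3 cost check; $0.023/GB-month is an assumed list price, not a quote.
petabytes = 11
gigabytes = petabytes * 1_000_000    # 1 PB ~= 1,000,000 GB (decimal)
monthly = gigabytes * 0.023          # ~= $253,000 per month
print(f"${monthly:,.0f}/month, ${monthly * 12:,.0f}/year")   # ~$253k / ~$3.0M
```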

HuggingFace has raised almost $400M.

Not saying it's nothing, but probably not a big deal to them (e.g. ~10 of their 400+ staff cost more).

fpgaminer

5 hours ago

HuggingFace really is such an amazing resource to the ML community. Not just for storing datasets, but being able to stand up a demo of my models using spaces for anyone to use? It's hard to overstate how useful that is.

ylow

6 hours ago

We are here to help lower that :-). Since we can push dedupe to the edge, we can save on bandwidth as well, and hopefully make everyone's uploads and downloads faster.
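A hypothetical sketch of what pushing dedupe to the edge means for bandwidth (none of these calls are the Hub's real API, and `chunk()` is the content-defined chunker sketched earlier in the thread): the client only ships the chunks the server does not already have.

```python
# Hypothetical edge-dedupe upload; the server endpoints are invented for
# illustration and chunk() is the CDC sketch from earlier in the thread.
import hashlib

def upload(data: bytes, server):
    chunks = chunk(data)
    digests = [hashlib.sha256(c).hexdigest() for c in chunks]
    missing = set(server.missing_chunks(digests))   # hypothetical RPC
    for digest, payload in zip(digests, chunks):
        if digest in missing:
            server.put_chunk(digest, payload)       # upload new bytes only
    server.commit_file(digests)                     # a file is an ordered list of chunk ids
```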