jasim
4 days ago
I think this post is a response to some new file format initiatives, based on the criticism that the Parquet file format is showing its age.
One of the arguments is that there is no standardized way to extend Parquet with new kinds of metadata (like statistical summaries, HyperLogLog sketches, etc.).
This post was written by the DataFusion folks, who have shown a clever way to do this without breaking backward compatibility with existing readers.
They insert arbitrary data between the data pages and the footer, which other readers ignore but query engines like DataFusion can exploit. In effect, they embed a new index in the .parquet file and use it to improve query performance.
In this specific instance, they add an index of all the distinct values of a column. Then they extend the DataFusion query engine so that a query like `WHERE nation = 'Singapore'` can check that index to figure out whether the value exists in that .parquet file at all, without having to scan the data pages. (That scan is already partly optimized, since min-max statistics let a reader skip data that cannot match.)
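To make the layout idea concrete, here is a rough sketch in Python. It is a toy container, not real Parquet: the footer is JSON rather than Thrift, and the `distinct_index` key and all function names are made up for illustration. The only parts borrowed from the real format are the `PAR1` magic bytes and the 4-byte little-endian footer length at the end of the file.

```python
import json
import struct

MAGIC = b"PAR1"  # Parquet's real magic bytes; the rest of this layout is a toy


def write_file(rows, with_index=True):
    """Lay out: MAGIC | data | [index] | footer | footer length | MAGIC."""
    buf = bytearray(MAGIC)
    data = json.dumps(rows).encode()
    footer = {"data_offset": len(buf), "data_length": len(data), "key_value": {}}
    buf += data
    if with_index:
        index = json.dumps(sorted(set(rows))).encode()
        # Hypothetical key: record where the index lives in footer metadata.
        footer["key_value"]["distinct_index"] = [len(buf), len(index)]
        buf += index
    footer_bytes = json.dumps(footer).encode()
    buf += footer_bytes + struct.pack("<I", len(footer_bytes)) + MAGIC
    return bytes(buf)


def read_footer(blob):
    """Locate the footer from the trailing length field, as Parquet readers do."""
    assert blob[:4] == MAGIC and blob[-4:] == MAGIC
    (footer_len,) = struct.unpack("<I", blob[-8:-4])
    return json.loads(blob[-8 - footer_len:-8])


def legacy_read(blob):
    """A reader unaware of the index: it only follows the footer's data offsets."""
    footer = read_footer(blob)
    start = footer["data_offset"]
    return json.loads(blob[start:start + footer["data_length"]])


def may_contain(blob, value):
    """Index-aware check: False means the whole file can be skipped."""
    entry = read_footer(blob)["key_value"].get("distinct_index")
    if entry is None:
        return True  # no index present, so the file has to be scanned
    offset, length = entry
    return value in json.loads(blob[offset:offset + length])
```

With a file written by `write_file(["Singapore", "India"])`, `legacy_read` returns the rows without ever touching the index bytes, while `may_contain(blob, "Brazil")` lets an index-aware engine skip the file without reading the data section.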
Also in general this is a really good deep dive into columnar data storage.
hodgesrm
3 days ago
One question that the article does not cover: compaction. Adding custom indexes means you have to have knowledge of the indexes to compact Parquet files, since you'll want to reindex each time compaction occurs. Otherwise the indexes will at best be discarded. At worst they would even be corrupted.
So it looks as if adopting custom indexes means you are adopting not just a particular engine for reading but also a particular engine for compaction. That in turn means you can't use generic mechanisms like the compaction mechanism in S3 table buckets. Am I missing something?
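A toy illustration of the concern, with files modeled as plain Python dicts (the `rows` and `distinct_index` keys and both function names are hypothetical, and this is not how any real compactor is implemented):

```python
def generic_compact(files):
    """Index-unaware compaction: merges rows and drops anything it does not understand."""
    return {"rows": [row for f in files for row in f["rows"]]}


def index_aware_compact(files):
    """Compaction that knows about the custom index and rebuilds it from the merged rows."""
    merged = generic_compact(files)
    merged["distinct_index"] = set(merged["rows"])
    return merged
```

The generic path silently loses the index, which is the "at best discarded" case; only a compactor that knows how to rebuild the index keeps the merged file useful to an index-aware reader.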
dmvinson
4 days ago
What are the new file format initiatives you're referencing here?
This solution seems clever overall, and finding a way to bolt on features of the latest-and-greatest new hotness without breaking backwards compatibility is a testament to the DataFusion team. Supporting legacy systems is crucial work, even if things need a ground-up rewrite periodically.
MasterIdiot
4 days ago
Off the top of my head:
- Vortex https://github.com/vortex-data/vortex
- Lance https://github.com/lancedb/lance
- Nimble https://github.com/facebookincubator/nimble
There are also a bunch of ideas coming out of academia, but I don't know how many of them have a sustained effort behind them and not just a couple of papers.
dkdcio
4 days ago
Lance (from LanceDB folks), Nimble (from Meta folks, formerly known as Alpha); I think there are a few others
kernelsanderz
3 days ago
I’ve been excited about LanceDB and its ability to support vector indexes and efficient row-level lookups. I wonder if this approach would work for their design goals and still allow broader backwards compatibility with the Parquet ecosystem. I have also been intrigued by DuckLake, and they’ve leaned into Parquet. Perhaps this approach will allow more flexible indexing while keeping support for the broader Parquet ecosystem, which is significant.
lmeyerov
4 days ago
Yeah, I'm happy to see this. We have been curious about it as part of figuring out cloud-native storage extensions to GFQL (graph dataframe-native query lang), and my intuition was that Parquet was pluggable here... And this is the first cogent writeup I'm seeing.
This also means, afaict, that it's pretty straightforward to do novel indexing schemes within Iceberg as well, just by reusing this.
The other aspect I've been curious about is the happy path for pluggable types for custom columns. This shows one way, but I'm unclear whether it's the same thing.
alamb
4 days ago
We are actively working on supporting extension types. The mechanism is likely to be using the Arrow extension type mechanism (a logical annotation on top of existing Arrow types https://arrow.apache.org/docs/format/Columnar.html#format-me...)
I expect this to be used to support Variant https://github.com/apache/datafusion/issues/16116 and geometry types
(note I am an author)
jasim
4 days ago
I'm not sure if this is what you're looking for, but there is a proposal in DataFusion to allow user-defined types: https://github.com/apache/datafusion/issues/12644
lmeyerov
3 days ago
Thank you, looking forward to reading!
jjtheblunt
4 days ago
nice summary!