vouwfietsman
an hour ago
Not sure why this got so many upvotes, also the landing page is not great, it's better to look at the paper (see link below).
Seems to be a columnar storage format that addresses some shortcomings in Parquet. Thing is, though, that of all these formats the real winning feature is compatibility, which is (obviously) very hard to improve on, as anything new immediately loses on that front.
Parquet is unfortunately very good just by virtue of being first, and so widely supported. The most widely used Parquet version is the oldest one, from 2013 (as per the paper itself), so even newer versions of Parquet couldn't supplant the original. If you want to improve on it, you need to bring some serious results, which I don't think F3 does.
Also, my main gripe with parquet (single table per file) is not even addressed, so, also the name is a bit hyped up.
Also also, it seems to go out of its way to include a compiled wasm binary for decoding, yet requires FlatBuffers to parse that blob? Kind of defeats the purpose.
Its main result seems to be improved random access which, although certainly welcome, is not the point of columnar storage, as columnar storage was invented to exchange random access for something else: fast analytics. F3 seems to sacrifice fast analytics for the wasm decoder. I don't get it.
Maybe I'm being too cynical. Can someone help me out here?
aduffy
an hour ago
> Also, my main gripe with parquet (single table per file) is not even addressed, so, also the name is a bit hyped up.
This is really more of an expectation that has been put on file formats by the query engines. Spark/DataFusion/DuckDB wouldn't really know what to do with a multi-table file.
> Parquet is unfortunately very good just by virtue of being first, and so widely supported
IMO that is not how technology works. It is great that Parquet is so good at a lot of things, but that does not mean just because it came first that it deserves to be the only analytic file format forever.
> Its main result seems to be improved random access which, although certainly welcome, is not the point of columnar storage, as columnar storage was invented to exchange random access for something else: fast analytics
Fast analytics, as well as newer ML-shaped workloads, are inherently a mix of batch scans and random access.
Some of the authors of F3 previously authored another paper that goes into the details of the shortcomings of Parquet:
https://www.vldb.org/pvldb/vol17/p148-zeng.pdf
All of the newer formats that popped up recently (Vortex, Lance, F3 now) have been working on solving the problems outlined in that paper.
Lance has some interesting ideas; Vortex focuses on extensibility and performance by replacing all of Parquet's black-box encoders with fully transparent encodings. This solves the tradeoff between bulk and element decoding, allowing you to have efficient full scans and really fast random access.
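To make the bulk-versus-element tradeoff concrete, here is a minimal Python sketch. It is not Vortex's or Parquet's actual code, just an illustration under assumptions: zlib stands in for a black-box block compressor, and plain run-length encoding stands in for a transparent encoding. The class names `BlackBoxColumn` and `RunLengthColumn` are made up for this example.

```python
import bisect
import struct
import zlib
from itertools import groupby


class BlackBoxColumn:
    """int64 column stored as one opaque zlib-compressed block (assumed stand-in)."""

    def __init__(self, values):
        self.length = len(values)
        raw = struct.pack(f"<{len(values)}q", *values)
        self.blob = zlib.compress(raw)

    def scan(self):
        # Bulk decode: decompress the whole block, then unpack every value.
        raw = zlib.decompress(self.blob)
        return list(struct.unpack(f"<{self.length}q", raw))

    def get(self, i):
        # The block is opaque, so reaching element i means decoding all of it.
        return self.scan()[i]


class RunLengthColumn:
    """Column stored as run values plus exclusive end offsets (transparent layout)."""

    def __init__(self, values):
        self.run_values = []
        self.run_ends = []
        end = 0
        for value, group in groupby(values):
            end += sum(1 for _ in group)
            self.run_values.append(value)
            self.run_ends.append(end)
        self.length = end

    def scan(self):
        # Bulk decode: expand each run in order.
        out = []
        start = 0
        for value, end in zip(self.run_values, self.run_ends):
            out.extend([value] * (end - start))
            start = end
        return out

    def get(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        # Binary search over run ends: O(log runs), and nothing gets expanded.
        return self.run_values[bisect.bisect_right(self.run_ends, i)]


values = [7] * 1000 + [9] * 500 + [7] * 2
opaque = BlackBoxColumn(values)
transparent = RunLengthColumn(values)
```

Both columns support a full scan, but only the second can answer `get(i)` without decoding everything, because the reader can see and exploit the structure of the encoding.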
E.g. LangChain recently rebuilt a system that used to be all Parquet files to use Vortex and saw a massive speedup, which they talk about more here: https://www.langchain.com/blog/introducing-smithdb
Disclaimer: I work on Vortex, so a lot of these questions about "what is the point of building a new format" are things that I have grappled with myself.
vouwfietsman
39 minutes ago
> DuckDB wouldn't really know what to do with a
Sure it would: you can attach a multi-table SQLite database in DuckDB.
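For illustration, a single-file, multi-table database is easy to produce with Python's standard `sqlite3` module. The schema and helper names below are made up. As far as I know, DuckDB's sqlite extension can then open such a file with something like `ATTACH 'demo.db' (TYPE sqlite);`, though that step is not exercised in this sketch.

```python
import sqlite3


def build_multi_table_file(path):
    """Write two unrelated tables into one SQLite file (hypothetical schema)."""
    con = sqlite3.connect(path)
    try:
        con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
        con.execute("CREATE TABLE events (user_id INTEGER, kind TEXT)")
        con.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])
        con.executemany("INSERT INTO events VALUES (?, ?)", [(1, "login"), (2, "click")])
        con.commit()
    finally:
        con.close()


def list_tables(path):
    """Return the names of all tables stored in the file, sorted by name."""
    con = sqlite3.connect(path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]
```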
> that does not mean just because it came first
I agree with most of your points; I am stating my observations, not my opinion. I am the target audience here, I want to use this, but I don't really care too much about the file format itself, at least not as much as I care about the data inside.
That means access, which means compatibility with my tooling.
Compatibility is hard to beat.
This is the Concorde of file formats.
aduffy
26 minutes ago
That is fair.
FWIW I think if you are just doing pure analytics and nothing else, Parquet will probably continue to do the job for you just fine, and you don't need to touch your workloads at all.
These new formats, I think, will find a niche where people aren't just running Spark jobs but doing lots of systems building over large tables. If you're building a PB-scale data warehouse, you care a lot about the file format b/c it is a big factor in your performance curve. You're also willing to ship new experimental codecs to support new datatypes that the system wasn't originally designed for, or to use a newly invented compressor.
sanderjd
14 minutes ago
Yeah that point about "random access is not the point of columnar formats" fell flat for me for this same reason. Almost since the first day I started using columnar data, I've been interested in solutions that strike this balance between batch and random access. This comes up all the time (in my experience) in data science / ML, where we have use cases for both access patterns against the same data.
So I'm with you, I'm very unconvinced that parquet (and the various things that are parquet or essentially-parquet under the hood) are the end of the line here.
mschuster91
25 minutes ago
>Not sure why this got so many upvotes, also the landing page is not great
Frankly, it's a change from the usual ChatGPT-generated slop that most landing pages are these days.