joistef
a month ago
A few things that might help:
The Delta Lake suggestion is tempting, but I don't think it actually solves your core problem: schema evolution makes the ADD COLUMN itself metadata-only, yet you still need to rewrite rows to populate the new column with non-default values. Their deletion vectors help with row-level deletes and updates, not column backfills.
The separate table approach is probably your best bet, but I'd avoid row_number joins - parquet scans don't guarantee a stable row order, so the numbering can drift between runs. If you can do a one-time rewrite to add a content hash or deterministic row ID, your annotation table becomes (row_id, score) and the join is trivial. Yes, it's a one-time cost, but it pays off every time you add a new annotation column.
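Rough sketch of that flow with DuckDB's Python API - the file names and the (user_id, ts, payload) columns are placeholders, so hash whatever combination of columns actually identifies a row uniquely in your data:

    import duckdb

    con = duckdb.connect()

    # One-time rewrite: derive a deterministic row ID from row content.
    con.execute("""
        COPY (
            SELECT md5(concat_ws('|', user_id::VARCHAR, ts::VARCHAR, payload)) AS row_id, *
            FROM read_parquet('events.parquet')
        ) TO 'events_with_ids.parquet' (FORMAT PARQUET)
    """)

    # Each annotation pass is now just a tiny (row_id, score) file...
    con.execute("""
        COPY (
            SELECT row_id, random() AS score  -- stand-in for your real scores
            FROM read_parquet('events_with_ids.parquet')
        ) TO 'scores.parquet' (FORMAT PARQUET)
    """)

    # ...and the join is trivial.
    con.execute("""
        SELECT e.*, s.score
        FROM read_parquet('events_with_ids.parquet') e
        LEFT JOIN read_parquet('scores.parquet') s USING (row_id)
    """).fetchall()

Every annotation column after that is just another small file - you never touch the base data again.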
If a rewrite is truly off the table, the overlay pattern works: keep annotations in separate parquet files keyed by physical position - (file_path, row_group_index, row_offset_within_group), or just a per-file row number. DuckDB handles this reasonably well, since read_parquet can emit filename and file_row_number columns for you. The ergonomics aren't great, and the keys only stay valid as long as nobody rewrites or compacts the base files, but it's honest about the tradeoff you're making.
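Here's roughly what that looks like, keyed on (filename, file_row_number) - a flat per-file offset rather than per-row-group coordinates, since DuckDB emits both columns for you. The data/*.parquet glob and overlay.parquet name are made up:

    import duckdb

    con = duckdb.connect()

    # Build the overlay once, keyed by physical position in the base files.
    con.execute("""
        COPY (
            SELECT filename, file_row_number, random() AS score  -- your scores here
            FROM read_parquet('data/*.parquet',
                              filename = true, file_row_number = true)
        ) TO 'overlay.parquet' (FORMAT PARQUET)
    """)

    # At query time, recompute positions and join against the overlay.
    con.execute("""
        SELECT d.*, o.score
        FROM read_parquet('data/*.parquet',
                          filename = true, file_row_number = true) d
        LEFT JOIN read_parquet('overlay.parquet') o
            USING (filename, file_row_number)
    """).fetchall()

Keying on a flat row number keeps the key to two columns; if your engine only exposes row-group-level offsets, the same join works with the three-column key instead.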
The parquet metadata idea is clever, but footer metadata is file-level key/value pairs - size-limited, parsed in full by every reader, and not designed for per-row data. You'd be fighting the format.
What query engine are you using? That might change the calculus.