boshomi
3 hours ago
Why not just use Ducklake?[1] That reduces complexity[2] since only DuckDB and PostgreSQL with pg_duckdb are required.
[2] DuckLake - The SQL-Powered Lakehouse Format for the Rest of Us by Prof. Hannes Mühleisen: https://www.youtube.com/watch?v=YQEUkFWa69o
mslot
2 hours ago
DuckLake is pretty cool, and we obviously love everything the DuckDB is doing. It's what made pg_lake possible, and what motivated part of our team to step away from Microsoft/Citus.
DuckLake can do things that pg_lake cannot do with Iceberg, and DuckDB can do things Postgres absolutely can't (e.g. query data frames). On the other hand, Postgres can do a lot of things that DuckDB cannot do. For instance, it can handle >100k single row inserts/sec.
Transactions don't come for free. Embedding the engine in the catalog rather than the catalog in the engine enables transactions across analytical and operational tables. That way you can do a very high rate of writes in a heap table, and transactionally move data into an Iceberg table.
Postgres also has a more natural persistence & continuous processing story, so you can set up pg_cron jobs and use PL/pgSQL (with heap tables for bookkeeping) to do orchestration.
There's also the interoperability aspect of Iceberg being supported by other query engines.
mritchie712
an hour ago
> For instance, it can handle >100k single row inserts/sec.
DuckLake already has data-inlining for the DuckDB catalog, seems this will be possible once it's supported in the pg catalog.
> Postgres also has a more natural persistence & continuous processing story, so you can set up pg_cron jobs and use PL/pgSQL (with heap tables for bookkeeping) to do orchestration.
This is true, but it's not clear where I'd use this in practice. e.g. if I need to run a complex ETL job, I probably wouldn't do it in pg_cron.
jabr
2 hours ago
How does this compare to https://www.mooncake.dev/pgmooncake? It seems there are several projects like this now, with each taking a slightly different approach optimized for different use cases?
mslot
an hour ago
Definitely similar goals, from the Mooncake author: https://news.ycombinator.com/item?id=43298145
I think pg_mooncake is still relatively early stage.
There's a degree of maturity to pg_lake resulting from our team's experience working on extensions like Citus, pg_documentdb, pg_cron, and many others in the past.
For instance, in pg_lake all SQL features and transactions just work, the hybrid query engine can delegate different fragments of the query into DuckDB if the whole query cannot be handled, and having a robust DuckDB integration with a single DuckDB instance (rather than 1 per session) in a separate server process helps make it production-ready. It is used in heavy production workloads already.
No compromise on Postgres features is especially hard to achieve, but after a decade of trying to get there with Citus, we knew we had to get that right from day 1.
Basically, we could speed run this thing into a comprehensive, production-ready solution. I think others will catch up, but we're not sitting still either. :)
j_kao
2 hours ago
FYI the mooncake team was acquired by Databricks so it's basically vendors trying to compete on features now :)
anktor
an hour ago
What does data frames mean in this context? I'm used to them in spark or pandas but does this relate to something in how duckDB operates or is it something else?
pgguru
2 hours ago
Boils down to design decisions; see: https://news.ycombinator.com/item?id=45813631