Build a serverless ACID database with this one neat trick (atomic PutIfAbsent)

69 points | posted 14 hours ago
by todsacerdoti

17 Comments

notamy

10 hours ago

I know the article says "analytics workloads," but I wonder if this pattern could be used for, e.g., bursty-write workloads.

For context, I have a use-case where I have a Meilisearch server that has all of its data updated once daily, and every single request outside of that one bulk update is a read. The only exception is that once in a great while, something breaks and the entire index has to be rebuilt. Since it's a hobby project, Meilisearch Cloud and similar solutions are too expensive for my workload. I keep finding myself wishing there were a serverless search engine that would free me from having to self-host one, keep a dedicated server running and updated, etc. If I could "just" run a search engine with Cloudflare Workers + R2 (+ probably ancillary CF services) I'd be very happy.

Unfortunately I don't know enough to answer my own question; I can't intuit the answer just from reading the docs, and database implementations aren't my area of expertise.

eatonphil

10 hours ago

> I know the article says "analytics workloads," but I wonder if this pattern could be used for, e.g., bursty-write workloads.

The key point is that Delta Lake and Iceberg concurrency control is extremely coarse-grained. As I mentioned in the article, it seems they did this because they wanted to avoid an external highly-available metadata store (e.g. FoundationDB, which is what Snowflake uses).

Just about every Database-as-a-Service (OLTP or OLAP) these days supports tiering storage to cloud blob storage. Delta Lake and Iceberg just took cloud blob storage to the extreme by trying to do everything in it, or as much as possible. (Iceberg seems to require a compare-and-swap operation, which I'm not sure can be expressed with atomic PutIfAbsent alone; it seems to require an external HA metadata store for its single current-metadata pointer.) And in doing this they gave up fine-grained concurrency control.
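
Concretely, here's a minimal sketch of the PutIfAbsent primitive in Python, assuming S3's conditional-write (If-None-Match) support added in 2024; bucket and key names are placeholders:

    # put_if_absent.py -- minimal sketch of atomic PutIfAbsent on S3,
    # assuming S3's conditional-write support (If-None-Match: *).
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    def put_if_absent(bucket: str, key: str, body: bytes) -> bool:
        """Create the object only if the key doesn't already exist."""
        try:
            s3.put_object(Bucket=bucket, Key=key, Body=body, IfNoneMatch="*")
            return True
        except ClientError as e:
            if e.response["Error"]["Code"] == "PreconditionFailed":
                return False  # lost the race: someone created the key first
            raise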

maxmcd

8 hours ago

I have also been thinking about this:

https://slatedb.io/ is a new KV store on object storage. It may be helpful.

There's a lot of similar work over IPFS. I haven't had the chance to experiment yet, but I'd like to see what a workflow looks like with, e.g., summa: https://github.com/izihawa/summa

I tried to implement trigram search on object storage using Zoekt (https://github.com/sourcegraph/zoekt), but I found that common queries load 30-40% of the entire index, so setting it up without some kind of caching strategy felt a little ridiculous.

bigiain

8 hours ago

Back in the day (and I'm talking about when memcached was still written in Perl here, so the late 90s or so) "bursty-write workloads" were a really good fit for us: we used SQL queries as keys and stored the responses as values. For every read you first do a (super fast) memcached lookup using the SQL you're about to run as the key. If the key exists, you send the value back without running the query; if it doesn't, you run the query and store the response as the value before returning it.

It depends a lot on how big your dataset is, and the distribution of reads across your data. But if you have a "fat head long tail" type distribution it can work really well.

(Of course cache invalidation is always a hard problem. We worked around it by trying to ensure all our code that did database writes knew how to delete the relevant memcached entries, by force-expiring the whole cache at our lowest-traffic times, and by having a big red panic button that would delete the entire cache if needed.)
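
For the curious, here's a rough sketch of that read-through pattern in Python using pymemcache; run_query() is a hypothetical stand-in for the real database call:

    # Rough sketch of the read-through pattern described above, using
    # pymemcache. run_query() is a hypothetical helper that hits the
    # real database; result serialization is kept to plain JSON.
    import hashlib
    import json

    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))

    def cached_query(sql: str, ttl: int = 300):
        # memcached keys can't contain whitespace and max out at 250
        # bytes, so hash the SQL text to build the key.
        key = "sql:" + hashlib.sha1(sql.encode()).hexdigest()
        hit = cache.get(key)
        if hit is not None:
            return json.loads(hit)  # cache hit: skip the database
        rows = run_query(sql)  # cache miss: run the real query
        cache.set(key, json.dumps(rows), expire=ttl)
        return rows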

chipdart

4 hours ago

> For every read you first do a (super fast) memcached lookup using the SQL you're about to run as the key. If the key exists, you send the value back without running the query; if it doesn't, you run the query and store the response as the value before returning it.

Wouldn't it have been far simpler to move the db behind a service that provides a cacheable request-response interface and adopt the simplest of caching strategies? In the case of HTTP, you even get local caching for free. No need to throw specialized in-memory cache services around.
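
As a hypothetical sketch of what I mean, with Flask standing in for the service and run_query() for the database call (both illustrative choices, not from the thread):

    # Expose the query behind an HTTP endpoint and let standard HTTP
    # caches (browser, CDN, reverse proxy) do the work via Cache-Control.
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.get("/products")
    def products():
        rows = run_query("SELECT * FROM products")  # hypothetical helper
        resp = jsonify(rows)
        # Any intermediary cache (and the browser) may reuse this
        # response for 5 minutes without touching service or database.
        resp.headers["Cache-Control"] = "public, max-age=300"
        return resp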

bigiain

3 hours ago

Like I said, I was doing that with memcached about 25-30 years back. Nobody had yet built "a service that provides a cacheable request-response interface" to move the db behind.

Just last week I helped one of our devs "cache" API responses for a read-heavy use case by saving the responses to S3 and having the web app read them from there instead of hitting the API endpoint and database.

There are a bunch of ways to skin this cat in 2024.

(One brilliant idea I've got bookmarked but haven't (yet) had an opportunity to use is a modified SQLite that uses S3 storage and range requests and is compiled to WASM - so a web app can run SQL queries locally and all the usual SQLite stuff "just works" from the WASM/JavaScript point of view. It _does_ need to be a very read-heavy use case though, because updating even a single row means uploading the entire SQLite db file to S3.)
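
The trick underneath is, roughly, plain HTTP Range requests against the database file. A small illustrative sketch; the URL and 4096-byte page size are assumptions:

    # Fetch only the SQLite pages you need with HTTP Range requests.
    import requests

    def read_page(url: str, page_no: int, page_size: int = 4096) -> bytes:
        start = page_no * page_size
        headers = {"Range": f"bytes={start}-{start + page_size - 1}"}
        r = requests.get(url, headers=headers)
        r.raise_for_status()  # expect 206 Partial Content
        return r.content

    # e.g. page 0, which starts with the SQLite file header:
    header_page = read_page("https://example-bucket.s3.amazonaws.com/app.db", 0)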

danielheath

7 hours ago

My approach to this class of issue has been to make the reindex "create a new search server, build the index, then switch the application to the new one and turn off the old one".

At the cost of doubling storage/IO requirements for search, this works around virtually every kind of outage-due-to-index-corruption (don't switch to the new one if it fails healthchecks).
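
On a single host, the switch itself can be as small as an atomic symlink swap. A sketch, with build_index() and healthy() as hypothetical stand-ins for the real rebuild and healthcheck:

    # Build the new index in a fresh directory, then repoint a "current"
    # symlink with os.replace (atomic on POSIX).
    import os
    import tempfile

    def reindex(data_dir: str = "/srv/search") -> None:
        new_dir = tempfile.mkdtemp(prefix="index-", dir=data_dir)
        build_index(new_dir)  # hypothetical: full rebuild into new_dir
        if not healthy(new_dir):  # hypothetical healthcheck
            raise RuntimeError("new index failed healthchecks; keeping old one")
        tmp_link = os.path.join(data_dir, "current.tmp")
        os.symlink(new_dir, tmp_link)
        # Atomically replace the old "current" symlink; readers only
        # ever see the old or the new index, never a mix.
        os.replace(tmp_link, os.path.join(data_dir, "current"))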

jakozaur

2 hours ago

Tabular, the company behind Iceberg (a standard similar to Delta Lake), was acquired for around $1B.

It's worth highlighting the pros/cons of this approach.

Pros:

+ straightforward, no running metadata server needed, scales to zero compute

+ zero-copy clones, time travel/versioning, and shareable data. Git style, but for your data

+ can work with many vendors without the need to copy data

Cons:

- a single optimistic lock. This doesn't work if you have many clients doing many small writes concurrently; you need to batch writes, ideally in an append-mostly way (see the sketch at the end of this comment)

- quite low-level. You still need something doing compaction, access control, etc.

- you miss some database features (e.g. skip indexes based on Bloom filters), and others, such as materialized views over the data, will be hard to implement if you write directly to blob storage

The data lake is amazing for cold data; for fast-changing recent data (e.g. the last day), it is better to use different storage.
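
To make the single-optimistic-lock con concrete, here's a rough sketch of the commit loop such systems imply, reusing put_if_absent() from the S3 sketch earlier in the thread; the log layout (numbered JSON entries, Delta-style) and conflict handling are illustrative:

    import json
    import time

    from put_if_absent import put_if_absent  # the earlier S3 sketch

    def commit(bucket: str, version: int, actions: list) -> int:
        while True:
            key = f"_log/{version:020d}.json"
            if put_if_absent(bucket, key, json.dumps(actions).encode()):
                return version  # we won the race for this version number
            # Someone else committed this version first. A real writer
            # would read their entry, check for conflicts, and rebase;
            # here we just retry at the next version. With many small
            # concurrent writers, most round trips are spent losing
            # this race.
            version += 1
            time.sleep(0.05)  # crude backoff; add jitter in practice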

benreesman

10 hours ago

For folks interested in this topic I highly recommend this talk by Cliff Click about building highly scalable hash tables via Compare And Swap:

https://youtu.be/HJ-719EGIts

rohan_

4 hours ago

Can anyone explain the rise in DBs backed by object storage?

chipdart

3 hours ago

> Can anyone explain the rise in DBs backed by object storage?

I dare say it's a "when the only tool you have is a hammer" problem.

But it might also be the fact that object storage services are reliable, relatively cheap by cloud standards, and have no practical upper bound on resource availability.

kthejoker2

6 hours ago

Just pointing out that Delta Lake and Iceberg are based on the columnar Parquet format and are designed for OLAP / data-warehousing workloads.

So managing write concurrency is definitely a secondary priority, since multiple writers aren't the norm.

convolvatron

12 hours ago

I wish we generally used solutions like this instead of always reaching for the 200lb database hammer.

However, I think the article would benefit from distinguishing this approach from a "real" database: MVCC gets you a lot more concurrency because it spreads out the exclusionary state in both time and space.

The optimistic concurrency used in the article works well at low contention, then becomes arbitrarily latent as more and more proposers try to jump into the same RTT gap with both their initial attempts and their retries.

eatonphil

12 hours ago

> However, I think the article would benefit from distinguishing this approach from a "real" database: MVCC gets you a lot more concurrency because it spreads out the exclusionary state in both time and space.

I thought I was already going to be seen as unfairly critical of the coarseness of the concurrency control.

convolvatron

10 hours ago

In any case, thanks for writing it up. I was just thinking about doing this in a broadcast domain to provide the same kind of low-throughput bootstrapping consistency that people would normally use ZooKeeper or etcd for. It's certainly very useful in the cases where it fits.