notamy
a year ago
I know the article says "analytics workloads," but I wonder if this pattern could be used for ex. bursty-write workloads.
For context, I have a use-case where I have a Meilisearch server that has all the data in it updated once daily, and then every single request outside of that one bulk update is a read. The only exception is that once in a great while, something breaks and the entire index has to be rebuilt. Since it's a hobby project, Meilisearch Cloud and similar hosted solutions are too expensive for my workload. I keep finding myself wishing that there was a serverless search engine that would free me from having to self-host one, keep a dedicated server running and updated, etc. If I could "just" run a search engine with Cloudflare Workers + R2 (+ probably ancillary CF services) I'd be very happy.
I don't know enough to actually answer my own question unfortunately; I can't intuit the answer just from reading the docs and database implementations aren't my area of expertise.
eatonphil
a year ago
> I know the article says "analytics workloads," but I wonder if this pattern could be used for ex. bursty-write workloads.
The key point is how coarse-grained Delta Lake and Iceberg concurrency control is. As I mentioned in the article, it seems they did this because they wanted to avoid an external highly-available metadata store (e.g. FoundationDB, which is what Snowflake uses).
Just about every Database as a Service (OLTP or OLAP) these days supports storage tiering to cloud blob storage. Delta Lake and Iceberg just took cloud blob storage to the extreme by trying to do everything in it, or as much as possible (Iceberg seems to require a compare-and-swap operation, which I'm not sure can be expressed with atomic PutIfAbsent alone; it seems to require an external HA metadata store for its single current-metadata pointer). And in doing this they gave up fine-grained concurrency control.
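A toy sketch of the difference (none of this is actual Iceberg or Delta code, and the key names are made up): put-if-absent can only create a key that doesn't exist yet, which is enough for immutable versioned files, but advancing a single mutable "current metadata" pointer needs compare-and-swap:

    # Toy illustration, not real Iceberg/Delta code.
    class ToyStore:
        def __init__(self):
            self.objects = {}

        def put_if_absent(self, key, value):
            # Atomic "create only if missing" -- roughly what blob stores offer.
            if key in self.objects:
                return False
            self.objects[key] = value
            return True

        def compare_and_swap(self, key, expected, new):
            # Atomic "replace only if unchanged" -- what a mutable pointer needs.
            if self.objects.get(key) != expected:
                return False
            self.objects[key] = new
            return True

    store = ToyStore()
    store.objects["current"] = "v00001.metadata"

    # Writing an immutable, versioned metadata file: put-if-absent is enough,
    # because exactly one racing writer gets to create "v00002.metadata".
    assert store.put_if_absent("v00002.metadata", "writer A's snapshot")
    assert not store.put_if_absent("v00002.metadata", "writer B's snapshot")

    # Advancing the one mutable "current" pointer: you have to know what you
    # are replacing, i.e. compare-and-swap.
    assert store.compare_and_swap("current", "v00001.metadata", "v00002.metadata")
    assert not store.compare_and_swap("current", "v00001.metadata", "v00003.metadata")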
maxmcd
a year ago
I have also been thinking about this:
https://slatedb.io/ is new and KV on object storage. Maybe helpful.
There's a lot of similar work over IPFS. I haven't had the chance to experiment yet, but I'd like to see what a workflow is like with, e.g., summa: https://github.com/izihawa/summa
I tried to implement trigram search on object storage using Zoekt: https://github.com/sourcegraph/zoekt, but I found that common queries load 30-40% of the entire index, so setting it up without some kind of caching strategy felt a little ridiculous.
bigiain
a year ago
Back in the day (and I'm talking about when memcached was still written in Perl here, so late 90s or so), memcached was a really good fit for our "bursty-write workloads" - using SQL queries as keys and storing the responses as values. For every read you first do a (super fast) memcached query using the SQL you're about to run as the key. If it exists, you just send the value back without running the query; if it doesn't exist you run the query and store the response as the value before returning it.
It depends a lot on how big your dataset is, and the distribution of reads across your data. But if you have a "fat head long tail" type distribution it can work really well.
(Of course cache invalidation is always a hard problem, which we worked around by trying to ensure all our code that did database writes knew how to delete relevant memcached entries, force expiring the whole cache at our lowest traffic times, and having a big red panic button that would delete the entire cache if needed.)
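Roughly, the read path looked like this (a modern-day sketch using pymemcache and sqlite3 rather than what we actually ran back then; the key is a hash of the SQL because memcached keys can't contain spaces):

    import hashlib
    import json
    import sqlite3

    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))
    db = sqlite3.connect("app.db")

    def cached_query(sql, params=(), ttl=300):
        key = hashlib.sha256((sql + repr(params)).encode()).hexdigest()
        hit = cache.get(key)
        if hit is not None:
            return json.loads(hit)  # cache hit: the query never runs
        rows = db.execute(sql, params).fetchall()
        cache.set(key, json.dumps(rows).encode(), expire=ttl)
        return rows

    # The write path then has to know which keys to delete (the hard part),
    # or you fall back to expiring / nuking the whole cache.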
chipdart
a year ago
> For every read you first do a (super fast) memcached query using the SQL you're about to run as the key. If it exists, you just send the value back without running the query; if it doesn't exist you run the query and store the response as the value before returning it.
Wouldn't it have been far simpler to just move the db behind a service that provides a cacheable request-response interface and just adopt the simplest of caching strategies? In the case of HTTP, you can even get local cache for free. No need to throw specialized memory cache services around.
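Something like this sketch is all I mean (Flask here, with a made-up endpoint and query) - once the response carries a Cache-Control header, any HTTP cache in front of the service works for free:

    import sqlite3

    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.get("/products/top")
    def top_products():
        db = sqlite3.connect("app.db")
        rows = db.execute(
            "SELECT name, sales FROM products ORDER BY sales DESC LIMIT 10"
        ).fetchall()
        resp = jsonify(rows)
        # Any HTTP cache between the client and this service (browser, CDN,
        # reverse proxy) can now serve the response for 5 minutes without
        # the query ever running.
        resp.headers["Cache-Control"] = "public, max-age=300"
        return resp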
bigiain
a year ago
Like I said, I was doing that with memcached about 25-30 years back. Nobody had yet built "a service that provides a cacheable request-response interface" to move the db behind.
Just last week I helped one of our devs through "caching" API responses for a "read heavy" use case by saving the responses to S3 and having the web app read them from there instead of hitting the API endpoint and database.
There are a bunch of ways to skin this cat in 2024.
(One brilliant idea I've got bookmarked but haven't (yet) had an opportunity to use is a modified SQLite that uses S3 storage and range requests and is compiled to WASM - so a web app can run SQL queries locally and all the usual SQLite stuff "just works" from the WASM/JavaScript point of view. It _does_ need to be a very read-heavy use case though, because updating even a single row means uploading the entire SQLite db file to S3.)
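Very roughly, the publish side of that idea looks like this (a sketch with boto3; the bucket, file, and table names are made up):

    import sqlite3

    import boto3

    def publish_update(new_rows):
        # All writes happen locally/offline...
        db = sqlite3.connect("search.db")
        db.executemany("INSERT INTO documents (title, body) VALUES (?, ?)", new_rows)
        db.commit()
        db.close()
        # ...and even a one-row change ends with re-uploading the whole file;
        # the browser side only ever reads it via HTTP range requests.
        boto3.client("s3").upload_file("search.db", "my-static-site", "search.db")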
nsenifty
a year ago
> so a web app can run SQL queries locally and all the usual SQLite stuff "just works" from the WASM/JavaScript point of view
Duckdb-wasm can do this.
bigiain
a year ago
Does it do the S3 storage and http range requests to seek to the needed byte offsets thing?
I think one day I'm gonna have a thing where I need a few GB or tens of GB of database to be available in a web or mobile app, but where that db rarely changes, and I want to do this without having to support/secure a backend platform or database.
orf
a year ago
Parquet.
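For example, with the DuckDB Python package (DuckDB-Wasm is the same engine running in the browser) - the URL below is a placeholder, and the httpfs extension fetches only the byte ranges it needs from the remote Parquet file:

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    # Only the byte ranges needed to answer the query are fetched from the
    # remote Parquet file; the rest of the (possibly multi-GB) file stays put.
    rows = con.execute(
        "SELECT category, count(*) "
        "FROM read_parquet('https://example.com/data/events.parquet') "
        "GROUP BY category"
    ).fetchall()
    print(rows)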
billywhizz
a year ago
i've been playing around with this idea on and off for a while. this was the original demo/implementation afaik: https://phiresky.github.io/blog/2021/hosting-sqlite-database....
this has all sorts of nice properties if it suits your needs, especially if you have lots of small read-only databases that just receive updates occasionally.
there's a little demo i did here: https://billywhizz.io/squeel-demo/
danielheath
a year ago
My approach to this class of issue has been to make the reindex "create a new search server, build the index, then switch the application to the new one and turn off the old one".
At the cost of doubling storage/IO requirements for search, this works around virtually every kind of outage-due-to-index-corruption (don't switch to the new one if it fails healthchecks).
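As a sketch (the helper functions are placeholders for whatever your search engine and infrastructure actually provide):

    def reindex(provision_server, build_index, healthcheck, switch_app_to, teardown):
        new = provision_server()   # the second copy: doubled storage/IO
        build_index(new)
        if not healthcheck(new):
            teardown(new)          # a corrupt/incomplete index never goes live
            raise RuntimeError("new index failed health checks; old index stays live")
        old = switch_app_to(new)   # repoint the application, returns the old server
        teardown(old)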