hackernews client

More databases should be single-threaded

52 pointsposted 2 months ago

(blog.konsti.xyz)

26 Comments

kgeist

2 months ago

From what I understand, the complexity stays there, it's just moved from the DB layer to the app layer (now I have to decide how to shard data, how to reshard, how to synchronize data across shards, how to run queries across shards without wildly inconsistent results), so as I developer I have more headaches now than before, when most of that was taken care of by the DB. I don't see why it's an improvement.

The author also mentions B2B and I'm not sure how it's going to work. I understand B2C where you can just say "1 user=1 single-threaded shard" because most user data is isolated/independent from other users. But with B2B, we have accounts ranging from 100 users per organization to 200k users per organization. Something tells me making a 200k account single-threaded isn't a good idea. On the other hand, artificially sharding inside an organization will lead to much more complex queries overall too, because usually a lot of business rules require joining different users' data within 1 org.

n2d4

2 months ago

It's a different kind of complexity. Essentially, your app layer needs shift from:

    - transaction serializability
    - atomicity
    - deadlocks (generally locks)
    - occ (unless you do VERY long tx, like a user checkout flow)
    - retries
    - scale, infrastructure, parameter tuning

towards thinking about

    - separating data into shards
    - sharding keys
    - cross-shard transactions

which can be sometimes easier, sometimes harder. I think there are a surprising amount of problems where it's much easier to think about sharding than about race conditions!

> But with B2B, we have accounts ranging from 100 users per organization to 200k users per organization.

You'd be surprised at how much traffic a single core (or machine) can handle — 200k users is absolutely within reach. At some point you'll need even more granular sharding (eg. per user within organization), but at that point, you would need sharding anyways (no matter your DB).

bawolff

2 months ago

If you have to think about cross-shard transactions then you have to think about all the things on your first list too, as they are complexities related to transaction. I fail to see how that could possibly be simpler.

n2d4

2 months ago

Cross-shard transactions are only a tiny fraction of transactions — if the complexities of dealing with that is constrained to some transactions instead of all of them, you're saving yourself a lot of headaches.

Actually, I'd argue a lot of apps can do entirely without cross-shard transactions! (eg. sharding by B2B orgs)

user

2 months ago

[deleted]

whizzter

2 months ago

Yeah, mgmt (and more than anything, query tools) is gonna be a PITA.

But looking at it in a different way, say building something like Google Sheets.

One could place user-mgmt in one single-threaded database (Even at 200k users you probably don't have too many concurrently modifying administrators) whilst "documents" gets their own database. I'm prototyping one such "document" centric tool and the per-document DB thinking has come up, debugging users problems could be as "simple" as cloning a SQLite file.

Now on the other hand if it's some ERP/CRM/etc system with tons of linked data that naturally won't fly.

Tool for the job.

codeslinger

2 months ago

Disclaimer: ex-AWS here.

This article ends up making a compelling case for DynamoDB. It has the properties he describes wanting. Many, many systems inside of Amazon are built with DDB as the primary datastore. I don't know of any OSS commensurate to DDB, but it would be quite interesting for one to appear.

> "Every transaction can only be in one shard" only works for simple business logics.

You'd be quite surprised at what you can get out of this model. Check out the later chapters of the DynamoDB Book [1] for some neat examples.

[1] https://dynamodbbook.com/

bboreham

2 months ago

The famous OSS database patterned after DynamoDB is https://cassandra.apache.org/

(Wondering if you never heard of it or if you don’t consider it commensurate).

user

2 months ago

[deleted]

somat

2 months ago

This is needlessly pedantic but postgres is not multi threaded, Or at least historically it was not I don't really know if threading is currently being added or if the cloudbased "postgres-buts" have threading. but postgres standard is a pretty good example of the traditional forking unix server, And as someone deeply suspicious of threads in general, just because we could define a shared memory execution modal didn't mean that we should have done it, I think shared memory introduces too many footguns. Sticking with a multi-process model is probably the correct choice for postgress.

Needlessly pedantic because the distinction between multi process with explicit shared memory(shm) vs multi threaded with implicit shared memory probably does not really matter all that much.

qouteall

2 months ago

In real-world business requirements it often need to read some data then touch other data based on previous read result.

It violates the "every transaction can only be in one shard" constraint.

For a specific business requirement it's possible to design clever sharding to make transaction fit into one shard. However new business requirements can emerge and invalidate it.

"Every transaction can only be in one shard" only works for simple business logics.

n2d4

2 months ago

I talk about these problems in the "How hard can sharding be?" section of the article — long story short, not all business requirements can be dealt with easily, but surprisingly many can if you choose a smart sharding key.

You can also still do optimistic concurrency across shards! That covers most of the remaining ones. Anything that requires anything more complex — sagas, 2PC, etc. — is relatively rare, and at scale, a traditional SQL OLTP will also struggle with those.

qouteall

2 months ago

Thanks for reply.

So in my understanding:

- The transactions that only touch one shard is simple

- The transactions that read multiple shards but only write shard can use simple optimistic concurrency control

- The transactions that writes (and reads) multiple shards stay complex. Can be avoided by designing a smart sharding key. (hard to do if business requirement is complex)

qouteall

2 months ago

The optimistic concurrency control that reads multiple shards cannot use simple CAS. It probably needs to do something like two-phase committing

n2d4

2 months ago

That's right!

If you anticipate you will encounter the third type a lot, and you don't anticipate that you will need to shard either way, what I'm talking about here makes no sense for you.

hinkley

2 months ago

Business people have a nasty habit of identifying two independent pieces of data you have and finding ideas to combine them to do something new. They aren’t happy until every piece of data is copied with every other piece and then they still aren’t happy because now everything is horrible because everything is coupled to everything.

rowanG077

2 months ago

in my experience most backends I have worked on people don't use the facilities of their database. They indeed simply hit the database two or more times. But that doesn't mean it's not possible to do better if you actually put more care in your queries. Most of the time multiple transactions can be eliminated. So I don't agree this is a business requirement complexity problem. It's a "it works so it's good enough" problem, or a "lazy developer" problem depending on how you want to frame it.

formerly_proven

2 months ago

This (along with n+1) is somewhat encouraged in business applications due to the prevalence of the repository pattern.

SoftTalker

2 months ago

Give each business or customer its own schema and you almost never need sharding.

n2d4

2 months ago

Yes, but you could also flip it the other way around — make the business or customer your sharding key, and you'll only need to manage one schema!

fmjrey

2 months ago

Sharding of data and compute is precisely what makes Rama [0] able to handle Internet scale topologies to create materialized views (PStates). Only one topology can write to a PState, and each PState has its own partitioning.

And yes, a developer needs to handle the added complexity of querying across partitions, but the language makes that easy.

Effectively Rama has fully deconstructed the database, not just its log, tables, and indexes, but also its query engine. It then gives the developer all the needed primitives and composition logic to handle any use case and schema.

Putting data into database silos and handling the compute separately is the schizophrenia that made everything more complicated: monolith were split into microservices, and databases into logs and nosql stores, each running in separate clusters. The way forward is to have one cluster for both data and compute, and make partitioning a first class construct of the architecture.

[0] https://redplanetlabs.com/

bawolff

2 months ago

That's great if your shards are truly independent of each other, but if not then you just invented a custom transaction layer living in your application code, which sounds way way worse than the original problem.

And quite frankly, i think it is incredibly rare for the shards to both be fine grained and independent in typical oltp DB usecase.

lucyjojo

2 months ago

this is so very very right.

i don't know how it works now but datomic used to have a single-threaded writer (transactor?, if i remember right, been a long time) and offers serialization by default.

https://docs.datomic.com/datomic-overview.html

raggi

2 months ago

We are doing this, and it’s terrible. Having done both at scale this one is worse.

user

2 months ago

[deleted]

ltbarcly3

2 months ago

Wow that's a dumb take. The whole point of ACID is that you can get roughly the same result but have a system that can serve more than 1 user at a time.