B-Trees and Database Indexes

347 points · posted 6 days ago
by samlambert

83 Comments

hinkley

6 days ago

I realized after a few years of doing it that my strategy for keeping Wikis useful is to treat them as B-Trees.

When the landing page gets too full/too many outgoing links, I start pushing links and paragraphs down into the child pages, to leave space for a fair share of timely links and on-boarding docs.

Similar and older links get pushed down into the sibling that best represents the topic. Then if the destination page is now too big, similar and older links get pushed down to their children. Eventually all of the outdated docs are three levels down from the landing page, where only historians and experts will see them. And sometimes as we finally decide how part of the system really should work, siblings get combined into one page, minus the speculative work that gets pushed down deeper in the tree. It works remarkably well. At the end of the day documentation is a search problem.

I highly recommend it for a Friday afternoon exercise when you want to be productive but you know starting a new task is a complete waste of time.

caseyohara

6 days ago

Do you have a recommendation for Wiki software you like to use? My team is in need of an internal knowledge base, and I like the structure of wikis. Most of the SaaS products I've tried or looked at are a bit too shiny/fancy and don't seem to match my mental model of how a wiki-style knowledge base should work.

infogulch

5 days ago

Oxide Computer published their Request for Discussion (RFD) site software [1] that they use for internal published documentation and discussion. Many are published to the public, but some are private. [2] They talk about how they use it and where it came from in their most recent podcast: RFDs: The Backbone of Oxide [3].

I suspect this would be a good 'substitute' for documents that are often hosted on internal wikis.

[1]: https://github.com/oxidecomputer/rfd-site

[2]: https://rfd.shared.oxide.computer/rfd/0001

[3]: https://oxide.computer/podcasts/oxide-and-friends/2065190

zelphirkalt

5 days ago

My recommendation: if you want a wiki for developers, use something that is based on a markup language, sources pages from a git repo, and does not do too much magic behind the scenes. This will result in a wiki that is better maintained, and better liked, than any bloated SaaS wiki.

hinkley

5 days ago

I don't think it really matters which you use. I've unfortunately been stuck in Atlassian for ages.

But if you were shopping for one: from the standpoint of keeping the docs working, being able to see missing pages and to see incoming links to a page are both pretty helpful. I kinda miss the latter.

left-struck

5 days ago

If you can self host, wiki.js is easy to set up in a Docker container. MediaWiki (what Wikipedia runs on) is pretty easy too.

If you have a small team, Obsidian and a syncing solution like git or Obsidian Sync might work.

I was able to work with my company's IT team to set up a wiki which is only accessible from within the network (including by VPN) and is hosted on a VPS.

user

5 days ago

[deleted]

shnock

5 days ago

Confluence, but we're already in deep w Atlassian

yosri-xp

5 days ago

Thanks for the amazing visuals. My team and I worked on B+Tree indexing support on top of Aerospike, as we have several huge data sets (5 TB of data), each belonging to a property that is supposed to have its own ordered-table index.

The challenging part was evicting expired keys from the B+Tree, since inserted keys have TTLs. We decided to fuse only one branch level, and only within the first sibling leaf nodes, because cleaning up all the way to the root would be expensive and cause high lock contention, slowing things down substantially when keys get inserted/deleted/updated.

We also had to shard on top of the B+Tree to speed things up and reduce the lock contention. That way we know which shard a key belongs to, and we lock the branch before performing any create/update/delete, so we can run highly concurrent operations across multiple shards/branches.

The cleanup process has some caveats, and the B+Tree can end up unbalanced, so we had to provide an index-rebuild feature that fixes the gaps when necessary and avoids extra cleanup operations.
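A rough, stdlib-only sketch of that sharding idea (a hypothetical stand-in using a sorted list per shard instead of a real B+Tree, not the actual Aerospike-based implementation):

```python
import bisect
import threading
import zlib

class ShardedIndex:
    """Toy stand-in: one sorted key list (instead of a real B+Tree) per shard."""

    def __init__(self, shards: int = 16):
        self.locks = [threading.Lock() for _ in range(shards)]
        self.keys: list[list[bytes]] = [[] for _ in range(shards)]

    def _shard(self, key: bytes) -> int:
        # Hashing decides which shard (and thus which lock) a key belongs to.
        return zlib.crc32(key) % len(self.keys)

    def insert(self, key: bytes) -> None:
        s = self._shard(key)
        with self.locks[s]:  # contention is limited to a single shard
            bisect.insort(self.keys[s], key)

idx = ShardedIndex()
idx.insert(b"order:42")
```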

Again, thanks for the visuals.

bddicken

5 days ago

You're welcome! Sharding with B-tree indexes... Hmmm, I know a company that does that.

benwilber0

5 days ago

This is why you should _never_ make your UUID column the primary key.

For one, it's enormous. Now you have to copy that 128-bit value to every side of the relation.

Two, in most cases it's completely random, unless you had the forethought to use something other than UUIDv4 (gen_random_uuid). So now you just have a bunch of humongous random numbers clogging up your indexes, duplicated everywhere for no good reason.

Use regular bigserial (64bit) PKs for internal table relations and UUIDs (128bit) for application-level identifiers and natural keys. Your database will be very happy!
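A minimal sketch of that layout (assuming Postgres 13+ for the built-in gen_random_uuid() and psycopg2; table and column names are made up):

```python
# bigserial surrogate PK for joins; UUID only as the external identifier.
import psycopg2

DDL = """
CREATE TABLE users (
    id        bigserial PRIMARY KEY,  -- 64-bit internal key, used by all FKs
    public_id uuid NOT NULL DEFAULT gen_random_uuid() UNIQUE,  -- external handle
    email     text NOT NULL
);

CREATE TABLE orders (
    id        bigserial PRIMARY KEY,
    user_id   bigint NOT NULL REFERENCES users (id),  -- joins stay on the narrow key
    public_id uuid NOT NULL DEFAULT gen_random_uuid() UNIQUE
);
"""

with psycopg2.connect("dbname=app") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```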

espadrine

5 days ago

> Use regular bigserial (64bit) PKs for internal table relations and UUIDs (128bit) for application-level identifiers and natural keys

These recommendations are database-specific. It is definitely true of MySQL, since its row storage hinges on the primary key (and other indices yield that primary key). To have a hot page cache, you need to always add to the same page, so an incremented primary index is preferred. A UUIDv4 secondary index would still have cold page caches though, defeating the purpose. (Unless using UUIDv7.)

But in CockroachDB, the data is distributed. Random INSERTs actually have a speedup over sequential ones, as those insertions can happen in parallel (on different machines). Similarly, SELECTs are faster if the primary index is random, because a single machine is not overloaded; it effectively acts as a load balancer.
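For reference, a hand-rolled sketch of the UUIDv7 mentioned above, following the RFC 9562 field layout (ordinarily you'd use a library; this just shows why such keys are index-friendly: the leading 48 bits are a millisecond timestamp, so values sort by creation time):

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    # 48-bit unix timestamp in milliseconds, then version/variant bits,
    # then 74 random bits, per the RFC 9562 layout.
    ts_ms = time.time_ns() // 1_000_000
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF           # 12 bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)  # 62 bits
    value = (ts_ms & ((1 << 48) - 1)) << 80
    value |= 0x7 << 76    # version 7
    value |= rand_a << 64
    value |= 0b10 << 62   # RFC 4122 variant
    value |= rand_b
    return uuid.UUID(int=value)

print(uuid7())  # newer values sort after older ones, millisecond by millisecond
```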

franckpachot

4 days ago

It depends on the use cases and performance goals. You may want to distribute the rows that you insert, and then a random UUID makes sense. However, that is too much distribution for B-Tree indexes, and the problem is not only cache misses but also the amount of modification caused by leaf block splits. This includes MySQL, which stores the primary key in a B-Tree index. Other use cases may benefit from colocating the rows that are inserted together. Think of timeseries, or simply an order-entry system where you query the recent orders. A sequence makes sense there, to have a good correlation between the index (on time) and the primary key. This avoids too many random reads with low cache hits.

It is wrong to think that distributed databases do not need sequences. YugabyteDB allows it: you use hash sharding to distribute them over a small number of hash ranges, so that they don't all go to the same place but are not scattered across the whole database either. CockroachDB and Spanner don't have hash sharding, and that's why they do not recommend sequences. There are also use cases where range sharding on the sequence is good: when you don't need to distribute the data ingest, but benefit from colocation when querying.
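A generic sketch of that "small number of hash ranges" idea (a hypothetical helper, not YugabyteDB's actual implementation; the bucket count is made up): prefix the sequence with a hash bucket, so inserts spread across a few ranges while each range stays ordered.

```python
import zlib

def bucketed_key(seq: int, buckets: int = 16) -> tuple[int, int]:
    # The bucket spreads sequential inserts over a few ranges; within a
    # bucket, keys remain ordered by the sequence (and thus by time).
    bucket = zlib.crc32(seq.to_bytes(8, "big")) % buckets
    return (bucket, seq)

print([bucketed_key(n) for n in range(4)])
```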

paperplatter

5 days ago

Yeah, Spanner is also clear about this. It doesn't even have sequences, and their docs say to use random pkeys rather than time-dependent things like uuid7.

thesz

5 days ago

If you sort UUIDs, there will be a lot of common prefixes. B-trees implicitly sort data, so you can factor the common prefix out of all keys on a B-tree page.

But!

UUIDs are random, and a B-tree will have increased fragmentation after a short while.

I once tried to insert a puny 1 million scale-free graph edges into a B-tree (BerkeleyDB) in 1K batches and it failed miserably: I waited for an hour and then killed it. LSM trees were orders of magnitude faster at 100K edges, so BDB showed that B-trees are no match there.

B-trees are semi-static data structures; they are hard to rebuild incrementally if data is random. But they shine if input keys are sorted.

Use UUIDs as you like if your storage engine is an LSM tree. Use staged sorting (an LSM tree in disguise) if you use a B-tree.
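The shared-prefix point from the first paragraph is easy to demonstrate (sizes here are made up): sort many random UUIDs, take one "page" of adjacent keys, and factor out what they share.

```python
import os
from uuid import uuid4

keys = sorted(uuid4().hex for _ in range(100_000))
page = keys[50_000:50_008]  # 8 adjacent keys stand in for one leaf page
prefix = os.path.commonprefix(page)
print(f"shared prefix {prefix!r}, suffixes {[k[len(prefix):] for k in page]}")
```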

kijin

5 days ago

You're assuming that people store UUIDs as 128-bit ints. That's overly generous. I know people who use varchar without a second thought, more than doubling the storage requirement!

paulryanrogers

5 days ago

Won't you still need indexes on those UUIDs anyway? And possibly have to do more joins to resolve them?

dspillett

5 days ago

> Won't you still need indexes on those UUIDs anyway?

Depends on how index pages are structured in the DB. With MySQL (assuming the InnoDB engine) the primary key is pretty much always a clustered index; in MS SQL Server this is the case more often than not too. This means that any page expansion, due to splits from the randomness of the data being inserted, affects the whole table, not just the key index, and rebuilding to reclaim the wasted space later will take a lot longer as you are moving all your base data around. With the UUID as a supplementary key, likely indexed on its own, all that gets enlarged by excess page splitting is those 128-bit values, not the entire rows, and an index rebuild to fix that up after the fact moves a lot less data around.

So yes, you have an extra index on top of your primary key and other additional ones needed by your apps and reports, but not having a random UUID as your clustering key can be a significant benefit. Using more ordered UUIDs minimises this difference considerably though.

Even ignoring the random-key-causes-page-splits issue, if you've mitigated it with more ordered UUIDs, a wider primary key increases the size of all supplementary indexes assuming MySQL arranges things similarly to SQL Server: rather than having a hidden page/row identifier (as SQL Server does without any clustered key defined, it calls such arrangements heap tables), each supplementary index includes the clustering key value with every row. So while having a 32-bit primary key for internals and the 128-bit UUID as an extra key adds 4 bytes per row (for the extra INT32) to the base data, it saves 12 bytes from each row in each non-clustered index (as the INT32 is included, not the UUID).
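To put rough numbers on that trade-off (row and index counts are made up):

```python
rows = 10_000_000   # assumed table size
indexes = 3         # assumed number of non-clustered indexes

extra_base = rows * 4               # the added INT32 column in the base data
saved = rows * (16 - 4) * indexes   # 16-byte UUID -> 4-byte INT32 per index row
print(f"net saving: {(saved - extra_base) / 2**20:.0f} MiB")
```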

> And possibly have to do more joins to resolve them?

Usually not. Usually when you have both, the INT32 (or sometimes INT64) surrogate key is for all internal use and a UUID only for external references, so the UUIDs are only important as the initial filter and likely not taking part in a JOIN at all. After the initial filter to find the item you want in the main table of the query, the JOINs to collect data from other tables will all be by the surrogate (INT) keys. The UUID is almost never used as a foreign key reference in this arrangement.

paperplatter

5 days ago

Yes, if a client gives a UUID and you want to look up relations to that table, you'll have to join with that table. And the index on the UUID has its own costs. It's still better this way. If for some rare reason this join is too expensive, there are other options like using a cache or denormalizing the tables slightly. The alternative of a UUID pkey will seriously slow down every join.

One question is whether you do random or k-sorted UUIDs for a secondary key. K-sorted is likely faster, but in many cases the difference is small enough that you'd rather take the easy random route which is also guaranteed not to leak any info.

benwilber0

5 days ago

You only need one index on the UUID, instead of everywhere the UUID is referenced from other tables.

cryptonector

4 days ago

Large keys are not a problem. See the SQLite3 docs about WITHOUT ROWID tables. Essentially any interesting multi-column key will be large, so don't worry about it. On the other hand, while INTEGER PRIMARY KEYs are nice and small, you end up with one extra index in SQLite3 if you have an integer key and also some other key that is really your primary key (or equivalently if you have the primary key you want and you didn't use WITHOUT ROWID).

A covering index always has large keys too.

In SQLite3 the trade-off made is that B-trees have either integer keys and arbitrary values, or arbitrary keys and integer values. When using WITHOUT ROWID the whole row is the key. (IIRC)
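A quick illustration of the WITHOUT ROWID trade-off with the stdlib driver (a toy schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE kv (
        k TEXT PRIMARY KEY,  -- with WITHOUT ROWID the whole row lives in this b-tree
        v TEXT
    ) WITHOUT ROWID
""")
con.execute("INSERT INTO kv VALUES ('some-large-natural-key', 'value')")
print(con.execute("SELECT v FROM kv WHERE k = 'some-large-natural-key'").fetchone())
```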

dspillett

5 days ago

> Use regular bigserial (64bit) PKs for internal table relations and UUIDs (128bit) for application-level identifiers and natural keys.

If you are worried about size ballooning, rather than the randomness, make sure you actually need more than 32-bit standard integers (well, 31-bit as they tend to be signed and we usually start from 0 or 1 not -2,147,483,648).

Avoiding roll-over issues is sometimes a valid concern with 32-bit INTs, but I've seen people specify 64-bit surrogate keys when their data is not likely to grow to where 31 bits would be an issue in the next few hundred years… There is an argument that 64-bit values are more efficient in modern CPUs, but I've not seen any good tests that show a measurable difference in the context of DB keys, where there are other overheads to consider due to the doubling in size.
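Back-of-envelope on those horizons (insert rates are assumptions):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

for bits in (31, 63):              # signed INT32 vs signed INT64 headroom
    for rate in (10, 100, 1000):   # assumed inserts per second
        years = 2**bits / rate / SECONDS_PER_YEAR
        print(f"{bits}-bit at {rate:>4}/s lasts {years:,.1f} years")
```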

paperplatter

5 days ago

It's not just about the data size. There are reasons your sequence can increment without a row actually being inserted.

dspillett

4 days ago

Yep.

- Transactions that are explicitly rolled back, which have inserted into a table with an IDENTITY (or equivalent) column, almost always result in such a gap. Though if this is happening enough that it pushes the count up by an order of magnitude over time, you probably have a design problem…

- To reduce locking issues on the counter, many DBs hold a cache of values (SQL Server allocates 1,000 at a time) ahead of the persisted count, so after a shutdown, especially an unclean shutdown, or a fail-over if using one of the clustering/mirroring options for HA, it is possible that most of these will be skipped. If this is happening enough to be a long-term problem then you have significant infrastructure issues. Such a cache could be per-thread/-process (IIRC it isn't in SQL Server), which is one of the reasons that IDENTITY column values can seem to be out-of-order in high-concurrency environments (though you should never care about the actual order of values in such columns, so that is a non-issue).

- In some DR processes where old DBs are restored, the values may be artificially bumped after restore if used values might have been exposed to an external service (to avoid the confusion caused if a number is used twice from the point of view of that external service). This is an extra argument against exposing internal ID values to external systems, and in favour of having an additional UUID-based identifier for such purposes.

- Also, though this is the most obvious one and hardly needs stating, deleted rows won't “return” their now-unused number to be reused. You should scale the type based on rows expected to be inserted, not rows expected to be kept long-term.

If you are expecting hundreds of millions of rows in the lifetime of the table then a type larger than 32-bit is worth considering despite the extra storage needed in data and index pages. I'd not criticise considering it for tens of millions if you want to be more paranoid. Otherwise, 32-bit is almost certainly fine and 64-bit is overkill.

Sophistifunk

5 days ago

It's very interesting to me that we have to keep telling people this, but it hasn't become part of the "hive folk knowledge" we all seem to develop. I think DB vendors have been sleeping on an opportunity to encourage better practices.

paxys

5 days ago

DB vendors haven't done enough to offer ID generation as a core part of their system. Ideally "what ID do I use for this object" shouldn't even be a consideration, because of course the database should handle it. It is the system of record after all. Yet your options are pretty much limited to UUID or a basic incremental counter that fails to meet any real world production constraints.

hobs

5 days ago

Basic incremental counters work for most real-world production constraints. Most people are not going to create tables with 4.2 billion rows, even with failed inserts. If you are doing that, it's an extreme where either you very much know what you are doing, or you very much do not; I have seen both in production.

paxys

5 days ago

What if I have multiple partitions? Replication? What if I don't want business data to be exposed due to strictly incremental counters? What if I want unique IDs across different tables?

sgarland

5 days ago

Partitions should not impact the use of an INT PK, except that you’ll need to include the partition key in the PK, e.g. (id, created_at) if partitioning by datetime. The displayed ordering without an explicit ORDER BY may not make sense, but to be fair, there are never any guarantees about implicit order.

Replication should be fine, unless you mean active-active, in which case I suggest (a) not doing that, or (b) using interleaved chunks or a coordinator node that hands them out.

Business data exposure can be avoided (if it’s actually a problem, and not just a theoretical one) in a variety of ways; two of the most common are:

* Don’t use the id in the slug.

* Have an iid column that's random and exposed, while keeping the integer as the PK.

If you need unique IDs across tables, then I question your use of an RDBMS, because you aren’t really making use of the relational aspect.

hobs

4 days ago

I could not have said it better myself. I would also add that I keep the slug as an entirely separate column (or "user-visible id") that they can change. I've had too many systems do things like "invoice id is auto-generated" and then a customer coming back and saying "the invoice id has to be this or the auditors will scream!" Don't expose internals of your database to your users and you won't have a bad time.

erik_seaberg

5 days ago

A hundred inserts per second is going to hit 2^32 within a year and a half. I've seen that volume repeatedly. A colleague has seen this limit blow up prod. Do you really want to spend time on a project you're sure can never succeed to this extent?

hobs

4 days ago

Yep, seen it many times, and I can't tell you how many thousands of times I have seen tables with UUIDv4 primary keys for "future safety" that have 12 rows.

Converting int to bigint is not a big project; I have done it more times than I can count, just like any database evolution. I have also had customers say "oh, we'll just drop and recreate the table every year because it's just some trash data" or "oh wait, we didn't mean to create 100 rows every second of every day, that's a bug and it's costing us a lot of money for something we don't care about."

There's no one-size-fits-all in databases, but most people who don't know better don't need to design a database for scale... because their choices won't work in the long term anyway.

erik_seaberg

4 days ago

That table is a self-limiting problem. It's true that failure is likely, but optimizing for that outcome never adds value, and it would probably make me quit.

paperplatter

5 days ago

Because we keep telling people the opposite in academic settings, where the pkey is usually some actual data field(s) you expect to be unique. There's some point to teaching 3NF this way, but it shouldn't be taken literally.

fabian2k

5 days ago

There are cases where it is really useful to have only universally unique IDs, e.g. if you have multi-tenant systems and at some point you need to move tenants to a different server/instance or merge tenants on the same DB.

weaksauce

5 days ago

They do mention that it is good to have both. Also, using a newer UUID version that is temporally sortable is wise.

> Use regular bigserial (64bit) PKs for internal table relations and UUIDs (128bit) for application-level identifiers and natural keys. Your database will be very happy!

wbsun

5 days ago

Isn’t there also a hash-based index for random keys?

benwilber0

5 days ago

Maybe? idk. Not in Postgres. The default index is a B-Tree. A hash-based index would be terrible for disk-seeking, in any case.

paperplatter

5 days ago

Postgres has it, but it didn't used to, and it's still got caveats.

Beyond Postgres, indexing random values is fundamentally harder than indexing sorted ones, whether you've got a hashmap or btree or something else. The often-cited O(1) lookup complexity of a hashmap assumes everything easily fits in uniform-access memory.
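For completeness, creating one in Postgres looks like this (psycopg2 assumed; table and column names are hypothetical). Hash indexes support only equality lookups and only became crash-safe in Postgres 10:

```python
import psycopg2

conn = psycopg2.connect("dbname=app")
conn.autocommit = True  # keep the DDL outside an explicit transaction
conn.cursor().execute(
    "CREATE INDEX users_public_id_hash ON users USING hash (public_id)"
)
```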

benwilber0

5 days ago

Probably the worst PK index of all time. There's a reason why it's barely ever used.

srcreigh

5 days ago

Nothing wrong with a 128-bit primary key. Postgres is a horrible DB that can't order rows on disk, so it doesn't matter whether you use UUIDs or not; you're SOL.

In fact, you could use a 512-bit primary key with MySQL and range queries would still be 10x faster and use 10x less space in RAM than in Postgres. This article explains why that is.

jashmatthews

5 days ago

Clustered tables and sequential keys have their own downsides though, like lock contention on the "last" page.

srcreigh

5 days ago

Not really a problem in practice.

Most clustered tables don't have a single last page. For example, it's common to order a table by (user_id, pk) so a user's data is grouped together. Dropbox did this.

For the ones which do have a single last page, it's easy to remove that; you don't have to use sequential IDs. Use a UUID.

Basically this problem only happens when you cluster globally by mistake. Just don't make that mistake.

chipdart

5 days ago

> Nothing wrong with 128bits primary key. Postgres is a horrible DB that can’t order rows on disk, so it doesn’t matter if you use UUIDs or not, you’re SOL.

Why do you believe a database should order rows on disk?

srcreigh

5 days ago

In Postgres every read happens in increments of 8 KB.

If your rows are not ordered on disk, the amount of data you need to load to query 100 rows is insane.

A 10 KB query result (100 rows × 100 bytes) can require almost 1 MB of data to be loaded. It's 99% waste.

Postgres is 100x slower than other DBs for range queries

Every single other database in existence has this feature except for Postgres
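The arithmetic behind the 99% figure above, for the worst case where every row lands on a different cold page:

```python
rows, row_bytes, page_bytes = 100, 100, 8192

useful = rows * row_bytes        # 10 KB of result data
worst_case = rows * page_bytes   # one 8 KB page read per scattered row
print(f"{useful} useful bytes of {worst_case} read: "
      f"{100 * (1 - useful / worst_case):.1f}% waste")
```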

stoperaticless

5 days ago

1. Faster access by that column. 2. Other DBs allow it/default to it.

is_true

5 days ago

The cookie modal doesn't work on Firefox mobile, and it takes half the height of the screen.

Why not let the user set that up in their browser?

crabmusket

5 days ago

It has such an enticing "reject optional" button next to "accept all" and I was so impressed that they'd actually made the opt-out flow as easy as the opt-in flow... until I tried to use it. It's just maliciously incompetent at this point.

bddicken

4 days ago

The "reject all" button works better now, thanks for bringing this up.

crabmusket

3 days ago

Great to hear! Just checked it again on FF/Android and it works just how I'd have expected. Thanks for the update!

bddicken

5 days ago

I'm sorry about this! Working on a fix.

vanderZwan

5 days ago

> it takes half the height.

Ah so that's what the "planet scale" name refers to /j

is_true

5 days ago

It takes all the screen space from latitude 0° to -90°!!!

seanman

5 days ago

I have been looking for something like this for so long, amazing post. I would love a section on composite indexes. That is something that I still have a hard time visualizing…

tpetry

5 days ago

I invented a new way of visualising composite ones when I wrote my book about indexing. You can scroll down on the landing page to the fundamentals chapter to see it.

https://sqlfordevs.com/ebooks/indexing

bddicken

5 days ago

Thanks Sean! Yeah that would be very cool to have an interactive visual for that as well. So many possibilities!

VeejayRampay

6 days ago

beautiful interactive visualizations, this is top shelf in terms of pedagogy and vulgarization

eclectic29

6 days ago

Slightly off topic: Learnt a new word today 'vulgarization' which seems to have a completely different meaning from the obvious. Thanks.

egwynn

6 days ago

Note that, in the abstract, “vulgar” means “common” (as in “vulgar latin”). Indeed, its negative connotations come from that same sense: “common” people are unrefined.

hinkley

6 days ago

The association between vulgarity and propriety (and class distinctions) sort of ruins that word, particularly in the english speaking west.

I wonder if that's as big of a problem in the romance languages (which all treat left/right the same way - left = bad, right = good)

LtdJorge

5 days ago

Yes, in Spanish "vulgar" is used to mean inappropriate. We have "el vulgo" (el pueblo, the people), which kinda teaches you the correct meaning: popular, unrefined. But "vulgo" is seldom used.

jjgreen

6 days ago

Indeed: are you sinister or dexterous?

nickpeterson

5 days ago

As a left-handed contrarian, I’ve always enjoyed that sinister and left handed go hand in hand.

hinkley

5 days ago

In French the same word for “right” means the same notion in English for

- direction

- straight ahead

- civics

- propriety

egwynn

5 days ago

This goes pretty deep in English. I'd argue that the semantic intention behind the colloquial usage of "vulgar" is nearly inseparable from the "class distinction" baggage it carries. Consider these common synonyms and their etymologies:

- Rude: "coarse, rough, unfinished, unlearned" (https://www.etymonline.com/word/rude#etymonline_v_16610)

- Mean: "shared by all, common, general" (https://www.etymonline.com/word/mean#etymonline_v_12495)

And even synonyms like obscene, indecent, or disgusting, which don't evoke this distinction directly, still almost always ultimately rely on separating things based on what is "good" and "clean" according to class distinctions.

bddicken

6 days ago

That's the goal! Thanks for the kind words.

sroussey

5 days ago

Awesome article!

I only wish that the article's reference to InnoDB storing data in the B-tree itself had mentioned that this is otherwise referred to as a clustered index.

MyISAM, before it, was non-clustered.

Oracle and others let you choose.

bddicken

4 days ago

Good point! Yes, clustered index is one of the correct terms.

tnvmadhav

5 days ago

Great piece of education. Interactive demos like these help a lot.

dorbodwolf

5 days ago

> If our disk block and B-tree node is 16k, and our keys, values, and child pointers are all 8 bits, this means we could store 682 key/values with 683 child pointers per node. A three level tree could store over 300 million key/value pairs (682 × 682 × 682 = 317,214,568).

Should be 8 bytes per element?
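Checking the quoted arithmetic with 8-byte elements:

```python
node_bytes = 16 * 1024

# 682 keys + 682 values + 683 child pointers, 8 bytes each, fit in one node:
print((682 + 682 + 683) * 8, "<=", node_bytes)  # 16376 <= 16384
print(682 ** 3)                                 # 317,214,568 key/value pairs
```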

misonic

5 days ago

May I ask what the v0, v1, ... v10 mean in those graphs? Different pages?

bddicken

4 days ago

It's meant to represent the values associated with the keys being inserted. Having the "v" for "value" there helps to differentiate it.

s4i

5 days ago

I think it's "value #0", "value #1", etc. Confused me too.

user

6 days ago

[deleted]

modriano

5 days ago

[flagged]

samlambert

5 days ago

that's not meant to happen. we will fix.

finnh

5 days ago

wow! That was honestly even worse than you made it out to be.