bcrl
2 months ago
There's a simple solution: don't use mmap(). There's a reason that databases use O_DIRECT to read into their own in-memory cache. If it was Good Enough for Oracle in the 1990s, it's probably Good Enough for you.
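For illustration, a minimal sketch of the O_DIRECT approach -- read into an aligned, application-owned buffer instead of faulting through the page cache. The file name and the 4096-byte alignment are assumptions; O_DIRECT requires the buffer, offset and length to be aligned to the device's logical block size.

```c
/* Minimal sketch: read one block with O_DIRECT into an aligned,
 * application-owned buffer. "data.bin" and the 4096-byte block size
 * are placeholders. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t blksz = 4096;              /* assumed logical block size */
    void *buf = NULL;

    if (posix_memalign(&buf, blksz, blksz) != 0) {
        perror("posix_memalign");
        return 1;
    }

    int fd = open("data.bin", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    ssize_t n = pread(fd, buf, blksz, 0);   /* bypasses the kernel page cache */
    if (n < 0)
        perror("pread");
    else
        printf("read %zd bytes into our own cache buffer\n", n);

    close(fd);
    free(buf);
    return 0;
}
```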
mmap() is one of those things that looks like it's an easy solution when you start writing an application, but that's only because you don't know the complexity time bomb of what you're undertaking.
The entire point of the various ways of performing asynchronous disk I/O using APIs like io_uring is to manage when and where blocking of tasks for I/O occurs. When you know where blocking I/O gets done, you can make it part of your main event loop.
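A rough sketch of that pattern with liburing -- submit the read, go back to the event loop, and reap the completion when it's convenient. The file name, queue depth and buffer size are placeholders.

```c
/* Rough sketch: one async read submitted and reaped via liburing.
 * "data.bin", the queue depth of 8 and the 4 KiB buffer are placeholders. */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) {
        perror("io_uring_queue_init");
        return 1;
    }

    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char buf[4096];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_sqe_set_data(sqe, buf);   /* tag so the event loop can route the completion */
    io_uring_submit(&ring);            /* returns immediately; nothing blocked on the read */

    /* The event loop decides when to collect completions. */
    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(&ring, &cqe) == 0) {
        printf("read completed: %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```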
If you don't know when or where blocking occurs (be it on I/O or mutexes or other such things), you're forced to make up for it by increasing the size of your thread pool. But larger thread pools come with a penalty: task switches are expensive! Scheduling is expensive! AVX-512 registers alone are 2KB of state per task, and if a thread hasn't run for a while, you're probably missing in your L1 and L2 caches. That's pure overhead baked into the thread pool architecture that you can entirely avoid by using an event driven architecture.
All the high performance systems I've worked on use event driven architectures -- from various network protocol implementations (protocols like BGP on JunOS, the HA functionality) to high speed (persistent and non-persistent) messaging (at Solace). It just makes everything easier when you're able to keep threads hot and locked to a single core. Bonus: when the system is at maximum load, you stay at pretty much the same number of requests per second rather than degrading as the number of runnable threads grows and wastes your CPU resources needlessly when you need them most.
It's hard to believe that the event queue architecture I first encountered on an Amiga in the late 1980s when I was just a kid is still worth knowing today.
leo_e
2 months ago
You're right. O_DIRECT is the endgame, but that's a full engine rewrite for us.
We're trying to stabilize the current architecture first. The complexity of hidden page fault blocking is definitely what's killing us, but we have to live with mmap for now.
bcrl
2 months ago
I am curious -- what is the application and the language it's written in?
There are insanely dirty hacks that you could do to start controlling the fallout of the page faults (like playing games with userfaultfd), but they're unmaintainable in the long term as they introduce a fragility that results in unexpected complexity at the worst possible times (bugs). Rewriting / refactoring is not that hard once one understands the pattern, and I've done that quite a few times. Depending on the language, there may be other options. Doing an mlock() on the memory being used could help, but then it's absolutely necessary to carefully limit how much memory is pinned by such mappings.
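A minimal sketch of the mlock() idea, assuming a hypothetical data file and an arbitrary 16 MiB cap on how much of the mapping gets pinned:

```c
/* Minimal sketch: pin only a bounded slice of an mmap'd file with mlock().
 * "data.bin" and the 16 MiB cap are assumptions; real code has to watch
 * RLIMIT_MEMLOCK and the total amount of pinned memory. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    void *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* Pin only the hot prefix, never the whole mapping. */
    size_t cap = 16u << 20;
    size_t pin = st.st_size < (off_t)cap ? (size_t)st.st_size : cap;
    if (mlock(map, pin) < 0)
        perror("mlock");   /* may need CAP_IPC_LOCK or a raised RLIMIT_MEMLOCK */

    /* ... accesses within [map, map + pin) won't page-fault to disk ... */

    munlock(map, pin);
    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```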
Having been a kernel developer for a long time makes it a lot easier to spot what will work well for applications versus what can be considered glass jaws.
man8alexd
2 months ago
There is a database that uses `mmap()` - RavenDB. Their memory accounting is utter horror - they somehow use Committed_AS from /proc/meminfo in their calculations. Their recommendation to avoid OOMs is to have swap twice the size of RAM. Their Jepsen test results are pure comedy.
otterley
2 months ago
LMDB uses mmap() as well, but it only supports one process holding the database open at a time. It's also not intended for working sets larger than available RAM.
hyc_symas
2 months ago
Wrong, LMDB fully supports multiprocess concurrency as well as DBs multiple orders of magnitude larger than RAM. Wherever you got your info from is dead wrong.
Among embedded key/value stores, only LMDB and BerkeleyDB support multiprocess access. RocksDB, LevelDB, etc. are all single process.
otterley
2 months ago
My mistake. Doesn’t it have a global lock though?
Also, even if LMDB supports databases larger than RAM, that doesn't mean it's a good idea to have a working set that exceeds that size. Unless you're claiming it's scan resistant?
hyc_symas
2 months ago
It has a single writer transaction mutex, yes. But it's a process-shared mutex, so it will serialize write transactions across an arbitrary number of processes. And of course, read transactions are completely lockfree/waitfree across arbitrarily many processes.
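For illustration, a minimal read-only open-and-get against an existing LMDB environment -- any number of processes can run this concurrently against the same directory. The database path and key are placeholders.

```c
/* Minimal sketch: read-only LMDB lookup. "./db" and the key "hello"
 * are placeholders; the environment must already exist. */
#include <lmdb.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    MDB_env *env;
    MDB_txn *txn;
    MDB_dbi dbi;
    MDB_val key, data;

    if (mdb_env_create(&env) != 0) return 1;
    if (mdb_env_open(env, "./db", MDB_RDONLY, 0664) != 0) return 1;

    /* Read transactions work against a snapshot of the tree and don't
     * block writers in other processes. */
    if (mdb_txn_begin(env, NULL, MDB_RDONLY, &txn) != 0) return 1;
    if (mdb_dbi_open(txn, NULL, 0, &dbi) != 0) return 1;

    key.mv_size = strlen("hello");
    key.mv_data = "hello";
    if (mdb_get(txn, dbi, &key, &data) == 0)
        printf("value: %.*s\n", (int)data.mv_size, (char *)data.mv_data);

    mdb_txn_abort(txn);    /* read-only txns are simply discarded */
    mdb_env_close(env);
    return 0;
}
```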
As for working set size, that is always merely the height of the B+tree. Scans won't change that. It will always be far more efficient than any other DB under the same conditions.
otterley
2 months ago
> As for working set size, that is always merely the height of the B+tree.
This statement makes no sense to me. Are you using a different definition of "working set" than the rest of us? A working set size is application and access pattern dependent.
> It will always be far more efficient than any other DB under the same conditions
That depends on how broadly or narrowly one defines "same conditions" :-)
hyc_symas
2 months ago
Identical hardware, same RAM size, same data volume.
otterley
2 months ago
That’s a bold claim. Are you saying that LMDB outperforms every other database on the same hardware, regardless of access pattern? And if so, is there proof of this?
hyc_symas
2 months ago
Plenty of proof. http://www.lmdb.tech/bench/
hyc_symas
2 months ago
You don't have to take my word for it. Plenty of other developers know. https://www.youtube.com/watch?v=CfiQ0h4bGWM
otterley
2 months ago
Since the first question of my two-part inquiry was not explicitly answered in the affirmative: to be absolutely clear, you are claiming, in writing, that LMDB outperforms every other database there is, regardless of access pattern, using the same hardware?
hyc_symas
2 months ago
Not every.
LMDB is optimized for read-heavy workloads. I make no particular claims about write-heavy workloads.
Because it's so efficient, it can retain more useful data in memory than other DBs for a given RAM size. For DBs much larger than RAM it will get more useful work done with the available RAM than other DBs. You can examine the benchmark reports linked above; they provide not just the data but also the analysis of why the results are as they are.