EloqKV: Achieving Predictable P99.99 Latency on NVMe with Redis API

16 points, posted 6 hours ago
by hubertzhang

5 Comments

hubertzhang

6 hours ago

Most Redis alternatives that use disk for persistence struggle with tail latency (P99.99) due to background maintenance or OS filesystem overhead. We built EloqKV on a custom storage engine, EloqStore, to solve this.
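For readers less familiar with the metric, here is a minimal sketch of how a P99.99 figure can be computed from raw latency samples using the nearest-rank method. The sample data and the `percentile` helper are hypothetical illustrations, not taken from EloqKV's benchmarking code.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100.0 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical data: 100,000 requests at 100 us, of which 20 hit a
# 50 ms stall (e.g. background maintenance). That is 0.02% of requests.
latencies_us = [100] * 99_980 + [50_000] * 20

p99 = percentile(latencies_us, 99)
p9999 = percentile(latencies_us, 99.99)
print(p99, p9999)
```

With this made-up data the stalls are invisible at P99 but dominate P99.99, which is why the higher percentile is the one that exposes background-maintenance pauses.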

Key Architectural Choices:

- Custom B-tree Variant: Unlike LSM-trees used in many disk-backed stores, our B-tree variant avoids the "compaction stalls" that typically cause high tail latency during heavy writes.

- Coroutines & io_uring: We leverage io_uring for asynchronous I/O and use coroutines to manage thousands of concurrent I/O requests without thread context-switching overhead.

- Object Storage Integration (optional): EloqStore can use object storage as the primary persistent layer, with NVMe acting as a high-speed cache tier, providing durability without sacrificing speed.
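To illustrate the coroutine point above, here is a rough sketch of one thread keeping many I/O requests in flight at once. It uses Python's `asyncio` purely as a stand-in: it does not use io_uring, `asyncio.sleep` is a placeholder for an asynchronous NVMe read, and `read_page` / `read_many` are hypothetical names rather than EloqStore APIs.

```python
import asyncio

async def read_page(page_id, in_flight):
    # Track how many requests are outstanding at the same time.
    in_flight["now"] += 1
    in_flight["peak"] = max(in_flight["peak"], in_flight["now"])
    await asyncio.sleep(0.01)  # placeholder for an async disk read
    in_flight["now"] -= 1
    return page_id, b"x" * 16  # fake page contents

async def read_many(page_ids):
    in_flight = {"now": 0, "peak": 0}
    # Each coroutine suspends at its await, so a single thread can
    # issue all reads before any of them completes.
    results = await asyncio.gather(
        *(read_page(p, in_flight) for p in page_ids)
    )
    return dict(results), in_flight["peak"]

pages, peak = asyncio.run(read_many(range(1000)))
print(len(pages), "pages read, peak requests in flight:", peak)
```

The design point is that suspending a coroutine is a cheap user-space operation, so the number of outstanding requests is not tied to the number of OS threads.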

We’ve reached a point where we can provide predictable P99.99 latency even when the working set is primarily on NVMe. We’d love to answer any questions about the storage internals or our benchmarking process.

the_precipitate

6 hours ago

With DRAM prices this high, this is certainly a welcome feature. But how do you control write latency? B+ trees are pretty bad at updates, and LMDB, another B-tree-based store, is lightning fast on reads but quite slow on writes compared with RocksDB.

iamlintaoz

6 hours ago

The disk storage EloqKV uses (EloqStore [1]) is optimized for batch updates because the upper Data Substrate layer manages buffering and the Write-Ahead Log (WAL), absorbing writes and guaranteeing durability. When durability is not required, the WAL can be disabled.
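A minimal sketch of the general pattern described here: an upper layer logs each write to a WAL, buffers it in memory, and hands the storage engine one batch at a time. Everything in it is an assumption for illustration (the `BufferedStore` class, the JSON log format, the batch size, and the in-memory dict standing in for the disk engine); it is not how the Data Substrate layer or EloqStore is actually implemented.

```python
import json
import os
import tempfile

class BufferedStore:
    def __init__(self, wal_path, batch_size=3, use_wal=True):
        self.wal_path = wal_path
        self.batch_size = batch_size
        self.use_wal = use_wal      # the WAL is optional
        self.buffer = {}            # pending writes, newest value per key
        self.store = {}             # stand-in for the on-disk engine
        self.batches_applied = 0

    def put(self, key, value):
        if self.use_wal:
            # Make the write durable before acknowledging it.
            with open(self.wal_path, "a") as wal:
                wal.write(json.dumps([key, value]) + "\n")
                wal.flush()
                os.fsync(wal.fileno())
        self.buffer[key] = value
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        self.store.update(self.buffer)  # one batch update to the engine
        self.buffer.clear()
        self.batches_applied += 1
        if self.use_wal:
            # A real system would truncate only after the batch is durable.
            open(self.wal_path, "w").close()

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        return self.store.get(key)

    def recover(self):
        # Replay logged writes that never reached the engine.
        if self.use_wal and os.path.exists(self.wal_path):
            with open(self.wal_path) as wal:
                for line in wal:
                    key, value = json.loads(line)
                    self.buffer[key] = value

wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
db = BufferedStore(wal_path)
db.put("a", "1")
db.put("b", "2")                    # still buffered, engine untouched

restarted = BufferedStore(wal_path)  # simulate a restart after a crash
restarted.recover()
print(restarted.get("a"), restarted.get("b"))
```

In this sketch the engine only ever sees whole batches, while the WAL covers the window in which acknowledged writes exist only in memory.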

[1] github.com/eloqdata/eloqstore

Disclaimer: I am the CEO of EloqData

hubertzhang

6 hours ago

We use a batch-write optimization: a copy-on-write B-tree variant enables high-throughput batch writes without blocking concurrent reads. The MVCC-based design eliminates lock contention and provides predictable write amplification.
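As a rough illustration of why copy-on-write lets readers proceed during a batch write, here is a toy path-copying tree. It is a plain binary search tree rather than a B-tree, and `Node`, `insert`, `lookup`, and `apply_batch` are hypothetical names; the sketch only shows the general idea that a batch produces a new root while old roots remain valid snapshots.

```python
from typing import NamedTuple, Optional

class Node(NamedTuple):
    key: str
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def insert(node, key, value):
    """Return a new root. Only nodes on the search path are copied;
    every other subtree is shared with the previous version."""
    if node is None:
        return Node(key, value)
    if key < node.key:
        return node._replace(left=insert(node.left, key, value))
    if key > node.key:
        return node._replace(right=insert(node.right, key, value))
    return node._replace(value=value)

def lookup(node, key):
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None

def apply_batch(root, batch):
    # Build the new version privately, then publish it as one new root.
    for key, value in batch:
        root = insert(root, key, value)
    return root

v1 = apply_batch(None, [("m", "1"), ("c", "2"), ("x", "3")])
snapshot = v1                        # a reader pins the current root
v2 = apply_batch(v1, [("c", "20"), ("a", "0")])

print(lookup(snapshot, "c"), lookup(v2, "c"))
```

Because nodes are never modified in place, the reader holding `snapshot` needs no lock, and the untouched right subtree is shared between both versions rather than rewritten.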

user

6 hours ago

[deleted]