hackernews client

Show HN: EloqKV – Scalable distributed ACID key-value database with Redis API

44 pointsposted 9 months ago

32 Comments

apavlo

9 months ago

> EloqKV brings significant innovations to database design

What is the novel part? I read your "Introduction to Data Substrate" blog article and the architecture you are describing sounds like NuoDB from the early 2010s. The only difference is that NuoDB scales out the in-memory cache by adding more of what they call "Transaction Engine" nodes whereas you are scaling up the "TxMap" node?

See also Viktor Leis' CIDR 2023 paper with the Great Phil Bernstein:

* https://www.cidrdb.org/cidr2023/papers/p50-ziegler.pdf

* https://youtu.be/tiMvcqIfWyA

eloqdata

9 months ago

Thank you, Andy! This is Jeff, CEO and Chief Architect of EloqData. It's a great honor for us to have THE Andy Pavlo join the discussion on our first HN submission.

If I remember correctly, NuoDB uses a shared cache with a cache coherence protocol, whereas EloqKV uses a shared nothing (partitioned) cache. The former is a local read but needs to broadcast each write to all nodes. The latter has no broadcast for writes but may be a remote read. The tradeoff is evident and we are actively exploring opportunities to strike a balance, e.g., for frequently-read, rarely-write data items, use the shared cache mode.

We appreciate you pointing us to the CIDR paper. I had the pleasure of working with Phil for some time and fondly remember many discussions with Phil on various topics many years ago. To address your question, yes, we've been trying to solve the research challenges presented in the CIDR paper. The devil is in the details. We've developed numerous new algorithms and invested significant engineering effort into the design and implementation of our products. The benefits are as follows:

- Optimality: We believe we have an overall design that optimizes synchronous disk writes and network round-trips. For instance, when the design is reduced to a single node, its performance matches or exceeds that of single-node servers. As you might expect, a lot of innovation has gone into making distributed transactions as efficient as non-distributed ones, comparable to those in MySQL or PostgreSQL.

- Modularity: Our architecture allows us to easily replace the Parser/Compute layer and Storage/Persistence layer with the best existing solutions. This means we can create new databases by leveraging existing parsers and compute engines from current database implementations to achieve API-compatibility, as well as leveraging existing high-performance KV stores for the persistence layer. This allows us to avoid reinventing the wheels and to take advantage of decades of innovations in the database community.

- Scalability: The entire system operates without a single synchronization point—not even a global sequencer. We drew many inspirations from the Hekaton and your TicToc paper. All four types of resources (CPU, Memory, Storage, Logging) can be scaled independently, as we mentioned earlier. More importantly, they can scale dynamically to accommodate workload changes without service disruptions.

We look forward to sharing more technical details as we move out of stealth mode. I hope to continue this conversation with you in person in the near future.

dbms1

9 months ago

That is really interesting. It indeed reminds me a bit on this research project https://github.com/DataManagementLab/ScaleStore (the one Andy posted the analysis)

fizx

9 months ago

Redis's transaction api is terrible, and doesn't work across shards. Any reasonable transactions are done in Lua, and because Lua mostly works well, there's not a lot of pressure to fix transactions.

However, if you're giving redis access to different tenants, Lua is too dangerous.

I'd love to see a "real" transaction API for Redis.

eloqdata

9 months ago

Thank you for the insightful reply. I agree, Redis's transaction API ("MULTI/EXEC/DISCARD/WATCH") is challenging compared to SQL's familiar "START/COMMIT/ROLLBACK," and it has key limitations:

1. No rollback on failure: If a command in EXEC fails, Redis can't roll back the transaction. EloqKV resolves this by starting real transactions on the TxServer node during EXEC. If any command fails, the changes are rolled back, maintaining atomicity. Additionally, EloqKV allows specifying isolation levels when using Redis transactions.

2. Lack of interactivity: This is where Lua shines. Users can embed business logic in Lua, functioning similarly to stored procedures in SQL.

3. No cross-shard transactions: Redis clusters can't redirect requests across shards, forcing clients to manage topology awareness. EloqKV addresses this with a fully distributed transactional design, eliminating these cross-shard complexities. For more on this, see our blog.

https://www.eloqdata.com/blog/2024/08/22/benchmark-cluster#w...

As for your point on Lua's risks in multi-tenant environments, I completely agree. Lua lacks robust ACLs or resource limitations, and managing SHA keys is cumbersome. We're considering enhancements to address these issues in the future. Finally, we are indeed working on a "real" transaction API for Redis. Like SQL, users will be able to begin transactions, read keys, apply transformations, and generate new keys, with the ability to commit or abort. EloqKV will also support configurable isolation levels and transaction protocols (OCC/Locking).

Do you have any additional API preferences beyond "START/COMMIT/ROLLBACK" for Redis transactions?

namibj

9 months ago

A way to tie transaction success to another system's transaction success is a powerful ability to have. Think Kafka watermarks and just general 2-phase commit transaction systems.

fizx

9 months ago

My dream here is probably webassembly transactions with bounded resource consumption

netpaladinx

9 months ago

Talking about transactions in Redis, one area came on top of my head is metadata in file systems. I've seen colleagues/collaborators run large-scale training on a distributed file system w/ a billion files, which puts a lot of pressure on the metadata part. They tried a few options and Redis was one of them. It's fast and Lua is good enough to support metadata ops. But the thing is that it cannot scale (or Lua is gone) and may lose data from time to time, which is annoying. It looks like this durable-transactional combination may fit in. Will wait to see how this is unfolded.

the_precipitate

9 months ago

This is really interesting! The performance looks impressive. I’ve always struggled with the Redis/MySQL two-tier architecture because they are two completely different systems. Porting an application is always tricky due to API incompatibility, and maintaining consistency between the two is a huge hassle. If there’s any issue on the cache side (like performance jitters or a few servers going down), the DBMS part can quickly get overwhelmed. This kind of cascading failure is a common problem in distributed systems. There are some great discussions on this topic over at https://danluu.com/cache-incidents/. BTW, although the formatting is a bit rough, it's a great read for anyone interested in caching. Back to the topic, I always hoped that new advancements in database systems can alleviate or eliminate this problem, your work seems to be on the right track.

hubertzhang

9 months ago

Thank you for your interest. I agree—if Redis (as a cache) and MySQL (as a store) are deployed separately, we can only achieve eventual consistency. That’s precisely why we’re integrating cache and storage into a single system, much like EloqKV.

jacobn

9 months ago

Jepsen test?

hubertzhang

9 months ago

Yes, this is a top priority in our mind, and we aim to pass the Jepsen test before official GA release.

hodr_super

9 months ago

It's interesting that you have a Cassandra version. I usually think of Cassandra as a standalone database. Why would you layer one database on top of another?

hubertzhang

9 months ago

Good question! As explained in the linked article, our core technology, Data Substrate, is modular by design and requires a storage layer to handle the actual data (more precisely store the checkpoints). This storage layer can be any data store with a Key-Value interface: it doesn't need to meet the consistency and isolation requirements of a full database.

We chose RocksDB and Cassandra to showcase two different approaches. RocksDB offers efficient local storage, but if there's a disk hardware failure, the data can be lost. Cassandra, on the other hand, ensures high availability by replicating data, making it resilient to disk failures. We also support cloud storage options like DynamoDB and BigTable in our cloud offerings.

henning

9 months ago

Show us the Jepsen tests. Hire Aphyr to test your database.

hubertzhang

9 months ago

Yes, this is a top priority for us, and we aim to pass the Jepsen test before the official GA release.

hodr_super

9 months ago

It does seem that there are a lot of activities to replace/enhance/improve Redis lately. Dragonfly, Garnet from Microsoft, and the BSD licensed Redis fork Valkey, to name just a few. Now this.

hubertzhang

9 months ago

Yeah, it's just a happy coincidence that our product release lined up with the Redis license change. We think EloqKV is a way more powerful and versatile product than Redis. It can definitely be used as a simple in-memory cache, just like Redis, Valkey, or Dragonfly. But the real strength of EloqKV is in its full database capabilities: scalability, consistency, and high availability, all the things you'd expect from a modern database.

PeterZaitsev

9 months ago

What is license ? I see Download but it is not clear if it is Open Source

hubertzhang

9 months ago

We are not yet open sourcing the code on our github site yet. The current release is a prebuilt binary, user can do whatever you want with the software (there is a license file in the downloaded tarball), EloqKV is not yet production ready, so please don't use it in a production environment.

We are a small team of developers, and currently EloqKV is still under heavy development. We would like to maintain agility and quickly iterate and improve our product with consistency with our customer's feedback. However, we are actively evaluating multiple paths to open sourcing our technology and any suggestions and concerns will certainly be kept in our mind as we progress.

the_precipitate

9 months ago

Playing devil’s advocate here, I do have some doubts about the future dynamics between open source and database systems. Andy Pavlo mentioned in his CMU DB course that nowadays new database companies are either fully closed-source or only open a small portion of non-essential code.

If you think about it, databases are arguably a more critical piece of IT infrastructure than cybersecurity products. Yet, while cybersecurity companies are thriving with many company valuations exceeding $10 billion, database companies—especially those embracing open source—struggle to achieve similar commercial success. The few database companies that do reach high valuations are typically based on closed-source products, while those adopting open-source models often face significant challenges in becoming profitable.

It's also worth noting that many companies in the database space have recently changed their licenses around open source or source availability. If major players like MongoDB, ElasticSearch, and Redis are all making these tough decisions to build sustainable businesses, it might not be fair to simply blame corporate greed. Without adequate returns on investment, venture funding for database development could dry up, which would ultimately stifle innovation in this crucial sector.

hodr_super

9 months ago

If it's not open-sourced, I’m not going to use it to store my critical data. Just saying...

hubertzhang

9 months ago

Note that this binary release is a preview of EloqKV. We're actively developing a cloud-native version with seamless scalability and a serverless experience. Stay tuned—it's coming soon!

gregwebs

9 months ago

Is the WAL 1:1 with a TxMap instance? Is the TxMap 1:1 with a node? For a distributed transaction with distributed storage and multiple nodes how does a transaction get coordinated?

hubertzhang

9 months ago

Good question! WAL and TxService(TxMap) are fully decoupled, allowing you to deploy them on the same node or across different machines. If your workload is write-intensive, you might consider a large WAL cluster, which leverages multiple disks to perform fsync operations in parallel. This is particularly important in cloud environments, where local SSDs are ephemeral and cloud-native databases often use EBS for WAL. However, high-performance EBS options like io2 can be costly. EloqKV's decoupled architecture allows you to scale write throughput up to 600K Write OPS on a single TxServer as you increase the number of cheap gp3 disks. Since WAL logs can be truncated after a checkpoint, the required disk size is typically quite small, making gp3 a cost-effective choice.

For more details, please check our scaling disks benchmark report.

https://www.eloqdata.com/blog/2024/08/25/benchmark-txlog#exp...

> For a distributed transaction with distributed storage and multiple nodes how does a transaction get coordinated?

EloqKV is a multi-writer system, similar to many distributed databases (FoundationDB, TiDB, CockroachDB), but we have a set of new innovations on transaction protocols, for example, the entire system operates without a single synchronization point, not even a global sequencer. We drew many inspirations from the Hekaton and the TicToc paper.

AtlasBarfed

9 months ago

Yet another distributed database announcement with nary a mention of CAP tradeoffs. Then again, no sales pitch starts with the limitations. Also, one man's ACID isn't another person's. At least they aren't advertising SQL joins with handwaves to the distributed complexities.

It superficially sounds like a series of server processes fronting actual database servers, which sounds like another layer of partition vulnerability and points of failure. But I also had similar high level concerns about the complexity of FoundationDB, people seem satisfied as to the validity of that architecture.

I fail to see how one would scale underlying resources if the persistence is done by various storage systems, and you'd be subject to limitations of those persistence engines. That sounds like a suspect claim, like the "Data Substrate" is a big distributed cache that scales in front of delayed persistence to an actual database. Again, sounds like oodles of opportunities for failure.

"Data substrate draws inspirations from the canonical design of single-node relational database management systems (RDBMS)." Look, I don't think a good distributed database starts from a CA system and bolts on partition tolerance. I get you can get far with big nodes, fast networks, and sharding but ... I mean, they talk about B+ trees but a foundational aspect of Cassandra and other distributed databases is that B+ trees don't scale past a certain point, because there exist trees so deep/tall that even modern hardware chokes on it, and updating the B+ tree with new data gets harder as it gets bigger.

As others have said, I'll leave it to Aphyr to possibly explain.

eloqdata

9 months ago

You're absolutely right—discussing the CAP theorem is essential for any distributed storage system. We are preparing a detailed blog post to cover the implications of CAP in our system. In brief, our system generally aligns with PC/EC in PACELC terms (https://en.wikipedia.org/wiki/PACELC_theorem), similar to other fully distributed databases like CockroachDB and TiDB. However, this configuration is flexible. For instance, in high-performance cache applications where durability is less critical, our system can be adjusted to favor PA/EL, akin to many NoSQL systems.

Regarding persistence, our Substrate component manages its own logging system while relies on external storage engines for checkpointing. This means that the system only depends on the performance and capacity of the external storage. Our database does not depend on the external storage engine for consistency. For example, DynamoDB offers virtually unlimited capacity and performance (in terms of QPS) in the cloud, and our Data Substrate is agnostic to whether the data is stored as a B-Tree or an LSM tree.

We are currently preparing the software for Jepsen testing prior to its General Availability (GA) release. More technical details will be shared as we move forward, so please stay tuned.

hubertzhang

9 months ago

> It superficially sounds like a series of server processes fronting actual database servers, which sounds like another layer of partition vulnerability and points of failure

I agree that introducing a middleware layer on top of databases adds more points of failure. EloqKV avoids this by integrating Compute, Logging, and Storage into a single system. In this setup, storage can be a database like Cassandra, but users will not access it directly; all requests go through EloqKV, which manages ACID transactions. EloqKV is responsible for handling crashes of the TxServer, LogServer, and Storage. You can think of the distributed Storage Engine just as a Disk in traditional DBMS. Obviously its failure will affect the system. But no more than hard disk failure. In fact, in a cloud, all disks (i.e. EBS) are actually distributed storage systems.

This situation is a rethinking of the Redis and MySQL combination, which suffers from similar issues. Both systems can fail independently, resulting in only eventual consistency. EloqKV aims to resolve this problem.