K8s with 1M nodes

284 points, posted 4 months ago

(bchess.github.io)

84 Comments

turbobrew

4 months ago

I feel like etcd is one of the few use cases where Intel Optane would actually make sense. I build and run several bare metal clusters with over 10k nodes and etcd is by and large the biggest pain for us. Sometimes an etcd node just randomly stops accepting any proposals which halts the entire cluster until you can remove the bad etcd node.

From what I remember, GKE has implemented an etcd shim on top of spanner as a way to get around the scalability issues, but unfortunately for the rest of us who do not have spanner there aren’t any great options.

I feel like at a fundamental level that pod affinity, antiaffinity, and topology spreads are not compatible with very large clusters due to the complexity explosion in large clusters.

Another thing to consider is that the larger a cluster becomes, the larger the blast radius is. I have had clusters of 10k nodes spectacularly fail due to code bugs within k8s. Sharding total compute capacity into multiple isolated k8s clusters reduces the likelihood that a software bug is going to take down everything, as you can carefully upgrade only a single cell at a time with bake periods between each cell.

jen20

4 months ago

AFAIK all the hyperscalers have replaced etcd for their managed Kubernetes services [1], [2], [3] - though Azure is the least clear about what they actually use currently.

[1]: https://aws.amazon.com/blogs/containers/under-the-hood-amazo...

[2]: https://cloud.google.com/blog/products/containers-kubernetes...

[3]: https://azure.microsoft.com/en-us/blog/a-cosmonaut-s-guide-t...

jbnorth

4 months ago

While I can’t speak for the others, AWS doesn’t replace all of etcd. Only the Raft consensus layer is replaced, with Journal, which is an internal AWS service.

fowl2

4 months ago

Interestingly, the public offering of Azure’s etcd-compatible service was withdrawn before exiting preview.

[1] https://learn.microsoft.com/en-us/answers/questions/154061/a...

jen20

4 months ago

It's interesting they ever exposed it at all really! I don't think you can use Google's Spanner-based etcd replacement for a self-managed Kubernetes cluster, for example.

femiagbabiaka

4 months ago

Yep, every cluster approaching 10k I know of has either pared back etcd's durability guarantees or rewritten and replaced it in some manner. Actually the post goes into detail about doing this exactly, the Alibaba paper they reference says about the same.

> Sharding total compute capacity into multiple isolated k8s clusters reduces the likelihood that a software bug is going to take down everything as you can carefully upgrade only a single cell at a time with bake periods between each cell.

Yeah, I've been meaning to try out something like Armada to simplify things on the cluster-user side. Cluster-providers have lots of tools to make managing multiple clusters easier but if it means having to rewrite every batch job..

jamesblonde

4 months ago

Is it throughput and latency that are the etcd bottlenecks? Our database, RonDB, is an in-memory open-source database (a fork of MySQL Cluster). We have scaled it to 100m reads/sec on AWS hardware (not even top of the line). Might be an interesting project to implement an open-source etcd shim on top of it?

Reference: https://www.rondb.com/post/100m-key-lookups-sec-with-rest-ap...

davidgl

4 months ago

See https://github.com/k3s-io/kine, k3s uses this to shim etcd to MySQL, Postgres and sqlite

nonameiguess

4 months ago

The setting is configurable, but by default, etcd's Raft implementation requires a voting node to write to disk before it makes a vote, as in actually flushing to disk, not just writing to the file cache. Since you need a majority vote before a client can get a response, it's strongly recommended that you use the fastest possible disks and keep the nodes geographically close to each other, and it's also why etcd's default storage limit is only 2GB per node.
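To make the "flushing to disk, not just writing to the file cache" distinction concrete, here is a minimal Python sketch. It is not etcd's code (etcd is written in Go), and the helper names `durable_append` and `cached_append` are made up for illustration.

```python
import os

def durable_append(path: str, record: bytes) -> None:
    """Append a record and block until the OS reports it flushed to the device."""
    with open(path, "ab") as f:
        f.write(record)        # lands in Python's userspace buffer
        f.flush()              # pushes it into the kernel page cache
        os.fsync(f.fileno())   # asks the kernel to flush the file to storage

def cached_append(path: str, record: bytes) -> None:
    """Append without fsync: returns sooner, but a power loss may drop the record."""
    with open(path, "ab") as f:
        f.write(record)        # closing the file flushes to the page cache only
```

Both functions leave the same bytes in the file when nothing goes wrong. The difference is only what is guaranteed to survive a crash, and how long the caller waits for that guarantee.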

All in all, it was a poor choice for Kubernetes to use this as its backend in the first place. Apparently, Google uses its own shim, but there is also kine, which was created a long time ago for k3s and allows you to use an RDBMS. k3s used sqlite as its default originally, but any API equivalent database would work.

We should keep in mind etcd was meant to literally be the distributed /etc directory for CoreOS, something you would read from often but perform very few writes to. It's a configuration store. Kubernetes deciding to also use it for /var was never a great idea.

jamesblonde

4 months ago

RonDB uses a non-blocking 2PC algorithm - commits in memory, and then does a group commit of transactions to disk every 500ms. This means it can handle insane write throughput, as well as read throughput. However, if both your DB nodes fail, you could lose 500ms of data - which is not the end of the world for k8s. Normally, you would locate DB nodes in different AZes, reducing the probability of correlated failures.
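A rough sketch of the group-commit idea, under simplifying assumptions: this is not RonDB code, the class `GroupCommitLog` is hypothetical, the "disk" is just a list, and the flush is triggered by the next commit rather than by a background timer. Timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class GroupCommitLog:
    """Acknowledge writes from memory; persist them in batches."""

    def __init__(self, flush_interval_s: float = 0.5, now: float = 0.0):
        self.flush_interval_s = flush_interval_s
        self.pending = []       # acknowledged, but only in memory
        self.durable = []       # stand-in for the on-disk log
        self.last_flush = now

    def commit(self, txn, now: float) -> None:
        # The caller gets its acknowledgement as soon as this returns.
        self.pending.append(txn)
        if now - self.last_flush >= self.flush_interval_s:
            self.flush(now)

    def flush(self, now: float) -> None:
        # One disk write covers every transaction since the last flush.
        self.durable.extend(self.pending)
        self.pending.clear()
        self.last_flush = now

    def crash(self) -> list:
        # Whatever was still pending is lost: at most one interval's worth.
        lost = list(self.pending)
        self.pending.clear()
        return lost
```

The trade-off is visible in `crash()`: throughput improves because many commits share one disk write, and the price is a bounded window of acknowledged but unpersisted transactions if every in-memory replica is lost at once.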

turbobrew

4 months ago

At that point it is apples to oranges. One of the main reasons why etcd writes are slow is because they are guaranteed to be durably persisted across the quorum.

If you just turned off file system syncs in etcd you could probably get an order of magnitude better performance as well.

torginus

4 months ago

It's nice to know that the upper bound of the resiliency of a k8s cluster is the amount of redundancy etcd has - which is in essence a horizontally scaled monolith.

jeffinhat

4 months ago

This is an awesome experiment and write up. I really appreciate the reproducibility.

I would like to see how moving to a database that scales write throughput with replicas would behave, namely FoundationDB. I think this will require more than an intermediary like kine to be efficient, as the author illustrates the apiserver does a fair bit of its own watching and keeping state. I also think there's benefit, at least for blast radius, to shard the server by api group or namespace.

I think years ago this would have been a non starter with the community, but given AWS has replaced etcd (or at least aspects) with their internal log service for their large cluster offering, I bet there's some appetite for making this interchangeable and bringing an open source solution to market.

I share the author's viewpoint that for modern cloud based deployments, you're probably best avoiding it and relying on VMs being stable and recoverable. I think reliability does matter if you want to actually realize the "borg" value and run it on bare metal across a serious fleet. I haven't found the business justification to work on that though!

ymelnyk

4 months ago

Here you go, Kine FoundationDB backend https://github.com/melgenek/f8n

To be honest, I was building it with the purpose of matching the Etcd scale, but making foundationdb a multitenant data store.

But with the recent craze of scalability, I'll be investing time into understanding how far foundationdb can be pushed as a K8s data store. Stay tuned.

jeffinhat

4 months ago

Awesome, I will!

It would be great to see where the limits are with this approach.

I think at some point, you need to go deeper into the apiserver for scale than an API compatible shim, but this is just conjecture and not real data.

pluc

4 months ago

> Early on in this project, I asked ChatGPT “I want to scale Kubernetes to 1 million nodes. What types of problems would I need to overcome?”

click

breakingcups

4 months ago

It's a shame the author led with something that carries about the same authority as a horoscope, since the rest of the article is actually quite interesting.

tbrockman

4 months ago

If you have sufficient knowledge in the subject matter you're questioning ChatGPT about, you can fairly reliably discern complete bullshit from something plausibly true that warrants additional investigation (which I'd say is more useful than your typical horoscope). In isolation it seems worth the gamble to me, so long as you don't view it as much more than consulting the tea leaves.

pluc

4 months ago

You could also spend 5 minutes thinking instead of using a 10th of the Texas power grid to do it for you. Remember when we used to not know shit, and it would stay with you for days - before Google happened? The same is about to happen with mental acuity and AI. Use it or lose it - and I hope every fanboys' brain turns to mush before they witness irreversible AI-caused climate catastrophes solely to be able to speak natural english to a search engine. If you outsource your thinking process, don't come bitching when it's gone.

tbrockman

4 months ago

Yeah, I’m actually fairly sympathetic to your perspective, that’s why I said “in isolation”. I might think it’s possible to balance use in a way that’s not as detrimental to myself personally or others (self-hosted models, judicious use, etc.), but I definitely don’t disagree that it seems likely to do more harm than good currently.

wppick

4 months ago

If you don't need the isolation of k8s then don't forget about erlang, which is another option to scale up to 1 million functions. Obviously k8s containers (which are fundamentally just isolated processes) and erlang processes are not interchangeable things, but when thinking about needing in the order of millions of processes erlang is pretty good prior art

theptip

4 months ago

This is 1m nodes, you typically run tens or hundreds of pods per node, each with one or more containers. So more like 100m+ functions if I follow the Erlang analogy correctly?

reactordev

4 months ago

This is not analogous. It’s just someone beating the Erlang drum. You can’t PyTorch in Erlang.

wppick

4 months ago

Kubernetes is way heavier than Erlang’s lightweight processes, so for millions of tasks at scale, a middle-ground solution could blend Erlang’s concurrency efficiency with k8s’ orchestration power, dodging containers’ overhead while keeping flexibility for diverse workloads. That's if you don't actually need the strict isolation of pods/containers and you're just trying to run something at massive scale. I don't get why so many people want to run everything as heavy container processes or pods vs coming up with a better solution. The point is we don't have to fit every problem into the shoe called kubernetes if it doesn't seem to fit, and we should look at other ways to spin up millions of processes

victorbjorklund

4 months ago

There are similar libraries in Elixir. Is the ecosystem for ML as developed as for python? Nope, but not every ML project needs the most obscure libraries etc.

(For the record I don't really see Erlang clusters as a replacement for k8s)

reactordev

4 months ago

You aren’t going to teach a research scientist Erlang in spite of Python and R. You aren’t going to win that fight, ever.

victorbjorklund

4 months ago

Oh totally, that is not anyone's goal I think. The goal is more about enabling teams already running Elixir to also do their ML stuff in Elixir or teams that need to do something that would benefit from the BEAM actor model but also need to do some AI stuff to do it all in Elixir. Lots of neural networks created in Python or R can then run in Elixir.

fcarraldo

4 months ago

You can with Pyrlang or whatever other cursed implementation of Python on top of the Erlang VM you’d prefer.

fcarraldo

4 months ago

I don’t think there are very many k8s clusters running 100s of pods per node. The default maximum is 110. You can, of course, scale beyond this, but you’ll run into etcd performance issues, IP space issues, max connection, IOPS and networking limitations for most use cases.

At 1M nodes I’d still expect an average of a dozen or so pods per node.
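The IP space point can be checked with some quick arithmetic. The sketch below assumes each node is handed its own IPv4 /24 for pods, which is a common allocation size but is an assumption here, not something stated in the thread or the article.

```python
import ipaddress

NODES = 1_000_000
MAX_PODS_PER_NODE = 110    # default maximum quoted above
NODE_CIDR_PREFIX = 24      # assumption: one IPv4 /24 of pod addresses per node

addresses_per_node = 2 ** (32 - NODE_CIDR_PREFIX)
addresses_needed = NODES * addresses_per_node
private_range = ipaddress.ip_network("10.0.0.0/8").num_addresses

# How many nodes fit if all of 10.0.0.0/8 is carved into per-node /24s.
nodes_that_fit = private_range // addresses_per_node

# Upper bound on pods if every node ran at the default maximum.
max_pods = NODES * MAX_PODS_PER_NODE

print(addresses_needed, private_range, nodes_that_fit, max_pods)
```

Under that assumption the whole 10.0.0.0/8 range covers only a small fraction of a million nodes, so a cluster of this size would need smaller per-node ranges, IPv6, or some other addressing scheme.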

sally_glance

4 months ago

Agree this is a consideration if your only workload is an existing or greenfield ErlangVM-compatible project.

From what I know basically everyone approaching this scale with k8s has different problems to solve, namely multi-tenancy (shared hosting/internal platform providers) and compatibility with legacy or standard software.

czhu12

4 months ago

If anyone is looking for a gentler, Heroku-like onramp to Kubernetes, it's exactly why I built Canine [1].

In retrospect, at my previous company, what we really needed in the early days was something that was Heroku-like (don't make me think about infra (!)) but could be easily added to and scaled up over time, as our service grew. We eventually grew to about 10M users, using the site monthly, and had to do a huge effort to migrate to Kubernetes.

Canine's philosophy is: full Kubernetes, with a deployment layer on top. If you ever outgrow it, just dump Canine entirely, and work directly with the Kubernetes system it's operating. It even gives you all the K8s YAML config needed to offboard.

It's also similar to how the dev infra works at Airbnb (where I worked before that) -- Kubernetes underneath, a user friendly interface on top.

rootlocus

4 months ago

You're not the first one to think that k9 is a good name for a kubernetes related technology https://k9scli.io

kawsper

4 months ago

That's really impressive and an interesting experiment.

I was about to say that Nomad did something similar, but that was 2 million Docker containers across 6100 nodes, https://www.hashicorp.com/en/c2m

vebgen

4 months ago

This is an absolutely incredible technical deep-dive. The section on replacing etcd with mem_etcd resonates with challenges we've been tackling at a much smaller scale building an AI agent system.

A few thoughts:

*On watch streams and caching*: Your observation about the B-Tree vs hashmap cache tradeoff is fascinating. We hit similar contention issues with our agent's context manager - switched from a simple dict to a more complex indexed structure for faster "list all relevant context" queries, but update performance suffered. The lesson about O(1) writes vs O(log n) reads being the wrong tradeoff for high-write workloads is universal.

*On optimistic concurrency for scheduling*: The scatter-gather scheduler design is elegant. We use a similar pattern for our dual-agent system (TARS planner + CASE executor) where both agents operate semi-independently but need coordination. Your point about "presuming no conflicts, but handling them when they occur" is exactly what we learned - pessimistic locking kills throughput far worse than occasional retries.

*The spicy take on durability*: "Most clusters don't need etcd's reliability" is provocative but I suspect correct for many use cases. For our Django development agent, we keep execution history in SQLite with WAL mode (no fsync), betting that if the host crashes, we'd rather rebuild from Git than wait on every write. Similar philosophy.
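The optimistic-concurrency point above ("presuming no conflicts, but handling them when they occur") can be sketched as a compare-and-swap retry loop. This is an illustration only, not the article's scheduler: `VersionedStore`, `Conflict`, and `update_with_retry` are hypothetical names, and the version counter merely plays the role that a resource version plays in the Kubernetes API.

```python
class Conflict(Exception):
    """Raised when a write was based on a stale version."""

class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def get(self, key):
        return self._data.get(key, (0, None))

    def put(self, key, value, expected_version):
        current_version, _ = self.get(key)
        if current_version != expected_version:
            raise Conflict(key)  # someone else wrote since we read
        self._data[key] = (current_version + 1, value)
        return current_version + 1

def update_with_retry(store, key, mutate, max_attempts=5):
    """Read, compute, write; on conflict re-read and try again. No locks held."""
    for _ in range(max_attempts):
        version, value = store.get(key)
        try:
            return store.put(key, mutate(value), version)
        except Conflict:
            continue
    raise Conflict(key)
```

When conflicts are rare, almost every call succeeds on the first attempt and nobody waits on a lock. The retry cost is only paid by the writer that actually lost a race.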

The mem_etcd implementation in Rust is particularly interesting - curious if you considered using FoundationDB's storage engine or something similar vs rolling your own? The per-prefix file approach is clever for reducing write amplification.

Fantastic work - this kind of empirical systems research is exactly what the community needs more of. The "what are the REAL limits" approach vs "conventional wisdom says X" is refreshing.

up2isomorphism

4 months ago

“Perhaps my spiciest take from this entire project: most clusters don’t actually need the level of reliability and durability that etcd provides.”

This assumption is completely out of touch, and is especially funny when the goal is to build an extra large cluster.

itsnowandnever

4 months ago

etcd is also the entire point of k8s. that it's a single self-contained framework and doesn't require an external backer service. there is no kubernetes without etcd. much of the "secret sauce" of kubernetes is the "watch etcd" logic that "watches" desired state and does the cybernetic loop to bring the observed state in line with the desired state.
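For readers who haven't seen that loop spelled out, a toy version looks roughly like this. It is a sketch with plain dictionaries standing in for the stored desired state and the observed cluster state; `reconcile` and `apply` are hypothetical helpers, not Kubernetes controller code.

```python
def reconcile(desired: dict, observed: dict) -> list:
    """Return the actions needed to make `observed` match `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in observed:
            actions.append(("create", name))
        elif observed[name] != spec:
            actions.append(("update", name))
    for name in observed:
        if name not in desired:
            actions.append(("delete", name))
    return actions

def apply(actions: list, desired: dict, observed: dict) -> None:
    """Carry out the actions against the observed state (mutates it in place)."""
    for verb, name in actions:
        if verb == "delete":
            del observed[name]
        else:
            observed[name] = desired[name]

desired = {"web": {"replicas": 3}, "db": {"replicas": 1}}
observed = {"db": {"replicas": 2}, "old-job": {"replicas": 1}}

actions = reconcile(desired, observed)
apply(actions, desired, observed)
```

After one pass `observed` equals `desired`, and a second `reconcile` call returns an empty list. Nothing in the loop depends on which store holds the desired state, only on being able to read it and be told when it changes.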

trenchpilgrim

4 months ago

The API and controller loops are the point of k8s. etcd is an implementation detail and lots of clusters swap it out for something else like sqlite. I'm pretty sure that GCP and Azure are using Spanner or Cosmos instead of etcd for their managed offerings.

alphabettsy

4 months ago

Yep. K3s can use SQLite or Postgres.

user

4 months ago

[deleted]

itsnowandnever

4 months ago

not exactly a fair assessment since neither of those were out and/or available to the kubernetes team at the time. sure, some things at many times from now into eternity may be or become better suited for the kubernetes data plane but at the time if etcd wasn't used there would be no kubernetes today

trenchpilgrim

4 months ago

The Kubernetes team chose etcd specifically because they were trying to replace Borg's master/slave database at Google. Nothing about Kubernetes requires etcd; the team was trying to solve a Google-internal problem with it (and in the end, didn't gain traction within Google.) k3s uses sqlite by default which was an option at the time, other clusters today use PostgreSQL.

Have you looked at the etcd keys and values in a Kubernetes cluster? It's a _remarkably_ simple schema you could do in pretty much any database with fast prefix or path scans.
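As a rough illustration of how little is needed, here is a prefix scan over path-like keys using SQLite from Python. The key names follow the `/registry/<resource>/<namespace>/<name>` shape, but the rows and JSON values are placeholders invented for the example, and `make_store` and `prefix_scan` are hypothetical helpers.

```python
import sqlite3

def make_store(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)")
    conn.executemany("INSERT INTO kv (key, value) VALUES (?, ?)", rows)
    return conn

def prefix_scan(conn, prefix):
    # Smallest string sorting after every key that starts with `prefix`
    # (valid for the ASCII keys used here); makes this a primary-key range scan.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    cur = conn.execute(
        "SELECT key, value FROM kv WHERE key >= ? AND key < ? ORDER BY key",
        (prefix, upper),
    )
    return cur.fetchall()

store = make_store([
    ("/registry/pods/default/nginx", '{"kind": "Pod"}'),
    ("/registry/pods/kube-system/coredns", '{"kind": "Pod"}'),
    ("/registry/podtemplates/default/base", '{"kind": "PodTemplate"}'),
    ("/registry/services/specs/default/web", '{"kind": "Service"}'),
])

pods = prefix_scan(store, "/registry/pods/")
```

Here `pods` holds only the two pod rows. The trailing slash in the prefix is what keeps `/registry/podtemplates/...` out of the result. Listing is the easy part, though; the watch semantics discussed elsewhere in this thread are where a plain SQL backend has to work harder.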

moondev

4 months ago

Upstream kubernetes literally requires etcd. Anything that changes that is a fork

trenchpilgrim

4 months ago

There are many forms of Kubernetes that are not using etcd that are certified by the CNCF as conformant (https://www.cncf.io/training/certification/software-conforma...). The version in the kubernetes/kubernetes repository is a reference implementation, intended to be customized by the community.

user

4 months ago

[deleted]

jauntywundrkind

4 months ago

The API server is the thing. It so happens that the API server can mostly be a thin shell over etcd. But etcd itself while so common is not sacrosanct.

https://github.com/k3s-io/kine is a reasonably adequate substitute for etcd. sqlite, MySQL, PostgreSQL can also be substituted in. Etcd is from the ground up built to be more scale-out reliable, and that rocks to have baked in. But given how easy it is to substitute etcd out, I feel like we are at least a little off if we're trying to say "etcd is also the entire point of k8s" (the APIserver is)

dmlittle

4 months ago

It's been a while since I've checked this but a few years ago we tried to limit test kine on a large-ish cluster and it performed pretty poorly. It's fine for small clusters but the way they have to implement the watch semantics makes it perform poorly (at least this was the case a few years ago).

jauntywundrkind

4 months ago

Agreed. Subscriptions really are a huge huge part of the magic, and they're a weak point of Kine. Thanks for chiming in.

Ideally, I'd love to see a database specific offering. Use postgres async replication (ideally somehow sharded so there's not a single consumer node) to some fan out system that's doing all the watching.

But etcd mostly does the job, seems unlikely to be going anywhere. It'd be cool though.

itsnowandnever

4 months ago

that's fair but that 99% of all apiserver deployments in the world have the same standard boilerplate footprint is a large part of why it became so ubiquitous. that people running it locally don't have to make any decisions about how to deploy which database or why to use this one over that one... and that's also the same situation in production so people doing stuff in dev aren't punched in the face by an exponentially more complex system in production is huge.

user

4 months ago

[deleted]

geoctl

4 months ago

Is it? I honestly kinda believe that etcd is probably the weakest point in vanilla k8s. It is simply unsuitable for heavy write environments and causes lots of consistency problems under heavy write loads, it's generally slow, it has value size constraints, it offers very primitive querying, etc... Why not replace etcd altogether with something like Postgres + Redis/NATS?

itsnowandnever

4 months ago

that touches on what I consider the dichotomy of k8s: it's a really scalable system that makes it easy to spin up a cluster locally on your laptop and interact with the full API locally just like in prod. so it's a super scalable system with a dense array of features. but paradoxically most shops won't need the vast majority of k8s features ever and by the time they scale to where they do need a ton of distributed init features they're extremely close to the point where they'd be better served by a bespoke system conceived from scratch in house that solves problems very specific to the business in question. if you have many thousands of k8s nodes, you're probably in the gray area of if using k8s is worth it because the loop of k8s will never be as fast as a centralized push control plane vs the k8s pull/watch control plane. and naturally at scale that problem will only compound

pas

4 months ago

but it's also standard, you can hire for it, outsource it, etc.

and it's pretty modular too, so it can even serve as the host for the bespoke whatever that's needed

though I remember reading the fly.io blog post about their custom scheduler/allocator which illustrates nicely how much of a difference a custom in-house solution makes if it works well

trenchpilgrim

4 months ago

The other draw: Because k8s is open, you can easily hire employees, contractors, consultants and vendors and have them immediately solve problems within the k8s ecosystem. If you run a bespoke system, you have to train engineers on the system before they can make large contributions.

varispeed

4 months ago

> Why not replace etcd altogether with something like Postgres + Redis/NATS?

Holy Raft protocol is the blockchain of cloud.

trenchpilgrim

4 months ago

You can do leader election without etcd. The thing etcd buys you is you can have clusters of 3, 5, 7 or 9 DB nodes and lose up to 1, 2, 3, or 4 nodes respectively. But honestly, the vast majority of k8s users would be fine with a single SQL instance backing each k8s cluster and just running two or more k8s clusters for HA.
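Those numbers fall out of the majority rule that Raft-style consensus uses. A small sketch (the function names are just for illustration):

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1

def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)

table = {n: tolerated_failures(n) for n in (3, 5, 7, 9)}
```

`table` maps 3, 5, 7 and 9 members to 1, 2, 3 and 4 tolerated failures. An even size buys nothing extra: 4 members still tolerate only 1 failure, which is why odd sizes are the usual recommendation.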

k3s doesn't require etcd, I'm pretty sure GKE uses Spanner and Azure uses Cosmos under the hood.

cyberax

4 months ago

> etcd is also the entire point of k8s. that it's a single self-contained framework and doesn't require an external backer service. there is no kubernetes without etcd.

Sorry, this is just BS. etcd is a fifth wheel in most k8s installations. Even the largest clusters are better off with something like a large-ish instance running a regular DB for the control plane state storage.

Yes, etcd theoretically protects against any kind of node failures and network partitions. But in practice, well, nobody really cares about the control plane being resilient against meteorite strikes and Cthulhu rising from the deeps.

itsnowandnever

4 months ago

that's not my point - my point is it would not have gotten the adoption it has without etcd and the fact that it was resilient and scalable out of the box

cyberax

4 months ago

It probably would have gotten even faster adoption if it used a saner embedded DB.

kevin_nisbet

4 months ago

I'm with you, I think most people might think they don't need this reliability, until they do. I'm sure there is some subset of clusters where the claim is correct.

But the article suggests turning off fsync and expecting to only lose a few ms of updates. I've tried to recover etcd on volumes that lied about fsync and experienced a power outage, and I don't think we managed to recover it. There might be more options now to recover and ignore corrupted WAL entries, but at that time it was very difficult and I think we ended up just reinstalling from scratch. For clusters where this doesn't matter or the SLOs for recovery account for this, I'm totally onboard, but only if you know what you're doing.

And similarly, the point from the article that "full control plane data loss isn’t catastrophic in some environments" is correct, in the sense of what the author means by some environments. I don't think it's limited to those that are managed by gitops as suggested, but applies wherever there is enough resiliency and time to redeploy and do all the cleanup.

Anyways, like much advice on the internet, it's not good or bad, just highly situational, and some of the suggestions should only be applied if the implications are fully understood.

wb14123

4 months ago

Instead of giving up the good guarantee of etcd, a better approach may be grouping some nodes together to create a tree like structure with sub clusters.

AlphaSite

4 months ago

That was the whole concept behind KCP iirc. It was designed to provide tenancy atop 1 or more clusters.

rixed

4 months ago

I don't get the point of benchmarking k8s without the guarantees of etcd. At some point, you are just competing with clusterssh.

cyberax

4 months ago

How often do you have sudden host failures? Especially if you use a half-decent server with redundant components for the DB node?

Once in maybe 10 years?

dmlittle

4 months ago

The node failure rate is much higher than that. On a 1M node cluster of cloud-managed instances (AWS, GCP, Azure, etc.) you'd likely see failures a few times a month, if not more.

cyberax

4 months ago

Yep. And the chances that the DB node with the control plane fails are therefore less than one in ten thousand.
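Spelled out, the arithmetic behind that estimate looks like this. The figure of 3 failures per month is an assumption (one literal reading of "a few times a month"), and failures are assumed to be spread evenly across nodes.

```python
from fractions import Fraction

NODES = 1_000_000
FAILURES_PER_MONTH = 3   # assumption: literal reading of "a few times a month"

# Chance that one particular node (say, the DB node) is the one that fails.
per_node_per_month = Fraction(FAILURES_PER_MONTH, NODES)

# Summing over 12 months is a fine approximation while the rate is this small.
per_node_per_year = per_node_per_month * 12

one_in = 1 / per_node_per_year   # "one in N" odds per year
print(float(per_node_per_year), float(one_in))
```

Under those assumptions the yearly odds for a single given node come out below one in ten thousand. The result scales linearly with the assumed failure count, so a fleet-wide rate a few times higher would already push it past that threshold.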

rixed

4 months ago

It's not about failure, it's about servicing the host running etcd or the network around it.

rememberlenny

4 months ago

People don’t realize how crucial Ben was in the forming of OpenAI as it is known today. This is an extremely underrated post.

skeptrune

4 months ago

I read this as napkin math[1] for Kube and thoroughly enjoyed it. You can only find the important numbers relative to performance and scaling by trying to accomplish some kind of goal. Benchmarks are mostly bikeshedding.

[1]: https://sirupsen.com/napkin

throwdbaaway

4 months ago

This looks impressive. As someone who is not familiar with ML, I do have a question -- surely in 2025 there must be a way to schedule a large pytorch job across multiple k8s clusters? EKS and GKE already provide VPC native flat network by default.

sailingparrot

4 months ago

The issue isn’t so much scheduling as it is stability.

More clusters means one more layer of things that can crash your (very expensive) training.

You also then still need to write tooling to manage cross cluster trainings correctly, such as starting/stopping roughly at the same time, resuming from checkpoints, node health monitoring etc.

Nothing dealbreaking, but if it could just work in a single cluster that would be nicer.

brimstedt

4 months ago

It's an interesting and fun experiment, but what are real usecases for such a cluster?

dbattaglia

4 months ago

At my last employer Elastic we definitely ran into these limits on the cloud SaaS team moving Elasticsearch node containers from our proprietary orchestration to k8s. I’m not sure how they eventually solved it but I believe the plan was essentially sharding ES clusters to different regional k8s clusters.

adsharma

4 months ago

I saw one fleeting reference to a CNI in the article.

Anyone familiar with the space will tell you this is the biggest blocker in production.

You will have to pay for an "enterprise" CNI to make it work.

ai-christianson

4 months ago

Love this. There's no reason k8s shouldn't scale much further.

rvz

4 months ago

> Many limitations are imposed by software.

Or the amount of funding a startup has.

The bottom line is, you are not OpenAI or Google.

ktpsns

4 months ago

Typical large scale high performance computing clusters are at a size of 10k nodes (for instance Jupiter and SuperMUC in Germany) [1]. These centers are quite remarkably big buildings. I wonder how many 1M node single k8s clusters there are in the world right now. Most likely at the hyperscalers.

[1] what is a node? Typically it is a synonym for "server". In some configurations HPC schedulers allow node sharing. Then we talk about order of 100k cores to be scheduled.

stackskipton

4 months ago

I doubt any Hyperscalers are running 1M Node clusters either. They probably just have groups of clusters at each datacenter and some overall scheduler that determines which cluster is best suited for workload during deployment then connects to that cluster and schedules the workload.

user

4 months ago

[deleted]

merb

4 months ago

Some hyperscalers even have services for that, which makes it possible to have cross-cluster ingress, among other things. It also makes it possible to have multi-cluster ingress across different regions that somewhat work together.

osigurdson

4 months ago

>> [1] what is a node? Typically it is a synonym for "server". In some configurations HPC schedulers allow node sharing

I'm sure they mean actual servers / not just cores. Even in traditional HPC it isn't abstracted to the level of individual cores usually since most HPC jobs care about memory bandwidth - even with Infiniband or other techniques throughput / latency is much worse than on a single machine. Of course, multiple machines are connected (usually using MPI / Infiniband) but it is important to try to minimize communication between nodes where possible.

For AI workloads, they are running GPUs - so 10K+ cores on a single device so even less likely to be talking about cores here.

femiagbabiaka

4 months ago

This post is just about a reference architecture, but last I knew OpenAI ran VMs on Azure.

https://openai.com/index/scaling-kubernetes-to-7500-nodes/

https://github.com/bchess/k8s-1m/tree/main/mem_etcd

https://github.com/bchess/k8s-1m/blob/main/RUNNING.adoc#mem_...