Corrosion

155 points, posted 4 days ago
by cgb_

76 Comments

kflansburg

8 hours ago

> an if let expression over an RWLock assumed (reasonably, but incorrectly) in its else branch that the lock had been released. Instant and virulently contagious deadlock.

I believe this behavior is changing in the 2024 edition: https://doc.rust-lang.org/edition-guide/rust-2024/temporary-...
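For anyone who hasn't hit it, a minimal sketch of the pattern (not the actual code from the incident): on edition 2021 the scrutinee's temporary read guard lives through the `else` branch, so the `write()` there deadlocks; on edition 2024 the guard is dropped before `else` runs.

```rust
use std::sync::RwLock;

static CACHE: RwLock<Vec<String>> = RwLock::new(Vec::new());

fn get_or_insert(i: usize) -> String {
    // Edition 2021: the temporary guard returned by `CACHE.read()` lives to
    // the end of the whole `if let ... else`, so the read lock is still held
    // when the else branch tries to take the write lock -> deadlock.
    // Edition 2024: that temporary is dropped before the else branch runs.
    if let Some(v) = CACHE.read().unwrap().get(i) {
        v.clone()
    } else {
        CACHE.write().unwrap().push("default".to_string());
        "default".to_string()
    }
}

fn main() {
    println!("{}", get_or_insert(0));
}
```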

kibwen

8 hours ago

> I believe this behavior is changing

Past tense, the 2024 edition stabilized in (and has been the default edition for `cargo new` since) Rust 1.85.

kflansburg

8 hours ago

Yes, I've already performed the upgrade for my projects, but since they hit this bug, I'm guessing they haven't.

kibwen

7 hours ago

They may have upgraded by now; their source links to a thread from a year ago, prior to the 2024 edition, which may be when they encountered that particular bug.

kflansburg

7 hours ago

I see now that this incident happened in September 2024 as well.

ricardobeat

8 hours ago

> Like an unattended turkey deep frying on the patio, truly global distributed consensus promises deliciousness while yielding only immolation

Their writing is so good, always a fun and enlightening read.

natebrennand

4 hours ago

> Finally, let’s revisit that global state problem. After the contagious deadlock bug, we concluded we need to evolve past a single cluster. So we took on a project we call “regionalization”, which creates a two-level database scheme. Each region we operate in runs a Corrosion cluster with fine-grained data about every Fly Machine in the region. The global cluster then maps applications to regions, which is sufficient to make forwarding decisions at our edge proxies.

This tiered approach makes a lot of sense to mitigate the scaling limit per Corrosion node. Can you share how much data you wind up tracking in each tier in practice?

How compact is each entry in the application -> [regions] table? Does the constraint of running this on every node mean that this creates a global limit on the number of applications? It also seems like the region-level database would impose a regional limit on the number of Fly Machines too?
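To make the question concrete, here's a hypothetical sketch of what the two tiers might look like (table and column names are invented for illustration; the post doesn't spell out the actual schema):

```rust
// Hypothetical illustration of the two-level split described in the quoted
// paragraph; names are invented, not Fly.io's actual schema.
use rusqlite::{Connection, Result};

fn main() -> Result<()> {
    // Global tier: tiny rows, just enough for an edge proxy to pick a region.
    let global = Connection::open_in_memory()?;
    global.execute_batch(
        "CREATE TABLE app_regions (
            app_id TEXT NOT NULL,
            region TEXT NOT NULL,
            PRIMARY KEY (app_id, region)
        );",
    )?;

    // Regional tier: fine-grained state for every Fly Machine in one region.
    let regional = Connection::open_in_memory()?;
    regional.execute_batch(
        "CREATE TABLE machines (
            machine_id TEXT PRIMARY KEY,
            app_id     TEXT NOT NULL,
            node       TEXT NOT NULL,
            state      TEXT NOT NULL -- e.g. 'started', 'stopped'
        );",
    )?;
    Ok(())
}
```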

anentropic

5 hours ago

blog posts should have a date at the top

chrisweekly

4 hours ago

YES. THIS. ALWAYS!

Huge pet peeve. At least this one has a date somewhere (at the bottom, "last updated Oct 22, 2025").

blinkingled

6 hours ago

> The bidding model is elegant, but it’s insufficient to route network requests. To allow an HTTP request in Tokyo to find the nearest instance in Sydney, we really do need some kind of global map of every app we host.

So is this a case of wanting to deliver a differentiating feature before the technical maturity is there and validated? That's an acceptable strategy if you are building a lesser product, but if you are selling Public Cloud, maybe having a better strategy than waiting for problems to crop up makes more sense. Consul, missing watchdogs, certificate expiry, CRDT backfilling nullable columns - sure, in a normal case these are not very unexpected or to-be-ashamed-of problems, but for a product that claims to be Public Cloud you want to think of these things and address them before day 1. Cert expiry, for example - you should be giving your users tools to never have a cert expire, not fixing it for your own stuff after the fact! (Most CAs offer an API to automate all this - no excuse for it.)

I don't mean to be dismissive or disrespectful; the problem is challenging and the work is great. I'm merely thinking of the loss of customer trust - people are never going to trust a newcomer that has issues like this, and for that reason "move fast, break things, and fix them when you find them" isn't a good fit for this kind of product.

tptacek

6 hours ago

It's not a "differentiating feature"; it eliminated a scaling bottleneck. It's also a decision that long predates Corrosion.

blinkingled

6 hours ago

I was referring to the "HTTP request in Tokyo to find the nearest instance in Sydney" part, which felt to me like a differentiating feature - no other cloud provider seems to have bidding or HTTP-request-level cross-regional lookup or whatever.

The "decision that long predates Corrosion" is precisely the point I was trying to make - was it made too soon before understanding the ramifications and/or having a validated technical solution ready? IOW maybe the feature requiring the problem solution could have come later? (I don't know much about fly.io and its features, so apologies if some of this is unclear/wrongly assumes things.)

tptacek

6 hours ago

That's literally the premise of the service and always has been.

x0x0

5 hours ago

fwiw, I'm happily running a company and some contract work on fly. It's literally "aws, but what if it weren't the most massively complex pile of shit you've ever seen."

I have a couple reasonably sized, understandable toml files and another 100 lines of ruby that runs long-running rake tasks as individual fly machines. The whole thing works really nicely.

soamv

10 hours ago

> New nullable columns are kryptonite to large Corrosion tables: cr-sqlite needs to backfill values for every row in the table

Is this a typo? Why does it backfill values for a nullable column?

andrewaylett

9 hours ago

I assume it would backfill values for any column, as a side effect of propagating values for any column. But nullable columns are the only kind you can add to a table that already contains rows, and adding one means that every row immediately has an update that needs to be sent.

jimmyl02

6 hours ago

always wondered at what scale gossip / SWIM breaks down and you need a hierarchy / partitioning. fly's use of corrosion seems to imply it's good enough for a single region which is pretty surprising because iirc Uber's ringpop was said to face problems at around 3K nodes.

it would be super cool to learn more about how the world's largest gossip systems work :)

tptacek

5 hours ago

SWIM is probably going to scale pretty much indefinitely. The issue we have with a single global SWIM broadcast domain isn't that the scale is breaking down; it's just that the blast radius for bugs (both in Corrosion itself, and in the services that depend on Corrosion) is too big.

We're actually keeping the global Corrosion cluster! We're just stripping most of the data out of it.

chucky_z

3 hours ago

Back-of-napkin math I've done previously says it breaks down around 2 million members with HashiCorp's defaults. The defaults are quite aggressive though, and if you can tolerate seconds of latency (as called out in the article) you could reach billions without a lot of trouble.

tptacek

3 hours ago

It's also frequency of changes and granularity of state, when sizing workloads. My understanding is that most Hashi shops would federate workloads of our size/global distribution; it would be weird to try to run one big cluster to capture everything.

chucky_z

2 hours ago

From a conversation I'm literally having right now, 'try to run one big cluster to capture everything' is our active state. I've brought up federation a bunch of times and it's fallen on deaf ears. :)

We are probably past the size of the entirety of fly.io for reference, and maintenance is very painful. It works because we are doing really strange things with Consul (batch txn cross-cluster updates of static entries) on really, really big servers (4gbps+ filesystems, 1tb memory, 100s of big and fast cores, etc).

conradev

6 hours ago

> To ensure every instance arrives at the same “working set” picture, we use cr-sqlite, the CRDT SQLite extension.

Cool to see cr-sqlite used in production!

mosura

6 hours ago

Someone needs to read about ant colony optimization. https://en.wikipedia.org/wiki/Ant_colony_optimization_algori...

This blog is not impressive for an infra company.

tucnak

6 hours ago

I respect Fly, and it does sound like a nice place to work, but honestly, you're onto something. You would expect an ostensibly Public Cloud provider to have a more solid grasp of networking. Instead, we're discovering how they're learning about things like OSPF!

Makes you think, that's all.

tptacek

5 hours ago

What a weird thing to say. I wrote my first OSPF implementation in 1999. The point is that we noticed the solution we'd settled on owes more to protocols like OSPF than to distributed consensus databases, which are the mainstream solution to this problem. It's not "OMG we just discovered this neat protocol called OSPF". We don't actually run OSPF. We don't even do a graph->tree reduction. We're routing HTTP requests, not packets.

mosura

5 hours ago

Look at one of the other comments:

> in case people don't read all the way to the end, the important takeaway is "you simply can't afford to do instant global state distribution"

This is what people saw as the key takeaway. If that takeaway is news to you then I don’t know what you are doing writing distributed systems.

While this message may not be what was intended, it was what was broadcast.

akerl_

5 hours ago

It seems weird to take an inaccurate paraphrase from a commenter and then use it to paint the authors with your desired brush.

mosura

4 hours ago

Not sure the replies to that comment help the cause at all.

bananapub

9 hours ago

in case people don't read all the way to the end, the important takeaway is "you simply can't afford to do instant global state distribution" - you can formal method and Rust and test and watchdog yourself as much as you want, but you simply have to stop doing that or the unknown unknowns will just keep taking you down.

tptacek

7 hours ago

I mean, the thing we're saying is that instant global state with database-style consensus is unworkable. Instant state distribution though is kind of just... necessary? for a platform like ours. You bring up an app in Europe, proxies in Asia need to know about it to route to it. So you say, "ok, well, they can wait a minute to learn about the app, not the end of the world". Now: that same European instance goes down. Proxies in Asia need to know about that, right away, and this time you can't afford to wait.

vlovich123

5 hours ago

> Now: that same European instance goes down. Proxies in Asia need to know about that, right away, and this time you can't afford to wait.

But they have to. Physically, no solution will be instantaneous, because that's not how the speed of light nor relativity works - even two events next to each other cannot find out about each other instantaneously. So then the question is "how long can I wait for this information?" And that's the part that I feel isn't answered - e.g. if the app dies, the TCP connections die, and in theory that information travels as quickly as anything else you send. It's not reliably detectable, but conceivably you could have an eBPF program monitoring death and notifying the proxies. That's the part that's really not explained in the article: why you need to maintain an eventually consistent view of the connectivity. I get maybe why that could be useful, but noticing app connectivity death seems wrong, considering I believe you're more tracking machine and cluster health, right? I.e. not noticing that an app instance goes down, but noticing that all app instances on a given machine are gone and consensus deciding globally where the new app instance will be as quickly as possible?

tptacek

5 hours ago

A request routed to a dead instance doesn't fall into a black hole: our proxies reroute it. But that's very slow; to deliver acceptable service quality you need to minimize the number of times that happens. So you can't accept a solution that leaves large windows of time within which every instance that has gone down has a stale entry. Remember: instances coming up and down happens all the time on this platform! It's part of the point.

__turbobrew__

6 hours ago

> Proxies in Asia need to know about that, right away, and this time you can't afford to wait.

Did you ever consider envoy xDS?

There are a lot of really cool things in envoy like outlier detection, circuit breakers, load shedding, etc…

tptacek

6 hours ago

Nope. Talk a little about how Envoy's service discovery would scale to millions of apps in a global network? There's no way we found the only possible point in the solution space. Do they do something clever here?

What we (think we) know won't work is a topologically centralized database that uses distributed consensus algorithms to synchronize. Running consensus transcontinentally is very painful, and keeping the servers central, so that update proposals are local and the protocol can run quickly, subjects large portions of the network to partition risk. The natural response (what I think a lot of people do, in fact) is just to run multiple consensus clusters, but our UX includes a global namespace for customer workloads.

__turbobrew__

6 hours ago

I haven’t personally worked on envoy xds, but it is what I have seen several BigCo’s use for routing from the edge to internal applications.

> Running consensus transcontinentally is very painful

You don't necessarily have to do that; you can keep your quorum nodes (let's assume we are talking about etcd) far enough apart to be in separate failure domains (fires, power loss, natural disasters) but close enough that network latency isn't unbearably high between the replicas.

I have seen the following scheme work for millions of workloads:

1. Etcd quorum across 3 close, but independent regions

2. On startup, the app registers itself under a prefix that all other app replicas also register under

3. All clients of that app issue etcd watches for that prefix and are notified almost instantly when there is a change. This is baked in as a plugin within grpc clients.

4. A custom grpc resolver is used to do lookups by service name
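Roughly, steps 2 and 3 look like this (a sketch assuming the community etcd-client crate in Rust; the real systems I've seen wrap this in internal libraries, so names are approximate):

```rust
// Sketch of register-under-a-prefix + watch-the-prefix, assuming the
// community `etcd-client` crate and tokio; API names approximate.
use etcd_client::{Client, PutOptions, WatchOptions};

#[tokio::main]
async fn main() -> Result<(), etcd_client::Error> {
    let mut client = Client::connect(["localhost:2379"], None).await?;

    // 2. On startup, register this replica under the service's prefix,
    //    attached to a lease so the entry disappears if we stop renewing it.
    let lease = client.lease_grant(10, None).await?;
    client
        .put(
            "/services/myapp/replica-1",
            "10.0.0.1:8080",
            Some(PutOptions::new().with_lease(lease.id())),
        )
        .await?;

    // 3. Clients watch the prefix and are notified of adds/removes.
    let (_watcher, mut stream) = client
        .watch("/services/myapp/", Some(WatchOptions::new().with_prefix()))
        .await?;
    while let Some(resp) = stream.message().await? {
        for event in resp.events() {
            println!("membership change: {:?}", event.event_type());
        }
    }
    Ok(())
}
```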

tptacek

6 hours ago

I'm thrilled to have people digging into this, because I think it's a super interesting problem, but: no, keeping quorum nodes close-enough-but-not-too-close doesn't solve our problem, because we support a unified customer namespace that runs from Tokyo to Sydney to São Paulo to Northern Virginia to London to Frankfurt to Johannesburg.

Two other details that are super important here:

This is a public cloud. There is no real correlation between apps/regions and clients. Clients are public Internet users. When you bring an app up, it just needs to work, for completely random browsers on completely random continents. Users can and do move their instances (or, more likely, reallocate instances) between regions with no notice.

The second detail is that no matter what DX compromise you make to scale global consensus up, you still need reliable realtime update of instances going down. Not knowing about a new instance that just came up isn't that big a deal! You just get less optimal routing for the request. Not knowing that an instance went down is a very big deal: you end up routing requests to dead instances.

The deployment strategy you're describing is in fact what we used to do! We had a Consul cluster in North America and ran the global network off it.

__turbobrew__

5 hours ago

> I'm thrilled to have people digging into this, because I think it's a super interesting problem

Yes, somehow this is a problem all the big companies have, but it seems like there is no standard solution and nobody has open sourced their stuff (except you)!

Taking a step back and thinking about the AWS outage last week, which was caused by a buggy bespoke system built on top of DNS, it seems like we need an IETF standard for service discovery. DNS++ if you will. I have seen lots of (ab)use of DNS for dynamic service discovery, and it seems like we need a better solution, either push-based or gossip-based, to more quickly disseminate service discovery updates.

otterley

an hour ago

I work for AWS; opinions are my own and I’m not affiliated with the service team in question.

That a DNS record was deleted is tangential to the proximate cause of the incident. It was a latent bug in the control plane that updated the records, not the data plane. If the discovery protocol were DNS++ or /etc/hosts files, the same problem could have happened.

DNS has a lot of advantages: it's a dirt cheap protocol to serve (both in terms of bytes over the wire and CPU utilization), is reasonably flexible (new RR types are added as needs warrant), isn't filtered by middleboxes, has separate positive and negative caching, and server implementations are very robust. If you're going to replace DNS, you're going to have a steep hill to climb.

__turbobrew__

5 hours ago

> you still need reliable realtime update of instances going down

The way I have seen this implemented is through a cluster of service watchers that ping all services once every X seconds and deregister a service when the pings fail.

Additionally you can use grpc with keepalives, which will detect on the client side when a service goes down and automatically remove it from the subset. Grpc also has client-side outlier detection, so the clients can automatically remove slow servers from the subset as well. This only works for grpc though, so it's not generally useful if you are creating a cloud for HTTP servers…
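For the keepalive part, a hedged sketch of what the client-side config looks like with tonic in Rust (address and intervals are made up):

```rust
// Hypothetical tonic client channel with HTTP/2 keepalives, so a dead
// backend is noticed within a bounded window without central coordination.
use std::time::Duration;
use tonic::transport::Endpoint;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let channel = Endpoint::from_static("http://10.0.0.1:8080")
        .http2_keep_alive_interval(Duration::from_secs(10)) // ping every 10s
        .keep_alive_timeout(Duration::from_secs(5))          // fail if no ack within 5s
        .keep_alive_while_idle(true)
        .connect()
        .await?;
    // `channel` would then back a generated gRPC client; a broken connection
    // surfaces as an error the caller can use to drop this backend.
    let _ = channel;
    Ok(())
}
```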

tptacek

5 hours ago

Detecting that the service went down is easy. Notifying every proxy in the fleet that it's down is not. Every proxy in the fleet cannot directly probe every application on the platform.

__turbobrew__

2 hours ago

I believe it is possible within envoy to detect a bad backend and automatically remove it from the load balancing pool, so why can the proxy not determine that certain backend instances are unavailable and remove them from the pool? No coordination needed, and it also handles other cases where the backend is bad, such as overload or deadlock.

It also seems like part of your pain point is that there is an any-to-any relationship between proxy and backend, but that doesn't necessarily need to be the case; cell-based architecture with shuffle sharding of backends between cells can help alleviate that fundamental pain. Part of the advantage of this is that config and code changes can then be rolled out cell by cell, which is much safer: if your code/configs cause a fault in a cell, it will only affect a subset of infrastructure. And if you did shuffle sharding correctly, it should have a negligible effect when a single cell goes down.

tptacek

an hour ago

Ok, again: this isn't a cluster of load balancers in front of a discrete collection of app servers in a data center. It's thousands of load balancers handling millions of applications scattered all over the world, with instances going up and down constantly.

The interesting part of this problem isn't noticing that an instance is down. Any load balancer can do that. The interesting problem is noticing that and then informing every proxy in the world.

I feel like a lot of what's happening in these threads is people using a mental model that they'd use for hosting one application globally, or, if not one, then a collection of applications they manage. These are customer applications. We can't assume anything about their request semantics.

__turbobrew__

17 minutes ago

> The interesting problem is noticing that and then informing every proxy in the world.

Yes, and that is why I suggested that your any-to-any relationship of proxy to application is a decision you have made, and that it is part of the pain point that caused you to come up with this solution. The fact that any proxy box can proxy to any backend is a choice, and that choice created the structure and mental model you are working within. You could batch your proxies into, say, 1024 cells and then assign a customer app to, say, 4/1024 cells using shuffle sharding (a toy sketch below). Then that decomposes the problem into maintaining state within a cell instead of globally.
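A toy sketch of that assignment step (the 1024/4 numbers are from this comment, not anything Fly actually does):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical shuffle-sharding sketch: deterministically map an app to a
// small subset of proxy cells by hashing (app_id, seed) until enough
// distinct cells have been picked.
fn cells_for_app(app_id: &str, total_cells: u64, cells_per_app: usize) -> Vec<u64> {
    let mut chosen = Vec::with_capacity(cells_per_app);
    let mut seed = 0u64;
    while chosen.len() < cells_per_app {
        let mut h = DefaultHasher::new();
        (app_id, seed).hash(&mut h);
        let cell = h.finish() % total_cells;
        if !chosen.contains(&cell) {
            chosen.push(cell);
        }
        seed += 1;
    }
    chosen
}

fn main() {
    // The same app always lands on the same 4 of 1024 cells.
    println!("{:?}", cells_for_app("my-app", 1024, 4));
}
```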

I'm not saying what you did was wrong or dumb; I am saying you are working within a framework that maybe you are not even consciously aware of.

tptacek

14 minutes ago

Again: it's the premise of the platform. If you're saying "you picked a hard problem to work on", I guess I agree.

We cannot in fact assign our customers' apps to 0.3% of our proxies! When you deploy an app in Chicago on Fly.io, it has to work from a Sydney edge. I mean, that's part of the DX; there are deeper reasons why it would have to work that way (due to BGP4), but we don't even get there before becoming a different platform.

otterley

an hour ago

Out of curiosity, what’s your upper bound latency SLO for propagating this state? (I assume this actually conforms to a percentile histogram and isn’t a single value.)

JoachimSchipper

2 hours ago

(Hopping in here because the discussion is interesting... feel very free to ignore.)

Thanks for writing this up! It was a very interesting read about a part of networking that I don't get to seriously touch.

That said: I'm sure you guys have thought about this a lot and that I'm just missing something, but "why can't every proxy probe every [worker, not application]?" was exactly one of the questions I had while reading.

Having the workers being the source-of-truth about applications is a nicely resilient design, and bruteforcing the problem by having, say 10k proxies each retrieve the state of 10k workers every second... may not be obviously impossible? Somewhat similar to sending/serving 10k DNS requests/s/worker? That's not trivial, but maybe not _that_ hard? (You've been working on modern Linux servers a lot more than I, but I'm thinking of e.g. https://blog.cloudflare.com/how-to-receive-a-million-packets...)

I did notice the sentence about "saturating our uplinks", but... assuming 1KB=8Kb of compressed critical state per worker, you'd end up with a peak bandwidth demand of about 80 Mbps of data per worker / per proxy; that may not be obviously impossible? (One could reduce _average_ bandwidth a lot by having the proxies mostly send some kind of "send changes since <...>" or "send all data unless its hash is <...>" query.)
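Spelling that arithmetic out with the same assumed numbers:

```rust
// Back-of-envelope for the brute-force polling idea above, using the same
// assumed numbers (10k proxies, 10k workers, ~1 KB of compressed state,
// polled once per second).
fn main() {
    let proxies: u64 = 10_000;
    let workers: u64 = 10_000;
    let state_bits: u64 = 8_000; // ~1 KB per worker, per the assumption above
    let polls_per_sec: u64 = 1;

    // Each worker answers every proxy once per second.
    let worker_egress_bps = proxies * state_bits * polls_per_sec;
    // Each proxy pulls state from every worker once per second.
    let proxy_ingress_bps = workers * state_bits * polls_per_sec;

    println!("per-worker egress: {} Mbps", worker_egress_bps / 1_000_000); // 80
    println!("per-proxy ingress: {} Mbps", proxy_ingress_bps / 1_000_000); // 80
}
```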

(Obviously, bruteforcing the routing table does not get you out of doing _something_ more clever than that to tell the proxies about new workers joining/leaving the pool, and probably a hundred other tasks that I'm missing; but, as you imply, not all tasks are equally timing-critical.)

The other question I had while reading was why you need one failure/replication domain (originally, one global; soon, one per-region); if you shard worker state over 100 gossip (SWIM Corrosion) instances, obviously your proxies do need to join every sharded instance to build the global routing table - but bugs in replication per se should only take down 1/100th of your fleet, which would hit fewer customers (and, depending on the exact bug, may mean that customers with some redundancy and/or autoscaling stay up.) This wouldn't have helped in your exact case - perfectly replicating something that takes down your proxies - but might make a crash-stop of your consensus-ish protocol more tolerable?

Both of the questions above might lead to a less convenient programming model, which would be enough reason on its own to scupper it; an article isn't necessarily improved by discussing every possible alternative; and again, I'm sure you guys have thought about this a lot more than I did (and/or that I got a couple of things embarrassingly wrong). But, well, if you happen to be willing to entertain my questions I would appreciate it!

DAlperin

2 hours ago

(I used to work at Fly, specifically on the proxy, so my info may be slightly out of date, but I've spent a lot of time thinking about this stuff.)

> why can't every proxy probe every [worker, not application]?

There are several divergent issues with this approach (though it can have its place). First, you still need _some_ service discovery to tell you where the nodes are, though it's easy to assume this can be solved via some consul-esque system. Secondly, there is a lot more data at play here than you might be thinking. A single proxy/host might have many thousands of VMs under its purview. That works out to a lot of data. As you point out, there are ways to solve this:

> One could reduce _average_ bandwidth a lot by having the proxies mostly send some kind of "send changes since <...>" or "send all data unless its hash is <...>" query.

This is definitely an improvement. But we have a new issue. Let's say I have proxies A, B, and C. A and C lose connectivity. Optimally (and in fact fly has several mechanisms for this) A could send its traffic to C via B. But in this case it might not even know that there is a VM candidate on C at all! It wasn't able to sync data for a while.

There are ways to solve this! We could make it possible for proxies to relay each other's state. To recap:

- We have workers that poll each other
- They exchange diffs rather than the full state
- The state diffs can be relayed by other proxies

We have in practice invented something quite close to a gossip protocol! If we continued drawing the rest of the owl you might end up with something like SWIM.

As far as your second question I think you kinda got it exactly. A crash of a single corrosion does not generally affect anything else. But if something bad is replicated, or there is a gossip storm, isolating that failure is important.

tptacek

2 hours ago

Hold up, I sniped Dov into answering this instead of me. :)

justinparus

4 hours ago

The solutions across different BigCorp Clouds vary depending on the SLA from their underlying network. Doing this on top of the public internet is very different than on redundant subsea fiber with dedicated BigCorp bandwidth!

otterley

an hour ago

Lots of solutions appear to work in a steady-state scenario—which, admittedly, is most of the time. The key question is how resilient to failure they are, not just under blackout conditions but brownouts as well.

Many people will read a comment like this and cargo-cult an implementation (“millions of workloads”, you say?!) without knowing how they are going to handle the many different failure modes that can result, or even at what scale the solution will break down. Then, when the inevitable happens, panic and potentially data loss will ensue. Or, the system will eventually reach scaling limits that will require a significant architectural overhaul to solve.

TL;DR: There isn’t a one-size-fits-all solution for most distributed consensus problems, especially ones that require global consistency and fault tolerance, and on top of that have established upper bounds on information propagation latency.

hedgehog

6 hours ago

Is it actually necessary to run transcontinental consensus? Apps in a given location are not movable, so it would seem that for a given app it's known which part of the network writes can come from. That would require partitioning the namespace but, given that apps are not movable, does that matter? It feels like there are other areas, like docs and tooling, that would benefit from relatively higher prioritization.

tptacek

6 hours ago

Apps in a given location are extremely movable! That's the point of the service!

hedgehog

5 hours ago

We unfortunately lost our location with not a whole lot of notice and the migration to a new one was not seamless, on top of things like the GitHub actions being out of date (only supporting the deprecated Postgres service, not the new one).

nodesocket

6 hours ago

Anybody used rqlite[1] in production? I'm exploring how to make my application fault-tolerant using multiple app vm instances. The problem of course is the SQLite database on disk. Using a network file system like NFS is a no-go with SQLite (this includes Amazon Elastic File System (EFS)).

I was thinking I'll just have to bite the bullet and migrate to PostgreSQL, but perhaps rqlite can work.

[1] https://rqlite.io

throwaway290

10 hours ago

I guess all designers at fly were replaced by ai because this article is using a gray bold font for the whole text. I remember these guys had a good blog some time ago.

tptacek

7 hours ago

The design hasn't changed in years. If someone has a screenshot and a browser version we can try to figure out why it's coming out fucky for you.

kg

7 hours ago

Looking at the css, there's a .text-gray-600 CSS style that would cause this, and it's overridden by some other style in order to achieve the actual desired appearance. Maybe the override style isn't loading - perhaps the GP has javascript disabled?

throwaway290

3 hours ago

javascript is enabled but I don't see the problem on another phone, so yeah seems related

dewey

10 hours ago

Not sure if that was changed since then, but it's not bold for me and also readable. Maybe browser rendering?

ceigey

9 hours ago

Also not bold for me (Safari). Variable font rendering issue?

throwaway290

8 hours ago

stock safari on ios 26 for me. is it another of 37366153 regressions of ios 26?

iviv

7 hours ago

Looks normal to me on iOS 26.0.1

throwaway290

8 hours ago

stock safari on ios

and I think the intended webfont is loaded, because the font is clearly weird-ish and non-standard, and the text is invisible for a good 2 seconds at first while it loads :)

mcny

8 hours ago

Please try the article mode in your web browser. Firefox has a pretty good one but I understand all major browsers have this now.

throwaway290

8 hours ago

I only use article mode in exceptional cases. I hold fly to a higher standard than that.

jjtheblunt

7 hours ago

latest macos firefox and safari both show grey on white: legible, though contrast is somewhat lacking, but otherwise rendered properly.

foofoo12

10 hours ago

It's totally unreadable.

davidham

8 hours ago

Looks like it always has, to me.

tucnak

6 hours ago

What's this obsession with SQLite? For all intents and purposes, what they'd accomplished is effectively a Type 2 table with extra steps. CRDT is totally overkill in this situation. You can implement this in Postgres easily with very few changes to your access patterns... DISTINCT ON. Maybe this kind of "solution" is impressive to Rust programmers, I'm not sure what the deal is exactly, but all it tells me is Fly ought to hire actual networking professionals, maybe even compute-in-network guys with FPGA experience like everyone else, and develop their own routers that way, if only to learn more about networking.

tptacek

6 hours ago

What part of this problem do you think FPGAs would help with?

In what sense do you think we need specialty routers?

How would you deploy Postgres to address these problems?