AWS Multiple Services Down in us-east-1

935 points, posted 10 hours ago
by kondro

408 Comments

stego-tech

8 minutes ago

Not remotely surprised. Any competent engineer knows full well the risk of deploying into us-east-1 (or any “default” region, for that matter), as well as the risks of relying on global services whose management or interaction layer exists only in that region. Unfortunately, us-east-1 is where most outsourcing firms throw stuff, because they don’t have to support it when it goes pear-shaped (that’s the client’s problem, not theirs).

My refusal to pile every asset into AWS (let alone put anything of import in us-east-1) has saved me repeatedly in the past. Diversity is the foundation of resiliency, after all.

indoordin0saur

an hour ago

Seems like major issues are still ongoing. If anything, it seems worse than it did ~4 hours ago. For reference, I'm a data engineer, and it's Redshift and Airflow (AWS-managed) that are FUBAR for me.

markus_zhang

17 minutes ago

It has been quite a while; I'm wondering how many 9s have been dropped.

365 days * 24 hours * 0.0001 is roughly 53 minutes, so the 99.99% status is already long gone. Even the 99.9% budget (365 * 24 * 0.001, roughly 8.8 hours) is nearly spent.
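For reference, the annual downtime budgets work out like this (plain arithmetic, ignoring any SLA fine print):

    # Annual downtime budget for a given availability target.
    HOURS_PER_YEAR = 365 * 24  # 8,760

    for nines in ("99.9", "99.99", "99.999"):
        allowed_minutes = HOURS_PER_YEAR * (1 - float(nines) / 100) * 60
        print(f"{nines}% -> {allowed_minutes:.0f} minutes/year")

    # Output, annotated:
    # 99.9%   -> 526 minutes/year (~8.8 hours)
    # 99.99%  -> 53 minutes/year
    # 99.999% -> 5 minutes/year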

codeduck

12 minutes ago

I'm sure they'll find some way to weasel out of this.

PeterCorless

an hour ago

Downdetector had 5,755 reports of AWS problems at 12:52 AM Pacific (3:52 AM Eastern).

That number had dropped to 1,190 by 4:22 AM Pacific (7:22 AM Eastern).

However, that number is back up with a vengeance: 9,230 reports as of 9:32 AM Pacific (12:32 PM Eastern).

Part of that could be explained by more people making reports as the U.S. west coast awoke. But I also have a feeling that they aren't yet on top of the problem.

rogerrogerr

24 minutes ago

Where do they source those reports from? Always wondered if it was just analysis of how many people are looking at the page, or if humans somewhere are actually submitting reports.

jedberg

23 minutes ago

It's both. They count a hit from Google as a report of that site being down, and they also count the actual reports people make.

outworlder

an hour ago

I'm wondering why your company and others haven't just evicted themselves from us-east-1. It's the worst region for outages, and it's not even close.

Our company decided years ago to use any region other than us-east-1.

Of course, that doesn't help with services that are 'global', which usually means us-east-1.

jedberg

22 minutes ago

Some AWS services are only available in us-east-1. Also a lot of people have not built their infra to be portable and the occasional outage isn't worth the cost and effort of moving out.

twistedpair

19 minutes ago

Services like SES Inbound are only available in two US regions. AWS isn't great about making all services available in all regions :/

kondro

15 minutes ago

One of those still isn’t us-east-1 though and email isn’t latency-bound.

sleepybrett

17 minutes ago

So did a previous company I worked at; all our stuff was in us-west-2. Then us-east-1 went down, some global backend services that AWS depended on went down with it, and that affected us-west-2.

I'm not sure a lot of companies are really weighing the costs of multi-region resiliency and hot failovers against being down for six hours every year or so and writing that check.

DrBenCarson

6 minutes ago

Yep. Many, many companies are fine saying “we’re going to be no more available than AWS is.”

lordnacho

35 minutes ago

Is there some reason why "global" services aren't replicated across regions?

I would think a lot of clients would want that.

rhplus

13 minutes ago

Data residency laws may be a factor in some global/regional architectures.

bcrosby95

33 minutes ago

Global replication is hard, and if they weren't designed with that in mind, it's probably a whole lot of work.

ineedasername

23 minutes ago

I thought part of the point of using AWS was that such things were pretty much turnkey?

Forricide

an hour ago

Definitely seems to be getting worse. Outside of AWS itself, more websites seem to be having sporadic or serious issues. Concerning, considering how long the outage has been going on.

busymom0

an hour ago

That's probably why Reddit has been down too

autophagian

an hour ago

Yeah. We had a brief window where everything resolved and worked, and now we're running into really mysterious, flaky networking issues where pods in our EKS clusters time out talking to the k8s API.

cj00

14 minutes ago

Yeah, networking issues cleared up for a few hours but now seem to be as bad as before.

JCM9

an hour ago

Agree… still seeing major issues. It briefly looked like things were getting better, but they're falling apart again.

assholesRppl2

an hour ago

Yep, confirmed worse - DynamoDB now returning "ServiceUnavailableException"

dutzi

an hour ago

Here as well…

claudiug

an hour ago

ServiceUnavailableException hello java :)

wavemode

an hour ago

SEV-0 for my company this morning. We can't connect to RDS anymore.

jmuguy

35 minutes ago

Yeah, we were fine until about 10:30 Eastern and have been completely down since then. Heroku customer.

perching_aix

an hour ago

In addition to those, SageMaker also fails for me with an internal auth error, specifically in Virginia. Fun times. Hope they recover by tomorrow.

whaleofatw2022

an hour ago

Dangerous curiosity, but I have to ask: is the number of folks off for Diwali a factor or not?

I.e., lots of folks who weren't expected to work today, and/or difficulty rounding them up to work the problem.

junon

an hour ago

Seeing as how this is us-east-1, probably not a lot.

redeux

an hour ago

I believe the implication is that a lot of critical AWS engineers are of Indian descent and are off celebrating today.

herewulf

11 minutes ago

junon's implication may be that AWS engineers of Indian descent would tend to be located on the West Coast.

ljdtt

an hour ago

First time I've seen "FUBAR". Is that a common expression in the industry? Just curious (English is not my native language).

sorentwo

an hour ago

It is an old US military term that means “F*cked Up Beyond All Recognition”.

dingnuts

an hour ago

FUBAR being a bit worse than SNAFU: "situation normal: all fucked up" which is the usual state of us-east-1

vishnugupta

an hour ago

It used to be quite common but has fallen out of usage.

strictnein

an hour ago

FUBAR: Fucked Up Beyond All Recognition

Somewhat common. Comes from the US military in WW2.

parliament32

an hour ago

Yes, although it's military in origin.

wcchandler

18 minutes ago

This is usually something I'd see on Reddit first, within minutes. I've barely seen anything on my front page. While I understand it's likely down to the subs I'm subscribed to, that was my only reason for using Reddit. I've noticed that for the past year, more and more tech-heavy news events don't bubble up as quickly anymore. I also didn't see this post for a while, for whatever reason. And Digg was hit and miss on availability for me; I'm just now seeing it load with an item about this.

I think I might be ready to build out a replacement through vibe coding. I don’t like being dependent on user submissions though. I feel like that’s a challenge on its own.

kccqzy

15 minutes ago

Reddit itself is having issues. I've had multiple comments fail to post, and the Reddit user page leads me to a 404.

ryanisnan

10 minutes ago

Anecdotally, I think you should disregard this. I found out about this issue first via Reddit, roughly 30 minutes after the onset (we had an alarm about control plane connectivity).

midtake

10 minutes ago

Reddit is worthless now, and posting about your tech infrastructure on Reddit is a security and opsec lapse. My workplace has Reddit blocked at the edge. I would trust X more than Reddit, and that is with X having active honeypot accounts (the Asian-girl accounts are even a meme). In fact, I heard about this outage on X before anywhere else.

postexitus

17 minutes ago

Most of the Reddit API is down as well.

Krasnol

5 minutes ago

I found it pretty fast on the /r/signal sub and went from there.

altbdoor

an hour ago

Had a meeting where developers were discussing the infrastructure for an application. A crucial part of the whole flow was completely dependent on an AWS service. I asked if it was a single point of failure. The whole room laughed. I rest my case.

aeve890

an hour ago

Similar experience here. People laughed, and some said something like "well, if something like AWS falls over, then we have bigger problems". They laughed because, honestly, it seems too far-fetched to imagine the whole AWS infra going down. Too big to fail, as they say in the US. Nothing short of a nuclear war would take out the entire AWS network, so they're kinda right.

Until this happens: a single region has a cascade failure, and your SaaS is single-region.

stephenlf

41 minutes ago

They’re not wrong though. If AWS goes down, EVERYTHING goes down to some degree. Your app, your competitor’s apps, your clients’ chat apps. You’re kinda off the hook.

oceanplexian

22 minutes ago

Why would your competitors go down? AWS has at best 30-35% market share. And that's ignoring the huge mass of companies who still run their infrastructure on bare metal.

codeduck

6 minutes ago

because your competitors are probably using services that depend on AWS.

palmotea

36 minutes ago

>> People laughed and some said something like "well, if something like AWS falls then we have bigger problems".

> They’re not wrong though. If AWS goes down, EVERYTHING goes down to some degree. Your app, your competitor’s apps, your clients’ chat apps. You’re kinda off the hook.

They made their own bigger problems by all crowding into the same single region.

pluto_modadic

2 minutes ago

It's a weird effect:

Imagine a beach with ice cream vendors. You'd think it would be optimal for two vendors to split it, one taking the north half and one the south. But with each wanting to steal some of the other's customers, you end up with both ice cream stands in the center.

So too with outages: safety, and diffusion of blame, in numbers.

Elidrake24

an hour ago

If you were dependent upon a single distribution (region) of that Service, yes it would be a massive single point of failure in this case. If you weren't dependent upon a particular region, you'd be fine.

zimbu668

an hour ago

Of course, lots of AWS services have hidden dependencies on us-east-1. During a previous outage we needed to update a Route53 (DNS) record in us-west-2, but couldn't because of the outage in us-east-1.

ineedasername

16 minutes ago

So, AWS's redundant availability goes something like: "Don't worry, if nothing is working in us-east-1, it will trigger failover to another region" ... "Okay, where's that trigger located?" ... "Also in the us-east-1 region" ... "Doesn't that seem like a problem to you?" ... "You'd think it might be! But our logs say it's never been used."

wubrr

an hour ago

Some 'regional' AWS services still rely on other services (some internal) that are only in us-east-1.

antinomicus

17 minutes ago

Even Amazon’s own services (e.g. Ring) were affected by this outage.

mlhpdx

2 hours ago

Cool, building in resilience seems to have worked. Our static site has origins in multiple regions via CloudFront and didn’t seem to be impacted (not sure if it would have been anyway).

My control plane is native multi-region, so while it depends on many impacted services it stayed available. Each region runs in isolation. There is data replication at play but failing to replicate to us-east-1 had no impact on other regions.

The service itself is also native multi-region and has multiple layers where failover happens (DNS, routing, destination selection).

Nothing’s perfect and there are many ways this setup could fail. It’s just cool that it worked this time - great to see.

Nothing I’ve done is rocket science or expensive, but it does require doing things differently. Happy to answer questions about it.
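To make the DNS failover layer concrete: one common shape for it is Route 53 health-checked failover records. A boto3 sketch, where the zone ID, health check IDs, IPs, and hostname are all placeholders rather than anything from my actual setup:

    import boto3

    route53 = boto3.client("route53")

    def failover_record(set_id, role, ip, health_check_id):
        # Route 53 serves the SECONDARY record only while the
        # PRIMARY's health check is failing.
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api.example.com.",
                "Type": "A",
                "SetIdentifier": set_id,
                "Failover": role,  # "PRIMARY" or "SECONDARY"
                "TTL": 60,
                "ResourceRecords": [{"Value": ip}],
                "HealthCheckId": health_check_id,
            },
        }

    route53.change_resource_record_sets(
        HostedZoneId="Z0000000EXAMPLE",
        ChangeBatch={"Changes": [
            failover_record("west", "PRIMARY", "198.51.100.10", "hc-west-id"),
            failover_record("east", "SECONDARY", "198.51.100.20", "hc-east-id"),
        ]},
    )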

SteveNuts

an hour ago

> Our static site has origins in multiple regions via CloudFront and didn’t seem to be impacted

This seems like such a low bar for 2025, but here we are.

AndrewKemendo

16 minutes ago

How did you do resilient auth for keys and certs?

chibea

6 hours ago

One main problem that we observed was that big parts of their IAM/auth setup were overloaded or down, which led to all kinds of cascading problems. DynamoDB was reported to be a root cause, so is IAM dependent on Dynamo internally?

Of course, such a large control-plane system has all kinds of complex dependency chains. Auth/IAM seems like such a potential (global) SPOF that you'd want to reduce its dependencies to an absolute minimum. On the other hand, it's also the place that needs really good scalability, consistency, etc., so you'd probably like to use the battle-proven DB infrastructure you already have in place. Does that mean you end up with a complex cyclic dependency that needs complex bootstrapping when it goes down? Or how is that handled?

wwdmaxwell

6 hours ago

I think Amazon uses an internal platform called Dynamo as a KV store. It's different from DynamoDB, so I'm thinking the outage could be either a DNS routing issue or some kind of node deployment problem.

Both of which seem to crop up in post-mortems for these widespread outages.

oofbey

2 hours ago

They said the root cause was DNS for DynamoDB. Inside AWS, relying on DynamoDB is highly encouraged, so it's not surprising that a failure there would cascade broadly. The fact that EC2 instance launching is affected is surprising. Loops in the service dependency graph are known to be a bad idea.

cowsandmilk

6 hours ago

Many AWS customers have bad retry policies that will overload other systems as part of their retries. DynamoDB being down will cause them to overload IAM.

joncrane

4 hours ago

Which is interesting because per their health dashboard,

>We recommend customers continue to retry any failed requests.

otterley

2 hours ago

They should continue to retry but with exponential backoff and jitter. Not in a busy loop!
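For illustration, a minimal retry helper with capped exponential backoff and full jitter; the function name and parameters are made up for the sketch, not an SDK API:

    import random
    import time

    def call_with_backoff(fn, max_attempts=8, base=0.2, cap=30.0):
        """Retry fn, sleeping with capped exponential backoff and full jitter."""
        for attempt in range(max_attempts):
            try:
                return fn()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # Full jitter: sleep a random amount up to the exponential
                # ceiling, so retries from many clients don't land in waves.
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

The AWS SDKs already do this by default (boto3 exposes it via its retry config, e.g. mode "standard" or "adaptive"), so a busy retry loop usually means someone bypassed the SDK's retry layer.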

bcrosby95

an hour ago

If the reliability of your system depends upon the competence of your customers then it isn't very reliable.

otterley

an hour ago

Have you ever built a service designed to operate at planetary scale? One that's built of hundreds or thousands of smaller service components?

There's no such thing as infinite scalability. Even the most elastic services are not infinitely elastic. When resources are short, you either have to rely on your customers to retry nicely, or you have to shed load during overload scenarios to protect goodput (which will deny service to some). For a high demand service, overload is most likely during the first few hours after recovery.

See e.g., https://d1.awsstatic.com/builderslibrary/pdfs/Resilience-les...

ifwinterco

42 minutes ago

Probably a stupid question (I am not a network/infra engineer), but can you not simply rate-limit requests (by IP or some other method)?

Yes, your customers may well implement stupidly aggressive retries, but that shouldn't break your stuff; they should just start getting 429s?

otterley

29 minutes ago

Load shedding effectively does that. 503 is the correct error code here to indicate temporary failure; 429 means you've exhausted a quota.
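A toy sketch of shedding at the front door, WSGI-style; the concurrency limit is invented, real systems size it from measured capacity:

    import threading

    MAX_IN_FLIGHT = 100  # made-up threshold
    _slots = threading.Semaphore(MAX_IN_FLIGHT)

    def app(environ, start_response):
        # Over capacity: reject instead of queueing, protecting goodput
        # for the requests we do accept.
        if not _slots.acquire(blocking=False):
            start_response("503 Service Unavailable", [("Retry-After", "5")])
            return [b"overloaded, retry with backoff\n"]
        try:
            start_response("200 OK", [("Content-Type", "text/plain")])
            return [b"ok\n"]
        finally:
            _slots.release()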

veltas

3 hours ago

You can't exactly change existing widespread practice, so they have to be ready for that kind of client behavior.

cyberax

2 hours ago

When I worked at AWS several years ago, IAM was not dependent on Dynamo. It might have changed, but I highly doubt it. Maybe some kind of network issue with high-traffic services?

> Auth/IAM seems like such a potentially (global) SPOF that you'd like to reduce dependencies to an absolute minimum.

IAM is replicated, so each region has its own read-only IAM cache. AWS SigV4 is also designed to be regionalized; if you ever wondered why the signature key derivation has so many steps, that's exactly why ( https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_s... ).
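For the curious, the derivation chain from the SigV4 docs looks like this. Each HMAC step scopes the key to a date, region, and service, so a key derived for one region is useless in another (the example values in the comments are illustrative):

    import hashlib
    import hmac

    def _hmac(key: bytes, msg: str) -> bytes:
        return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

    def sigv4_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
        k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date)  # e.g. "20251020"
        k_region = _hmac(k_date, region)                             # e.g. "us-west-2"
        k_service = _hmac(k_region, service)                         # e.g. "dynamodb"
        return _hmac(k_service, "aws4_request")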

JCM9

an hour ago

At 3:03 AM PT, AWS posted that things were recovering, and it sounded like the issue was resolved.

Then things got worse. At 9:13 AM PT, it sounds like they're back to troubleshooting.

Honestly, it sounds like AWS doesn't really know what's going on. Not good.

vishnugupta

an hour ago

This is exacerbated by the fact that this is Diwali week, which means most Indian engineers will be out on leave. Tough luck.

dingnuts

22 minutes ago

atheists should get a week when they don't have to reply to pages, too, damnit

jedimastert

3 minutes ago

I'm unaware of any organization that doesn't give everyone the same number of vacation days regardless of religion, plus region- or company-wide holidays...

bigbuppo

an hour ago

Whose idea was it to make the whole world dependent on us-east-1?

alchemism

44 minutes ago

AWS often deploys its new platform products and latest hardware (think GPUs) into us-east-1, so everyone has to maintain a footprint in us-east-1 to use any of it.

So as a result, everyone keeps production in us-east-1. =)

nemomarx

an hour ago

The NSA might be happy everything runs through a local data center to their Virginia offices

kondro

a minute ago

The most recent public count of datacenters for AWS in us-east-1 is 159. I suspect that’s an unwieldy number even for the NSA to spy on.

rickmode

an hour ago

Isn’t it the cheapest AWS region? Or at least among the cheapest. If I’m correct, this incentivizes users to start there.

everfrustrated

15 minutes ago

All non-US regions are more expensive than the US ones, but several US regions share the same (cheapest) price (e.g. Ohio, Oregon).

us-east-1's dominance is more a legacy of it being the first region: by virtue of being the default region for the longest time, most customers built on it.

wting

an hour ago

Eh, us-east-1 is the oldest AWS region and if you get some AWS old timers talking over some beers they'll point out the legacy SPOFs that still exist in us-east-1.

lenerdenator

an hour ago

1) People in us-east-1.

2) People who thought that just having stuff "in the cloud" meant it was automatically spread across regions. Hint: it's not; you have to deploy it in different regions and architect/maintain around that.

3) Accounting.

ibejoeb

3 hours ago

This is just a silly anecdote, but every time a cloud provider blips, I'm reminded. The worst architecture I've ever encountered was a system that was distributed across AWS, Azure, and GCP. Whenever any one of them had a problem, the system went down. It also cost 3x more than it should.

spyspy

2 hours ago

I've seen the exact same thing at multiple companies. The teams were always so proud of themselves for being "multi-cloud" and managers rewarded them for their nonsense. They also got constant kudos for their heroic firefighting whenever the system went down, which it did constantly. Watching actually good engineers get overlooked because their systems were rock-solid while those characters got all the praise for designing an unadulterated piece of shit was one of the main reasons I left those companies.

koakuma-chan

2 hours ago

Were you able to find a better company? If yes, what kind of company?

spyspy

an hour ago

I became one of the founding engineers at a startup, which worked for a little while until the team grew beyond my purview, and no good engineering plan survives contact with sales directors who lie to customers about capabilities our platform has.

waynesonfire

24 minutes ago

Multi-cloud... any leader who approves such a boondoggle should be labelled incompetent. These morons sell it as a cost-cutting "migration". Never once have I seen such a project complete, and it more than doubles complexity and costs.

AnimeLife

2 hours ago

Looks like very few get it right. A good system would have a few minutes of blip when one cloud provider goes down, which is a massive win compared to outages like this.

steveBK123

2 hours ago

They all make it pretty hard, and a lot of resume-driven devs have a hard time resisting the temptation of all the AWS alphabet soup of services.

Sure you can abstract everything away, but you can also just not use vendor-flavored services. The more bespoke stuff you use the more lock in risk.

But if you are in a "cloud forward" AWS mandated org, a holder of AWS certifications, alphabet soup expert... thats not a problem you are trying to solve. Arguably the lock in becomes a feature.

bcrosby95

an hour ago

In general, the problem with abstracting infrastructure is that you have to code to the lowest common denominator. Sometimes it's worth it. For the companies I work for, it really isn't.

notyourwork

an hour ago

Lock-in is another way of saying "bespoke product offering". Sometimes solving, yourself, the problem a cloud provider's service solves is not worth it. This locks you in for the same reason a specific restaurant locks you in: it's their recipe.

steveBK123

an hour ago

Putting aside outages..

I'd counter that past a certain scale, certainly the scale of a firm that used to & could run its own datacenter.. it's probably your responsibility to not use those services.

Sure it's easier, but if you decide feature X requires AWS service Y that has no GCP/Azure/ORCL equivalent.. it seems unwise.

Just from a business perspective, you are making yourself hostage to a vendor on pricing.

If you're some startup trying to find traction, or a small shop with an IT department of 5.. then by all means, use whatever cloud and get locked in for now.

But if you are a big bank, car maker, whatever.. it seems grossly irresponsible.

On the east coast we are already approaching an entire business day being down today. Gonna need a decade without an outage to get all those 9s back. And not to be catastrophic but.. what if AWS had an outage like this that lasted.. 3 days? A week?

The fact that the industry collectively shrugs our shoulders and allows increasing amounts of our tech stacks to be single-vendor hostage is crazy.

palmotea

19 minutes ago

> I'd counter that past a certain scale, certainly the scale of a firm that used to & could run its own datacenter.. it's probably your responsibility to not use those services.

It's actually probably not your responsibility, it's the responsibility of some leader 5 levels up who has his head in the clouds (literally).

It's a hard problem to connect practical experience and perspectives with high-level decision-making past a certain scale.

throwawaysleep

26 minutes ago

> The fact that the industry collectively shrugs our shoulders and allows increasing amounts of our tech stacks to be single-vendor hostage is crazy.

Well, nobody is going to get blamed for this one except people at Amazon. Socially, this is treated as a tornado. You'd have to be certain you can beat AWS on reliability before doing anything about this would be good for your career.

steveBK123

13 minutes ago

In 20+ years in the industry, all my biggest outages have been AWS... and they seem to be happening annually.

Most of my on-prem days, you had more frequent but smaller failures of a database, caching service, task runner, storage, message bus, DNS, whatever.. but not all at once. Depending on how entrenched your organization is, some of these AWS outages are like having a full datacenter power down.

Might as well just log off for the day and hope for better in the morning. That assumes you could log in, which some of my ex-US colleagues could not for half the day, despite our desktops being on-prem. Someone forgot about the AWS 2FA dependency...

ferguess_k

an hour ago

I think the problems are:

1) If you try to optimize in the beginning, you tend to fall into the over-optimization/engineering camp;

2) If you just let things go organically, you tend to fall into the big messy camp;

So the ideal way is to re-examine from time to time and re-architect once the need arises. But few companies can afford that, unfortunately.

manishsharan

2 hours ago

You mean a multi-cloud strategy! You wanna know how you got here?

See, the sales team from Google flew one executive out to the NBA Finals, the Azure sales team flew another executive out to the Super Bowl, and the AWS team flew yet another executive out to the Wimbledon finals. And that's how you end up with a multi-cloud strategy.

kevstev

2 hours ago

Eh, businesses want to stay resilient to a single vendor going down. My least favorite question in interviews this past year was around multi-cloud, because IMHO it just isn't worth it: the increased complexity, trying to map like-for-like services across clouds that aren't really the same, and then the ongoing costs of chaos-monkeying and testing that it all actually works, especially in the face of a partial outage like this vs. something "easy" like a complete loss of network connectivity... but that is almost certainly not what CEOs want to hear (mostly who I am dealing with here, going for VPE- or CTO-level jobs).

I couldn't care less about having more vendor dinners when I know I am promising a falsehood that is extremely expensive and likely going to cost me my job or my credibility at some point.

ibejoeb

an hour ago

In this particular case, it was resume-oriented architecture (ROAr!). The original team really wanted to use all the hottest new tech. Management was actually rather unhappy, so the job was to pare that down to something more reliable.

avs733

2 hours ago

I'll bet there are a large number of systems that depend on multiple cloud platforms being up without even knowing it. They run on AWS but rely on a tool from someone else that runs on GCP or Azure, and they haven't tested what happens if that tool goes down...

Common Cause Failures and false redundancy are just all over the place.

toephu2

7 minutes ago

Half the internet goes down because part of AWS goes down... what happened to companies having redundant systems and not having a single point of failure?

jppope

4 minutes ago

Ironically, for most companies it's cheaper to just say "if AWS goes down, half of the internet goes down", so people will understand.

Waterluvian

an hour ago

I know there's a lot of anecdotal evidence and some fairly clear explanations for why `us-east-1` can be less reliable. But are there any empirical studies that demonstrate this? Like if I wanted to back up this assumption/claim with data, is there a good link for that, showing that us-east-1 is down a lot more often?

everfrustrated

12 minutes ago

The unreliability claim is driven by two factors.

1. When AWS deploys changes, they run through a pipeline that pushes the change to regions one at a time. Most services start with us-east-1.

2. us-east-1 is MASSIVE, considerably larger than the next-largest region. There are no public numbers, but I wouldn't be surprised if it held 50% of their global capacity. An outage in any other region never hits the news.

glemmaPaul

9 hours ago

LOL. Make one DB service a central point of failure, charge gold for small compute instances, rage about needing multi-AZ while pushing the costs onto the developer/organization. And now it fails at the region level, so are we going to need multi-country setups for simple small applications?

DrScientist

8 hours ago

According to their status page, the fault was in DNS lookup for the DynamoDB service endpoints.

Everything depends on DNS....

mlrtime

7 hours ago

Dynamo had an outage last year, if I recall correctly.

xtracto

4 hours ago

Lol ... of course it's DNS fault again.

KettleLaugh

8 hours ago

We may be distributed, but we die united...

hangsi

7 hours ago

Divided we stand,

United we fall.

glemmaPaul

7 hours ago

AWS Communist Cloud

nyrp

6 hours ago

>circa 2005: Score:5, Funny on Slashdot

>circa 2025: grayed out on Hacker News

Hamuko

8 hours ago

I thought it was a pretty well-known issue that the rest of AWS depends on us-east-1 working. Basically any other AWS region can get hit by a meteor without bringing down everything else – except us-east-1.

philipallstar

8 hours ago

Just don't buy it if you don't want it. No one is forced to buy this stuff.

benterix

8 hours ago

> No one is forced to buy this stuff.

Actually, many companies are de facto forced to do that, for various reasons.

philipallstar

8 hours ago

How so?

jacquesm

8 hours ago

Certification, for one. Governments will mandate 'x, y and/or z' and only the big providers are able to deliver.

mlrtime

7 hours ago

That is not the same as mandating AWS; it just means certain levels of redundancy. There are no requirements to be in the cloud.

jacquesm

6 hours ago

No, that's not what it means.

It means that in order to be certified you have to use providers that are in turn certified, or you will have to prove that you have all of your ducks in a row. That goes way beyond certain levels of redundancy, to the point that most companies just give up and use a cloud solution because they have enough headaches just getting their internal processes aligned with various certification requirements.

Medical, banking, and insurance, to name just a few, are heavily regulated, and to suggest that it 'just means certain levels of redundancy' is a very uninformed take.

philipallstar

4 hours ago

It is definitely not true that only big companies can do this. It is true that every regulation added adds to the power of big companies, which explains some regulation, but it is definitely possible to do a lot of things yourself and show evidence that you've done them.

What's more likely, for medical at least, is that if you make your own app, your customers will want to install it into their own AWS/Azure accounts, and so you have to support that.

63stack

7 hours ago

Security/compliance theater for one

philipallstar

7 hours ago

That's not a company being forced to, though?

aembleton

7 hours ago

It is if they want to win contracts

philipallstar

6 hours ago

I don't think that's true. I think a company can choose to outsource that stuff to a cloud provider or not, but they can still choose.

esskay

2 hours ago

Er... they appear to have just gone down again.

jrochkind1

an hour ago

My systems didn't actually seem to be affected until what I think was probably a SECOND spike of outages at about the time you posted.

The retrospective will be very interesting reading!

(Obviously the category of outages caused by many restored systems "thundering" at once to get back up is known, so that'd be my guess, but the details are always good reading either way).

indoordin0saur

an hour ago

Mine are more messed up now (12:30 ET) than they were this morning. AWS is lying that they've fixed the issue.

lexandstuff

2 hours ago

Yep. I don't think they ever fully recovered, and the status page is still reporting a lot of issues.

kalleboo

9 hours ago

It's fun watching their list of "Affected Services" grow literally in front of your eyes as they figure out how many things have this dependency.

It's still missing the one that earned me a phone call from a client.

zenexer

9 hours ago

It's seemingly everything. SES was the first one that I noticed, but from what I can tell, all services are impacted.

hvb2

7 hours ago

In AWS, if you take out any one of DynamoDB, S3, or Lambda, you're going to be in a world of pain. Any architecture will likely use those somewhere, including all the other services built on top.

If the storage service in your own datacenter goes down, how much remains running?

goatking

2 hours ago

Agreed, but you can put EC2 on that list as well

mlrtime

7 hours ago

When these major issues come up, all they have is symptoms, not causes. Maybe not until the Dynamo on-call comes online and says it's down does everyone know at least the reason for their team's outage.

The scale here is so large that they don't know the complete dependency tree until teams check in on what is out or not, growing this list. Of course most of it is automated, but getting onto 'Affected Services' is not.

tonypapousek

7 hours ago

Looks like they’re nearly done fixing it.

> Oct 20 3:35 AM PDT

> The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution. Additionally, some services are continuing to work through a backlog of events such as Cloudtrail and Lambda. While most operations are recovered, requests to launch new EC2 instances (or services that launch EC2 instances such as ECS) in the US-EAST-1 Region are still experiencing increased error rates. We continue to work toward full resolution. If you are still experiencing an issue resolving the DynamoDB service endpoints in US-EAST-1, we recommend flushing your DNS caches. We will provide an update by 4:15 AM, or sooner if we have additional information to share.
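If you want to see what your resolver is currently handing back for the endpoint, a quick check (plain Python, nothing AWS-specific):

    import socket

    # An empty or failing answer for the regional endpoint suggests a
    # stale (possibly negative) cache entry that's worth flushing.
    try:
        addrs = {info[4][0] for info in socket.getaddrinfo(
            "dynamodb.us-east-1.amazonaws.com", 443)}
        print("resolves to:", sorted(addrs))
    except socket.gaierror as exc:
        print("resolution failed:", exc)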

ceroxylon

20 minutes ago

In the last hour, I have seen the number of impacted services go from 90 to 92, currently sitting at 97.

shkkmo

2 hours ago

Still not fixed and may have gotten worse...

chibea

6 hours ago

It's a bit funny that they say "most service operations are succeeding normally now" when, in fact, you cannot yet launch or terminate new EC2 instances, which is basically the defining feature of the cloud...

rswail

6 hours ago

Only in that region; other regions are able to launch EC2 and ECS/EKS workloads without a problem.

jamwil

4 hours ago

Is that material to a conversation about service uptime of existing resources, though? Are there customers out there that are churning through the full lifecycle of ephemeral EC2 instances as part of their day-to-day?

shawabawa3

3 hours ago

Any company of non-trivial scale will surely launch EC2 nodes during the day.

One of the main points of cloud computing is scaling up and down frequently.

newtwilly

2 hours ago

We spend ~$20,000 per month on AWS for the product I work on. On the average day we do not launch an EC2 instance; we do not do any dynamic scaling. However, there are many scenarios (especially during outages and such) where it would be critical for us to be able to launch a new instance (and/or stop/start an existing instance).

jamwil

3 hours ago

I understand scaling. I’m saying there is a difference in severity of several orders of magnitude between “the computers are down” and “we can’t add additional computers”.

archon810

2 hours ago

Except it just broke again.

czhu12

44 minutes ago

Our entire data stack (Databricks and Omni) is down for us as well. The nice thing is that AWS is so big and widespread that our customers are much more understanding about outages, given that it's showing up on the news.

kuon

5 hours ago

I realize that my basement servers have better uptime than AWS this year!

I think most sysadmins don't plan for an AWS outage. And economically, that makes sense.

But it makes me wonder, is sysadmin a lost art?

tredre3

2 hours ago

> But it makes me wonder, is sysadmin a lost art?

Yes. 15-20 years ago when I was still working on network-adjacent stuff I witnessed the shift to the devops movement.

To be clear, the fact that devops don't plan for AWS failures isn't an indication that they lack the sysadmin gene. Sysadmins will tell you very similar things: "X can never go down" or "it's not worth having a backup for service Y".

But deep down devops are developers who just want to get their thing running, so they'll google/serveroverflow their way into production without any desire to learn the intricacies of the underlying system. So when something breaks, they're SOL.

"Thankfully" nowadays containers and application hosting abstracts a lot of it back away. So today I'd be willing to say that devops are sufficient for small to medium companies (and dare I say more efficient?).

archon810

2 hours ago

That's not very surprising. At this point you could say that your microwave has better uptime, too; given the complexity gap with all the Amazon cloud services and infrastructure, the comparison is about as meaningful.

TheCraiggers

5 hours ago

> But it makes me wonder, is sysadmin a lost art?

I dunno, let me ask chatgpt. Hmmm, it said yes.

tripplyons

3 hours ago

ChatGPT often says yes to both a question and its inverse. People like to hear yes more than no.

ninininino

2 hours ago

You missed their point. They were making a joke about over-reliance on AI.

starman55

17 minutes ago

Looks like they have a lot of circular dependencies, e.g. EC2 depends on DynamoDB and DynamoDB depends on EC2.

twistedpair

20 minutes ago

I just saw services that had been up since 5:45 AM ET go down around 12:30 PM ET. Seems AWS has broken Lambda again in its efforts to fix things.

runako

5 hours ago

Even though us-east-1 is the region geographically closest to me, I always choose another region as default due to us-east-1 (seemingly) being more prone to these outages.

Obviously, some services are only available in us-east-1, but many applications can gain some resiliency just by making a primary home in any other region.

cyberax

an hour ago

This is the right move. Ten years ago, us-east-1 was on the order of 10x bigger than the next-largest region. That's gotten a little better now, but any scaling issues still tend to happen in us-east-1 first.

AWS has been steering people to us-east-2 for a while. For example, traffic between us-east-1 and us-east-2 costs the same as inter-AZ traffic within us-east-1.

joncrane

4 hours ago

What services are only available in us-east-1?

tom1337

4 hours ago

IAM control plane for example:

> There is one IAM control plane for all commercial AWS Regions, which is located in the US East (N. Virginia) Region. The IAM system then propagates configuration changes to the IAM data planes in every enabled AWS Region. The IAM data plane is essentially a read-only replica of the IAM control plane configuration data.

and I believe some global services (like Certificate Manager, etc.) also depend on the us-east-1 region

https://docs.aws.amazon.com/IAM/latest/UserGuide/disaster-re...

runako

3 hours ago

In addition to those listed in sibling comments, new services often roll out in us-east-1 before being made available in other regions.

I recently ran into an issue where some Bedrock functionality was available in us-east-1 but not one of the other US regions.

tomchuk

4 hours ago

IAM, Cloudfront, Route53, ACM, Billing...

nijave

2 hours ago

parts of S3 (although maybe that's better after that major outage years ago)

itqwertz

22 minutes ago

Did they try asking Claude to fix these issues? If it turns out this problem is AI-related, I'd love to see the AAR.

Aldipower

8 hours ago

My minor 2,000-user web app hosted on Hetzner still works, FYI. :-P

aembleton

7 hours ago

Right up until the DNS fails

Aldipower

7 hours ago

I am using ClouDNS, an anycast DNS provider. My hope is that they are more reliable. But yeah, it is still DNS, and it will fail. ;-)

mlrtime

7 hours ago

But how are you going to web scale it!? /s

Aldipower

6 hours ago

Web scale? It is a _web_ app, so it is already web-scaled, hehe.

Seriously, this thing already runs on 3 servers: a primary plus backup, and a secondary in another datacenter/provider at Netcup. DNS is with another anycast DNS provider, ClouDNS. Everything is still way cheaper than AWS. The database is already replicated for reads, and I could switch to sharding if necessary. I can easily scale to 5, 7, however many dedicated servers. But I do not have to right now. The primary is at 1% (sic!) load.

There really is no magic behind this. And you have to write your application in a distributable way anyway; you need to understand the concepts of statelessness, write-locking, etc. with AWS too.

jacquesm

8 hours ago

Every week or so we interview a company and ask them if they have a fall-back plan in case AWS goes down or their cloud account disappears. They always have this deer-in-the-headlights look. 'That can't happen, right?'

Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.

padjo

7 hours ago

Planning for an AWS outage is a complete waste of time and energy for most companies. Yes it does happen but very rarely to the tune of a few hours every 5-10 years. I can almost guarantee that whatever plans you have won’t get you fully operational faster than just waiting for AWS to fix it.

mlrtime

7 hours ago

>Yes it does happen but very rarely to the tune of a few hours every 5-10 years.

It is rare, but it happens at LEAST 2-3x a year. AWS us-east-1 has a major incident affecting multiple services (which affect most downstream AWS services) multiple times a year, usually never with the same root cause.

Not very many people realize that there are some services that still run only in us-east-1.

joelthelion

6 hours ago

Call it the AWS holiday. Most other companies will be down anyway. It's very likely that your company can afford to be down for a few hours, too.

chii

6 hours ago

Imagine if the electricity supplier took that stance.

DiggyJohnson

2 hours ago

The electric grid is much more important than most private sector software projects by an order of magnitude.

Catastrophic data loss or lack of disaster recovery kills companies. AWS outages do not.

umeshunni

2 hours ago

What if the electricity grid depends on some AWS service?

quaintdev

2 hours ago

That would be circular dependency.

thatguy0900

2 hours ago

It doesn't though? Weird what if

jacquesm

an hour ago

You'd be surprised. GP asks a very interesting question, and some grid infra does indeed rely on AWS; definitely not all of it, but there are some aspects that are hosted on AWS.

croes

23 minutes ago

Do you know for sure? And if it doesn't yet, you can bet someone will propose it in the future. So not a weird what-if at all.

malfist

5 hours ago

But that is the stance for a lot of electrical utilities. Sometimes weather or a car wreck takes out power, and since it's too expensive to have spares everywhere, sometimes you have to wait a few hours for a spare to be brought in.

yuliyp

4 hours ago

No, that's not the stance for electrical utilities (at least in most developed countries, including the US). The vast majority of weather events cause localized outages: the grid as a whole has redundancies built in; distribution to residential (and some industrial) customers does not. The grid expects failures of some power plants, transmission lines, etc., and can adapt with reserve power or, in very rare cases, with partial degradation (i.e. rolling blackouts). It doesn't go down fully.

crote

2 hours ago

> Sometimes weather or a car wreck takes out power

Not really? Most of the infrastructure is quite resilient and the rare outage is usually limited to a street or two, with restoration time mainly determined by the time it takes the electricians to reach the incident site. For any given address that's maybe a few hours per decade - with the most likely cause being planned maintenance. That's not a "spares are too expensive" issue, that's a "giving every home two fully independent power feeds is silly" issue.

Anything on a metro-sized level is pretty much unheard of, and will be treated as seriously as a plane crash. Such failures can essentially only be caused by systemic failure on multiple levels, as the grid is configured to survive multiple independent failures at the same time.

Comparing that to the AWS world: individual servers going down is inevitable and shouldn't come as a surprise. Everyone has redundancies, and an engineer accidentally yanking the power cables of an entire rack shouldn't even be noticeable to any customers. But an entire service going down across an entire availability zone? That should be virtually impossible, and having it happen regularly is a bit of a red flag.

roywiggins

an hour ago

I think this is right, but depending on where you live, local weather-related outages can still not-infrequently look like entire towns going dark for a couple days, not streets for hours.

(Of course that's still not the same as a big boy grid failure (Texas ice storm-sized) which are the things that utilities are meant to actively prevent ever happening.)

kelnos

2 hours ago

Fortunately nearly all services running on AWS aren't as important as the electric utility, so this argument is not particularly relevant.

And regardless, electric service all over the world goes down for minutes or hours all the time.

hnlmorg

5 hours ago

Utility companies do not have redundancy for every part of their infrastructure either. Hence why severe weather or other unexpected failures can cause loss of power, internet or even running water.

Texas has had statewide power outages. Spain and Portugal suffered near-nationwide power outages last year. Many US states are heavily reliant on the same single source for water. And remember the discussions on here about Europe's reliance on Russian gas?

Then you have the XKCD sketch about how most software products are reliant on at least one piece of open source software that is maintained by a single person as a hobby.

Nobody likes a single point of failure but often the costs associated with mitigating that are much greater than the risks of having that point of failure.

This is why "risk assessments" are a thing.

pyrale

4 hours ago

> Hence why severe weather or other unexpected failures can cause loss of power, internet or even running water.

Not all utility companies have the same policies, but all have a resiliency plan to avoid blackout that is a bit more serious than "Just run it on AWS".

hnlmorg

4 hours ago

> Not all utility companies have the same policies, but all have a resiliency plan to avoid blackout that is a bit more serious than "Just run it on AWS".

You're arguing as if "run it on AWS" was a decision that didn't undergo the same kinds of risk assessment. As someone who's had to complete such processes (and in some companies, even define them), I can assure you that nobody of any competency runs stuff on AWS complacently.

In fact running stuff with resilience in AWS isn't even as simple as "just running it in AWS". There's a whole plethora of things to consider, and each with its own costs attached. As the meme goes "one does not simply just run something on AWS"

pyrale

3 hours ago

> nobody of any competency runs stuff on AWS complacently.

I agree with this. My point is simply that we, as an industry, are not a very competent bunch when it comes to risk management; and that's especially true when compared to TSOs.

That doesn't mean nobody knows what they do in our industry or that shit never hits the fan elsewhere, but I would argue that it's an outlier behaviour, whereas it's the norm in more secure industries.

> As the meme goes "one does not simply just run something on AWS"

The meme has currency for a reason, unfortunately.

---

That being said, my original point was that utilities losing clients after a storm isn't the consequence of bad (or no) risk assessment; it's the consequence of them setting up acceptable loss thresholds depending on the likelihood of an event happening, and making sure that the network as a whole can respect these SLOs while strictly respecting safety criteria.

ahoka

5 hours ago

Isn't that basically Texas?

SecretDreams

5 hours ago

Texas is like if you ran your cloud entirely in SharePoint.

jofzar

5 hours ago

Let's not insult SharePoint like that.

It's like if you ran your cloud on an old Dell box in your closet while your parent company was offering to host it directly in AWS for free.

sgarland

4 hours ago

Also, every time your cloud went down, the parent company begged you to reconsider, explaining that all they need you to do is remove the disturbingly large cobwebs so they can migrate it. You tell them that to do so would violate your strongly-held beliefs, and when they stare at you in bewilderment, you yell “FREEDOM!” while rolling armadillos at them like they’re bowling balls.

awillen

5 hours ago

That's the wrong analogy though. We're not talking about the supplier - I'm sure Amazon is doing its damnedest to make sure that AWS isn't going down.

The right analogy is to imagine if businesses that used electricity took that stance, and they basically all do. If you're a hospital or some other business where a power outage is life or death, you plan by having backup generators. But if you're the overwhelming majority of businesses, you do absolutely nothing to ensure that you have power during a power outage, and it's fine.

energy123

7 hours ago

It happens 2-3x a year during peacetime. Tail events are not homogeneously distributed across time.

vrc

7 hours ago

Well technically AWS has never failed in wartime.

mlrtime

7 hours ago

I don't understand, peacetime?

mbreese

7 hours ago

Peacetime = When not actively under a sustained attack by a nation-state actor. The implication being, if you expect there to be a “wartime”, you should also expect AWS cloud outages to be more frequent during a wartime.

lII1lIlI11ll

4 hours ago

What about being actively attacked by multinational state or an empire? Does it count or not?

Why people keep using "nation-state" term incorrectly in HN comments is beyond me...

waisbrot

2 hours ago

I think people generally mean "state", but in the US-centric HN community that word is ambiguous and will generally be interpreted the wrong way. Maybe "sovereign state" would work?

dragonwriter

2 hours ago

As someone with a political science degree whose secondary focus was international relations: "nation-state" has a number of different definitions, and (despite the fact that dictionaries often don't include it) one of the most commonly encountered for a very long time has been "one of the principal subjects of international law, held to possess what is popularly, but somewhat inaccurately, referred to as Westphalian sovereignty". (There is a historical connection between this use and the "state roughly correlating with a single nation" sense that relates to the evolution of "Westphalian sovereignty" as a norm, but that's really neither here nor there, because the meaning would be the meaning regardless of its connection to the other meaning.)

You almost never see the definition you are referring to used except in the context of explicit comparison of different bases and compositions of states, and in practice there is very close to zero ambiguity about which sense is meant. Complaining about it is the same kind of misguided prescriptivism as complaining (also popular on HN) about the transitive use of "begs the question" because it has a different sense than the intransitive use.

phpnode

3 hours ago

It sounds more technical than “country” and is therefore better

kelipso

2 hours ago

To me it sounds more like saying regime instead of government, gives off a sense of distance and danger.

mbreese

2 hours ago

It could be a multinational state actor, but the term "nation-state" is the most commonly used, regardless of accuracy. You can argue over whether or not the term itself is accurate, but you still understood the meaning.

smaudet

6 hours ago

Don't forget stuff like natural disasters and power failures...or just a very adventurous squirrel.

AWS (over-)reliance is insane...

mrits

6 hours ago

It would make a lot more sense if "peacetime" were a typo for "peak time".

__MatrixMan__

5 hours ago

It's a different kind of outage when the government disconnects you from the internet. Happens all the time, just not yet in the US.

yla92

4 hours ago

> there are some services that still run only in us-east-1.

What are those ?

cyberax

2 hours ago

> Not very many people realize that there are some services that still run only in us-east-1.

The only ones that you're likely to encounter are IAM, Route53, and the billing console. The billing console outage for a few hours is hardly a problem. IAM and Route53 are statically stable and designed to be mostly stand-alone. They are working fine right now, btw.

During this outage, my infrastructure on AWS is working just fine, simply because it's outside of us-east-1.

Ironically, our observability provider went down.

snowwrestler

5 hours ago

I would take the opposite view, the little AWS outages are an opportunity to test your disaster recovery plan, which is worth doing even if it takes a little time.

It’s not hard to imagine events that would keep AWS dark for a long period of time, especially if you’re just in one region. The outage today was in us-east-1. Companies affected by it might want to consider at least geographic diversity, if not supplier diversity.

SteveNuts

3 hours ago

> Companies affected by it might want to consider at least geographic diversity, if not supplier diversity.

Sure, it's worth considering, but for most companies it's not going to be worth the engineering effort to architect cross-cloud services. The complexity is NOT linear.

IMO most shops should focus on testing backups (which should be at least cross-cloud, potentially on-prem of some sort) to make sure their data integrity is solid. Your data can't be recreated, everything else can be rebuilt even if it takes a long time.

davedx

7 hours ago

> I can almost guarantee that whatever plans you have won’t get you fully operational faster than just waiting for AWS to fix it.

Absurd claim.

Multi-region and multi-cloud are strategies that have existed almost as long as AWS. If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing.

padjo

6 hours ago

It’s not absurd, I’ve seen it happen. Company executes on their DR plan due to AWS outage, AWS is back before DR is complete, DR has to be aborted, service is down longer than if they’d just waited.

Of course there are cases where multi-cloud makes sense, but they are in the minority. The absurd claim is that most companies should worry about cloud outages and plan for AWS going offline forever.

afro88

7 hours ago

> If your company is in anything finance-adjacent or critical infrastructure

GP said:

> most companies

Most companies aren't finance-adjacent or critical infrastructure

philipallstar

6 hours ago

> If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing

That still fits in with "almost guarantee". It's not as though it's true for everyone, e.g. people who might trigger DR after 10 minutes of downtime, and have it up and running within 30 more minutes.

But it is true for almost everyone, as most people will trigger it after 30 minutes or more, which, plus the time to execute DR, is often going to be far less than the AWS resolution time.

Best of all would be just multi-everything services from the start, and us-east-1 is just another node, but that's expensive and tricky with state.

kelnos

2 hours ago

> If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing.

This describes, what, under 1% of companies out there?

For most companies the cost of being multi-region is much more than just accepting with the occasional outage.

ants_everywhere

6 hours ago

I thought we were talking about an AWS outage, not just the outage of a single region? A single region can go out for many reasons, including but not limited to war.

malfist

5 hours ago

I worked for a fortune 500, twice a year we practiced our "catastrophe outage" plan. The target SLA for recovering from a major cloud provider outage was 48 hours.

Waterluvian

5 hours ago

Without having a well-defined risk profile that they’re designing to satisfy, everyone’s just kind of shooting from the hip with their opinions on what’s too much or too little.

hnlmorg

5 hours ago

Exactly this!

One of my projects is entirely hosted on S3. I don't care enough if it becomes unavailable for a few hours to justify paying to distribute it to GCP et al.

And actually for most companies, the cost of multi-cloud is greater than the benefits. Particularly when those larger entities can just bitch to their AWS account manager to get a few grand refunded as credits.

AndrewThrowaway

5 hours ago

It is like discussing a zombie apocalypse. People who are invested in bunkers will hardly understand those who would just choose death over living in those bunkers for a month longer.

coffeebeqn

2 hours ago

We started that planning process at my previous company after one such outage, but it became clear very quickly that the cost of such resilience would be 2-3x our hosting costs in perpetuity, plus who knows how many man-hours. Being down for an hour was a lot more palatable to everyone.

throw0101d

4 hours ago

> Planning for an AWS outage […]

What about if your account gets deleted? Or compromised and all your instances/services deleted?

I think the idea is to be able to have things continue running on not-AWS.

rjmunro

an hour ago

This. I wouldn't try to instantly fail over to another service if AWS had a short outage, but I would plan to be able to recover from a permanent AWS outage by ensuring all your important data and knowledge is backed up off-AWS, preferably to your own physical hardware, and by having a vague plan for how to restore and bring things up again if you need to.

"Permanent AWS outage" includes someone pressing the wrong button in the AWS console and deleting something important or things like a hack or ransomware attack corrupting your data, as well as your account being banned or whatever. While it does include AWS itself going down in a big way, it's extremely unlikely that it won't come back, but if you cover other possibilities, that will probably be covered too.

maerF0x0

3 hours ago

Using AWS instead of a server in the closet is step 1.

Step 2 is multi-AZ

Step 3 is multi-region

Step 4 is multi-cloud.

Each company can work on its next step, but most will not have positive EROI going from 2 to 3+.

joshuat

2 hours ago

Multi-cloud is a hole in which you can burn money and not much more

lumost

5 hours ago

This depends on the scale of the company. A fully functional DR plan probably costs 10% of the infra spend + people time for operationalization. For most small/medium businesses it's a waste to plan for a once-per-3-10-year event. If you're a large or legacy firm, the above costs are trivial and in some cases it may become a fiduciary risk not to take it seriously.

jacquesm

5 hours ago

And if you're in a regulated industry it might even be a hard requirement.

psychoslave

6 hours ago

This is planning the future based on the best of the past. Not completely irrational, and if you can't afford a plan B, okayish.

But thinking the AWS SLA is guaranteed forever, and that everyone should put all their eggs in it because "everyone does it", is neither wise nor safe. Those who can afford it, and there are many businesses like that out there, should have a plan B. And actually AWS should not necessarily be plan A.

Nothing is forever. Not the Roman empire, not the Inca empire, not the Chinese dynasties, not US geopolitical supremacy. It's not a question of if but when. It doesn't need to happen through a lot of suffering, but if we don't systematically organise for a humanity that spreads well-being for everyone in a systematically resilient way, we will face a lot more tragic consequences when this or that single point of failure finally falls.

kxrm

5 hours ago

Completely agree, but I think companies need to be aware of the AWS risks with third parties as well. Many services were unable to communicate with customers.

Hosting your services on AWS while having a status page on AWS during an AWS outage is an easily avoidable problem.

dangoldin

3 hours ago

I worked at an adtech company where we invested a bit in HA across AZ + regions. Lo and behold there was an AWS outage and we stayed up. Too bad our customers didn't and we still took the revenue hit.

Lesson here is that your approach will depend on your industry and peers. Every market will have its own philosophy and requirements here.

pyrale

5 hours ago

What if AWS dumps you because your country/company didn't please the commander in chief enough?

If your resilience plan is to trust a third party, that means you don't really care about going down, doesn't it?

Besides that, as the above poster said, the issue with top tier cloud providers (or cloudflare, or google, etc) is not just that you rely on them, it is that enough people rely on them that you may suffer even if you don't.

Spooky23

5 hours ago

Sure, if your blog or whatever goes down, who cares. But otherwise you should be thinking about disaster planning and resilience.

AWS us-east-1 has many outages. Anything significant should account for that.

lucideer

6 hours ago

> to the tune of a few hours every 5-10 years

I presume this means you must not be working for a company running anything at scale on AWS.

skywhopper

6 hours ago

That is the vast majority of customers on AWS.

nucleardog

5 hours ago

> Planning for an AWS outage is a complete waste of time and energy for most companies. Yes it does happen but very rarely to the tune of a few hours every 5-10 years.

Not only that, but as you're seeing with this and the last few dozen outages... when us-east-1 goes down, a solid chunk of what many consumers consider the "internet" goes down. It's perceived less as "app C is down" and more as "the internet is broken today".

antihero

7 hours ago

My website running on an old laptop in my cupboard is doing just fine.

whatevaa

7 hours ago

When your laptop dies it's gonna be a pretty long outage too.

antihero

3 hours ago

I will find another one

api

7 hours ago

I have this theory of something I call “importance radiation.”

An old mini PC running a home media server and a Minecraft server will run flawlessly forever. Put something of any business importance on it and it will fail tomorrow.

Related, I’m sure, is the fact that things like furnaces and water heaters will die on holidays.

jacquesm

6 hours ago

That's a great concept. It explains a lot, actually!

YouAreWRONGtoo

5 hours ago

More like 2-3 times per year and this is not counting smaller outages or simply APIs that don't do what they document.

sgarland

4 hours ago

> APIs that don’t do what they document

Oh god, this. At my company, we found a bug recently with rds.describe_events, which we needed to read binlog information after a B/G cutover. The bug, which AWS support “could not see the details of,” was that events would non-deterministically not show up if you were filtering by instance name. Their recommended fix was to pull in all events for the past N minutes, and do client-side filtering.

This was on top of the other bug I had found earlier, which was that despite the docs stating that you can use a B/G as a filter - a logical choice when querying for information directly related to the B/G you just cut over - doing so returns an empty set. Also, you can’t use a cluster (again, despite docs stating otherwise), you have to use the new cluster’s writer instance.

YouAreWRONGtoo

2 hours ago

While I don't know your specific case, I have seen it happen often enough that there are only two possibilities left:

  1. they are idiots 
  2. they do it on purpose and they think you are an idiot
For me, it just means that the moment you integrate with any API, you are basically their bitch (unless you implement one from every competitor in the market, at which point you can just as well do it yourself).

indoordin0saur

2 hours ago

Been doing this for about 8 years and I've worked through a serious AWS disruption at least 5 times in that time.

zaphirplane

5 hours ago

> tune of a few hours every 5-10 years

You know that's not true; us-east-1's last one was 2 years ago. But other services have bad days, and foundational ones drag others along.

coffeebeqn

2 hours ago

We don't deploy to us-east-1, but so many of our API partners and third-party services were down that a large chunk of our service was effectively down. Including stuff like many dev tools.

Esophagus4

4 hours ago

It's even worse than that - us-east-1 is so overloaded that they have roughly 5+ outages per year across different services. They don't publish outage numbers, so it's hard to tell.

At this point, being in any other region cuts your disaster exposure dramatically

sreekanth850

6 hours ago

Depends on how serious you are with SLAs.

temperceve

5 hours ago

Depends on the business. For 99% of them this is for sure the right answer.

kelseydh

6 hours ago

It seems like this can be mostly avoided by not using us-east-1.

DiffEq

5 hours ago

Maybe; but Parler had no plan and is now nothing... because AWS decided to shut them off. Always have a good plan...

delfinom

3 hours ago

In before a meteor strike takes out an AWS region and they can't restore data.

jacquesm

7 hours ago

Thank you for illustrating my point. You didn't even bother to read the second paragraph.

shawabawa3

7 hours ago

> Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.

my business contingency plan for "AWS shuts down and never comes back up" is to go bankrupt

jacquesm

7 hours ago

Is that also your contingency plan for 'user uploads objectionable content and alerts Amazon to get your account shut down'?

Make sure you let your investors know.

padjo

6 hours ago

If your mitigation for that risk is to have an elaborate plan to move to a different cloud provider, where the same problem can just happen again, then you’re doing an awful job of risk management.

jacquesm

6 hours ago

> If your mitigation for that risk is to have an elaborate plan to move to a different cloud provider, where the same problem can just happen again, then you’re doing an awful job of risk management.

Where did I say that? If I didn't say it: could you please argue in good faith. Thank you.

matsemann

6 hours ago

"Is that also your contingency plan if unrelated X happens", and "make sure your investors know" are also not exactly good faith or without snark, mind you.

I get your point, but most companies don't need Y nines of uptime, heck, many should probably not even use AWS, k8s, serverless or whatever complicated tech gives them all these problems at all, and could do with something far simpler.

jacquesm

5 hours ago

The point is, many companies do need those nines and they count on AWS to deliver and there is no backup plan if they don't. And that's the thing I take issue with, AWS is not so reliable that you no longer need backups.

padjo

4 hours ago

My experience is that very few companies actually need those 9s. A company might say they need them, but if you dig in it turns out the impact on the business of dropping a 9 (or two) is far less than the cost of developing and maintaining an elaborate multi-cloud backup plan that will both actually work when needed and be fast enough to maintain the desired availability.

Again, of course there are exceptions, but advising people in general that they should think about what happens if AWS goes offline for good seems like poor engineering to me. It’s like designing every bridge in your country to handle a tomahawk missile strike.

chanux

7 hours ago

I get you. I am with you. But aren't money/resources always a constraint on having a solid backup solution?

I guess the reason people are not doing it is that it hasn't been demonstrated to be worth it, yet!

I've got to admit though, whenever I hear about having a backup plan I think of having an apples-to-apples copy elsewhere, which is probably not wise/viable anyway. Perhaps having just enough to reach out to the service users/customers suffices.

Also I must add I am heavily influenced by a comment by Adrian Cockcroft on why going multi-cloud isn't worth it. He worked for AWS (at the time at least) so I should probably have reached for the salt dispenser.

mlrtime

7 hours ago

We all read it. AWS not coming back up is your point on not having a backup plan?

You might as well say the entire NY + DC metro loses power and "never comes back up". What is the plan around that? The person replying is correct: most companies do not have an actionable plan for AWS never coming back up.

I worked at a medium-large company and was responsible for reviewing the infrastructure BCP plan. It stated that AWS going down was a risk, and that if it happens we wait for it to come back up. (In a lot more words than that.)

ho_schi

6 hours ago

The internet is a weak infrastructure, relying on a few big cables and data centers. And through AWS and Cloudflare it has become worse. Was it ever true that the internet is resilient? I doubt it.

Resilient systems work autonomously and can synchronize - but don't need to synchronize.

    * Git is resilient.
    * Native E-Mail clients - with local storage enabled - are somewhat resilient.
    * A local package repository is somewhat resilient.
    * A local file-sharing app (not Warp/ Magic-Wormhole -> needs relay) is resilient if it uses only local WiFi or Bluetooth.
We're building weak infrastructure. A lot of stuff should work locally and only optionally use the internet.

CaptainOfCoit

6 hours ago

The internet seems resilient enough for all intents and purposes, we haven't had a global internet-wide catastrophe impacting the entire internet as far as I know, but we have gotten close to it sometimes (thanks BGP).

But the web, that's the fragile, centralized and weak point currently, and seems to be what you're referring to rather.

Maybe nitpicky, but I feel like it's important to distinguish between "the web" and "the internet".

rjmunro

an hour ago

> The internet seems resilient enough...

The word "seems" is doing a lot of heavy lifting there.

CaptainOfCoit

32 minutes ago

I don't wanna jinx anything, but yeah, seems. I can't remember a single global internet outage for the 30+ years I've been alive. But again, large services gone down, but the internet infrastructure seems to keep on going regardless.

ho_schi

4 hours ago

Sweden and the “Coop” disaster:

https://www.bbc.com/news/technology-57707530

That's because people trust and hope blindly. They believe IT is for saving money? It isn't. They coupled their cash registers to an American cloud service. They couldn't even take payment in cash.

It usually gets worse when no outages happen for some time. Because that increases blind trust.

CaptainOfCoit

4 hours ago

That a Swedish supermarket gets hit by a ransomware attack doesn't prove/disprove the overall stability of the internet, nor the fragility of the web.

jacquesm

4 hours ago

You are absolutely correct but this distinction is getting less and less important, everything is using APIs nowadays, including lots of stuff that is utterly invisible until it goes down.

bombcar

5 hours ago

The Internet was much more resilient when it was just that - an internetwork of connected networks; each of which could and did operate autonomously.

Now we have computers that shit themselves if DNS isn’t working, let alone LANs that can operate disconnected from the Internet as a whole.

And partially working, or indicating that it works when it doesn't, is usually even worse.

smaudet

5 hours ago

If you take into account the "the web" vs "the internet" distinction, as others have mentioned:

Yes the Internet has stayed stable.

The Web, as defined by a bunch of servers running complex software, probably much less so.

Just the fact that it must necessarily be more complex means that it has more failure modes...

raincole

7 hours ago

Most companies just aren't important enough to worry about "AWS never come back up." Planning for this case is just like planning for a terrorist blowing up your entire office. If you're the Pentagon sure you'd better have a plan for that. But most companies are not the Pentagon.

Frieren

6 hours ago

> Most companies just aren't important enough to worry about "AWS never come back up."

But a large enough number of "not too big to fail" companies becomes a too-big-to-fail event. Too many medium-sized companies have a total dependency on AWS or Google or Microsoft, if not several at the same time.

We live in an increasingly fragile society, one step closer to critical failure, because big tech is not regulated in the same way as other infrastructure.

raincole

6 hours ago

Well, I agree. I kinda think the AI apocalypse would not be Skynet killing us, but malware patched onto every Tesla causing a million crashes tomorrow morning.

coffeebeqn

2 hours ago

Many have a hard dependency on AWS && Google && Microsoft!

paulddraper

4 hours ago

Exactly.

And FWIW, "AWS is down"....only one region (out of 36) of AWS is down.

You can do the multi-region failover, though that's still possibly overkill for most.

rglover

3 hours ago

It would behoove a lot of devs to learn the basics of Linux sysadmin and how to setup a basic deployment with a VPS. Once you understand that, you'll realize how much of "modern infra" is really just a mix of over-reliance on AWS and throwing compute at underperforming code. Our addiction to complexity (and burning money on the illusion of infinite stability) is already and will continue to strangle us.

pmontra

6 hours ago

In the case of a customer of mine, the AWS outage manifested itself as Twilio failing to deliver SMSes. The fallback plan has been disabling the rotation of our two SMS providers and sending all messages with the remaining one. But what if the other one had something on AWS too? Or maybe both of them have something else vital on Azure, or Google Cloud, which will fail next week and stop our service. Who knows?

For small and medium-sized companies it's not easy to perform accurate due diligence.

rco8786

6 hours ago

If AWS goes down unexpectedly and never comes back up it's much more likely that we're in the middle of some enormous global conflict where day to day survival takes priority over making your app work than AWS just deciding to abandon their cloud business on a whim.

CaptainOfCoit

6 hours ago

Can also be much easier than that. Say you live in Mexico, hosting servers with AWS in the US because you have US customers. But suddenly the US government decides to place sanctions on Mexico, and US entities are no longer allowed to do business with Mexicans, so all Mexican AWS accounts get shut down.

For you as a Mexican the end result is the same: AWS went away. And considering there already is a list of countries that cannot use AWS, GitHub and a bunch of other "essential" services, it's not hard to imagine that list growing in the future.

chasd00

7 minutes ago

What's most realistic is something like a major scandal at AWS. The FBI seizes control and no bytes come in or out until the investigation is complete. A multi-year total outage, effectively.

apexalpha

6 hours ago

Or Trump decided your country does not deserve it.

bombcar

5 hours ago

Or Bezos.

dr-smooth

3 hours ago

Or Bezos selling his soul to the Orange Devil and kicking you off when the Conman-in-chief puts the squeeze on some other aspect of Bezos' business empire

JamesSwift

an hour ago

What good is jumping through extraordinary hoops to be multi cloud if docker, netlify, stripe, intercom, npm, etc all go down along with us-east-1?

fisf

an hour ago

Because you should not depend on one payment provider, and you should not pull unvendored images, packages, etc. directly into your deployment.

There is no reason to have such brittle infra.

JamesSwift

38 minutes ago

Sure, but at that point you go from bog standard to "enterprise-grade redundancy for every single point of failure", which I can assure you is more heavily engineered than many enterprises (source: see current outage). It's just not worth the manpower and dollars for the vast majority of businesses.

hvb2

7 hours ago

> The internet got its main strengths from the fact that it was completely decentralized.

Decentralized in terms of many companies making up the internet. Yes, we've seen heavy consolidation: fewer than 10 companies now make up the bulk of the internet.

The problem here isn't caused by companies choosing one cloud provider over another. It's the economies of scale leading us to a few large companies in every sector.

lentil_soup

7 hours ago

> Decentralized in terms of many companies making up the internet

Not companies: the protocols are decentralized, and at some point it was mostly non-companies. Anyone can hook up a computer and start serving requests, which was/is a radical concept. We've lost a lot, unfortunately.

hvb2

7 hours ago

No we've not lost that at all. Nobody prevents you from doing that.

We have put more and more services on fewer and fewer vendors. But that's the consolidation and cost point.

IlikeKitties

7 hours ago

> No we've not lost that at all. Nobody prevents you from doing that.

May I introduce you to our Lord and Slavemaster CGNAT?

otterley

3 hours ago

There’s more than one way to get a server on the Internet. You can pay a local data center to put your machine in one of their racks.

yupyupyups

6 hours ago

That depends on who your ISP is.

jacquesm

7 hours ago

I think one reason is that people are just bad at statistics. Chance of materialization * impact = small. Sure. Over a short enough time that's true for any kind of risk. But companies tend to live for years, decades even and sometimes longer than that. If we're going to put all of those precious eggs in one basket, as long as the basket is substantially stronger than the eggs we're fine, right? Until the day someone drops the basket. And over a long enough time span all risks eventually materialize. So we're playing this game, and usually we come out ahead.

But trust me, 10 seconds after this outage is solved everybody will have forgotten about the possibility.

hvb2

7 hours ago

Absolutely, but the cost of perfection (100% uptime in this case) is infinite.

As long as the outages are rare enough and you automatically fail over to a different region, what's the problem?

jacquesm

7 hours ago

Often simply the lack of a backup outside of the main cloud account.

hvb2

7 hours ago

Sure, but on a typical outage how likely is it that you'll have that all up and running before the outage is resolved?

And secondly, how often do you create that backup and are you willing to lose the writes since the last backup?

That backup is absolutely something people should have, but I doubt those are ever used to bring a service back up. That would be a monumental failure of your hosting provider (colo/cloud/whatever)

jacquesm

6 hours ago

> Sure, but on a typical outage how likely is it that you'll have that all up and running before the outage is resolved?

Not, but if some Amazon flunky decides to kill your account to protect the Amazon brand then you will at least survive, even if you'll lose some data.

psychoslave

5 hours ago

Well, that is exactly what resilient distributed networks are about. Not so much the technical details we implement them through, but the social relationships and the balance of political decision power.

Be it a company or a state, concentration of power that exceeds by a large margin what its purpose requires is always a sure way to spread corruption, create feedback loops around single points of failure, and buy everyone a ticket to some dystopian reality, with a level of certainty that beats anything an SLA will ever give us.

vahid4m

5 hours ago

I don't think it's worth it, but let's say I did it: what if others that I depend on don't do it? I still won't be fully functional, and only one of us has spent a bunch of money.

kbar13

2 hours ago

the correct answer for those companies is "we have it on the roadmap but for right now accept the risk"

bschne

5 hours ago

I find this hard to judge in the abstract, but I'm not quite convinced the situation for the modal company today is worse than their answer to "what if your colo rack catches fire" would have been twenty years ago.

jacquesm

5 hours ago

> "what if your colo rack catches fire"

I've actually had that.

https://www.webmasterworld.com/webmaster/3663978.htm

bschne

4 hours ago

I used to work at an SME that ran ~everything on its own colo'd hardware, and while it never got this bad, there were a couple instances of the CTO driving over to the dc because the oob access to some hung up server wasn't working anymore. Fun times...

pluto_modadic

2 hours ago

oh hey, I've bricked a server remotely and had to drive 45 minutes to the DC to get badged in and reboot things :)

freetanga

6 hours ago

Additionally, I find that most hyperscalers are trying to lock you in by tailoring industry-standard services with custom features, which end up putting down roots and making multi-vendor or lift-and-shift moves problematic.

Need to keep eyes peeled at all levels of the organization, as many of these enter through the day-to-day…

jacquesm

6 hours ago

Yes, they're really good at that. This is just 'embrace and extend'. We all know the third.

anal_reactor

6 hours ago

First, planning for an AWS outage is pointless. Unless you provide a service of national security or something, your customers are going to understand that when there's a global internet outage, your service doesn't work either. The cost of maintaining a working failover across multiple cloud providers is just too high compared to the potential benefits. It's astonishing that so few engineers understand that maintaining a technically beautiful solution costs time and money, which might not make a justified business case.

Second, preparing for the disappearance of AWS is even sillier. The chance that it will happen is orders of magnitude smaller than the chance that the cost of preparing for such an event will kill your business.

Let me ask you: how do you prepare your website for the complete collapse of western society? Will you be able to adapt your business model to a post-apocalyptic world where only the cockroaches are left?

jacquesm

6 hours ago

> Let me ask you: how do you prepare your website for the complete collapse of western society?

How did we go from 'you could lose your AWS account' to 'complete collapse of western society'? Do websites even matter in that context?

> Second, preparing for the disappearance of AWS is even more silly.

What's silly is not thinking ahead.

psychoslave

5 hours ago

>Let me ask you: how do you prepare your website for the complete collapse of western society?

That's the main topic that's been going through my mind lately, if you replace "my website" with "Wikimedia movement".

We need a far better social, juridical and technical architecture for resilience, as hostile agendas are on the rise at every level against sourced, trackable, global volunteer community knowledge bases.

Keyframe

5 hours ago

At least we've got GitHub steady with our code and IaC, right? Right?!

csomar

6 hours ago

> Now imagine for a bit that it will never come back up.

Given the current geopolitical circumstances, that's not a far fetched scenario. Especially for us-east-1; or anything in the D.C. metro area.

weberer

7 hours ago

Llama-5-beelzebub has escaped containment. A special task force has been deployed to the Virginia data center to pacify it.

suralind

23 minutes ago

I wonder how their nines are going. Guess they'll have to stay pretty stable for the next 100 years.

d_burfoot

4 hours ago

I think AWS should use, and provide as an offering to big customers, a Chaos Monkey tool that randomly brings down specific services in specific AZs. Example: DynamoDB is down in us-east-1b. IAM is down in us-west-2a.

Other AWS services should be able to survive this kind of interruption by rerouting requests to other AZs. Big company clients might also want to test against these kinds of scenarios.

davidrupp

2 hours ago

jrochkind1

an hour ago

At some point AWS has so many services it's subject to a version of xkcd Rule 34 -- if you can imagine it, there's an AWS service for it.

davidrupp

25 minutes ago

I used to tell people there that my favorite development technique was to sit down and think about the system I wanted to build, then wait for it to be announced at that year's re:Invent. I called it "re:Invent and Simplify". "I" built my best stuff that way.

bob1029

4 hours ago

One thing has become quite clear to me over the years. Much of the thinking around uptime of information systems has become hyperbolic and self-serving.

There are very few businesses that genuinely cannot handle an outage like this. The only examples I've personally experienced are payment processing and semiconductor manufacturing. A severe IT outage in either of these businesses is an actual crisis. Contrast with the South Korean government who seems largely unaffected by the recent loss of an entire building full of machines with no backups.

I've worked in a retail store that had a total electricity outage and saw virtually no reduction in sales numbers for the day. I have seen a bank operate with a broken core system for weeks. I have never heard of someone actually cancelling a subscription over a transient outage in YouTube, Spotify, Netflix, Steam, etc.

The takeaway I always have from these events is that you should engineer your business to be resilient to the real tradeoff that AWS offers. If you don't overreact to the occasional outage and have reasonable measures to work around it for a day or two, it's almost certainly easier and cheaper than building a multi-cloud complexity hellscape or dragging it all back on-prem.

Thinking in terms of competition and game theory, you'll probably win even if your competitor has a perfect failover strategy. The cost of maintaining a flawless eject button for an entire cloud is like an anvil around your neck. Every IT decision has to be filtered through this axis. When you can just slap another EC2 on the pile, you can run laps around your peers.

vidarh

4 hours ago

> The takeaway I always have from these events is that you should engineer your business to be resilient

An enduring image that stays with me was when I was a child and the local supermarket lost electricity. Within seconds the people working the tills had pulled out hand cranks by which the tills could be operated.

I'm getting old, but this was the 1980's, not the 1800's.

In other words, to agree with your point about resilience:

A lot of the time even some really janky fallbacks will be enough.

But to somewhat disagree with your apparent support for AWS: While it is true this attitude means you can deal with AWS falling over now and again, it also strips away one of the main reasons people tend to give me for why they're in AWS in the first place - namely a belief in buying peace of mind and less devops complexity (a belief I'd argue is pure fiction, but that's a separate issue). If you accept that you in fact can survive just fine without absurd levels of uptime, you also gain a lot more flexibility in which options are viable to you.

The cost of maintaining a flawless eject button is indeed high, but so is the cost of picking a provider based on the notion that you don't need an eject button if you're with them, a misplaced belief in the availability they can provide, rather than choosing based on how cost-effectively they can deliver what you actually need.

jskopek

2 hours ago

I would argue that you are still buying peace of mind by hosting on AWS, even when there are outages. This outage is front page news around the world, so it's not as much of a shock if your company's website goes down at the same time.

alpinisme

an hour ago

Some of the peace of mind comes just from knowing it’s someone else’s (technical) problem if the system goes down. And someone else’s problem to monitor the health of it. (Yes, we still have to monitor and fix all sorts of things related to how we’ve built our products, but there’s a nontrivial amount of stuff that is entirely the responsibility of AWS)

jrochkind1

2 hours ago

The cranked tills (or registers, for the Americans) are an interesting example, because it seems safe to assume stores don't have that equipment anymore and could not so easily do the same today.

We have become much more reliant on digital tech (those hand-cranked tills were probably not digital even when the electricity was on), and much less resilient to outages of such tech, I think.

jezzamon

3 hours ago

Tech companies, and in particular ad-driven companies, keep a very close eye on their metrics and can fairly accurately measure the cost of an outage in real dollars

rdm_blackhole

5 minutes ago

My app deployed on Vercel, and therefore indirectly on us-east-1, was down for about 2 hours today, then came back up, and then went down again 10 minutes ago for 2 or 3 minutes. It seems like there are still intermittent issues happening.

motbus3

23 minutes ago

Always a lovely Monday when you wake just in time to see everything going down

aaronbrethorst

42 minutes ago

My ISP's DNS servers were inaccessible this morning. Cloudflare and Google's DNS servers have all been working fine, though: 1.1.1.1, 1.0.0.1, and 8.8.8.8

ecommerceguy

22 minutes ago

Just tried to get into Seller Central, returned a 504.

nextaccountic

7 hours ago

Is this why reddit is down? (https://www.redditstatus.com/ still says it is up but with degraded infrastructure)

krowek

7 hours ago

Shameless of them to make it look like it's a user problem. It was loading fine for me an hour ago; now I refresh the page and their message states I'm making too many requests and should chill out (1 request per hour is too many for you?).

etothet

6 hours ago

Never ascribe to malice that which is adequately explained by incompetence.

It’s likely that, like many organizations, this scenario isn’t something Reddit are well prepared for in terms of correct error messaging.

anal_reactor

6 hours ago

I remember I made a website and then got a report that it didn't work on the newest Safari. Obviously, Safari would crash with a message blaming the website. Bro, no website should ever make your shitty browser outright crash.

balder1991

4 hours ago

Actually I’m just thinking that knowledge about how to crash Safari is valuable.

ryanchants

3 hours ago

Could be a bunch of reddit bots on AWS are now catching back up as AWS recovers and spiking hits to reddit

kaptainscarlet

7 hours ago

I got a rate limit error which didn't make sense since it was my first time opening reddit in hours.

igleria

9 hours ago

funny that even if we have our app running fine in AWS europe, we are affected as developers because of npm/docker/etc being down. oh well.

dijit

8 hours ago

AWS has made the internet into a single point of failure.

What's the point of all the auto-healing node-graph systems that were designed in the 70s and refined over decades, if we're just going to do mainframe development anyway?

voidUpdate

7 hours ago

To be fair, there is another point of failure, Cloudflare. It seems like half the internet goes down when Cloudflare has one of their moments

polaris64

9 hours ago

It looks like DNS has been restored: dynamodb.us-east-1.amazonaws.com. 5 IN A 3.218.182.189

miyuru

8 hours ago

I wonder if the new endpoint was affected as well.

dynamodb.us-east-1.api.aws

pardner

an hour ago

Darn, on Heroku even the "maintenance mode" (redirects all routes to a static url) won't kick in.

sineausr931

an hour ago

On a bright note, Alexa has stopped pushing me merchandise.

redeux

an hour ago

It’s a good day to be a DR software company or consultant

Liftyee

an hour ago

Damn. This is why Duolingo isn't working properly right now.

shinycode

7 hours ago

It’s that period of the year when we discover AWS clients that don’t have fallback plans

solatic

13 minutes ago

And yet, AMZN is up for the day. The market doesn't care. Crazy.

DanHulton

4 hours ago

I forget where I read it originally, but I strongly feel that AWS should offer a `us-chaos-1` region, where every 3-4 days, one or two services blow up. Host your staging stack there and you build real resiliency over time.

(The counter joke is, of course, "but that's `us-east-1` already!" But I mean deliberately and frequently.)

thomas_witt

9 hours ago

Seems to be really only in us-east-1, DynamoDB is performing fine in production on eu-central-1.

htrp

an hour ago

thundering herd problems.... every time they say they fix it something else breaks

karel-3d

5 hours ago

Slack was down, so I thought I'd send a message to my coworkers on Signal.

Signal was also down.

alimbada

an hour ago

E-mail still exists...

busymom0

43 minutes ago

For me Reddit is down and also the amazon home page isn't showing any items for me.

neuroelectron

an hour ago

Sounds like a circular error: monitoring is flooding their network with metrics and logs, causing DNS to fail and produce more errors, which flood the network further. The likely root cause is something like DNS conflicts or hosts being recreated on the network. Generally this is a small amount of network traffic, but the LBs are dealing with host-address flux, causing hosts to keep colliding on addresses as they attempt to resolve to new ones that are lost to dropped packets; with so many hosts in one AZ, there's a good chance they end up with a new conflicting address.

redwood

an hour ago

Surprising and sad to see how many folks are using DynamoDB. There are more full-featured multi-cloud options that don't lock you in and don't have these single-point-of-failure problems.

And they give you a much better developer experience...

Sigh

whatsupdog

9 hours ago

I cannot log in to my AWS account. And the "my account" page on the regular Amazon website is blank on Firefox, but opens in Chrome.

Edit: I can log in to one of the AWS accounts (I have a few different ones for different companies), but my personal one, which has a ".edu" email, is not logging in.

sinpor1

4 hours ago

His influence is so great that it caused half of the internet to stop working properly.

megous

an hour ago

I didn't even notice anything was wrong today. :) Looks like we're well disconnected from the US internet infra quasi-hegemony.

jpfromlondon

8 hours ago

This will always be a risk when sharecropping.

nla

5 hours ago

I still don't know why anyone would use AWS hosting.

fogzen

2 hours ago

Great. Hope they’re down for a few more days and we can get some time off.

TrackerFF

7 hours ago

Lots of outages happening in Norway, too. So I'm guessing it is a global thing.

gritzko

8 hours ago

idiocracy_window_view.jpg

testemailfordg2

7 hours ago

Seems like we need more anti-trust cases against AWS, or to break it up; it is becoming too big. Services used in the rest of the world get impacted by issues in one region.

arielcostas

6 hours ago

But they aren't abusing their market power, are they? I mean, they are too big and should definitely be regulated, but I don't think you can argue they are much of a monopoly when others, at the very least Google, Microsoft, Oracle, Cloudflare (depending on the specific services you want) and smaller providers, can offer you the same service, often with better pricing. In the same way, we need to regulate companies like Cloudflare essentially being a MITM for ~20% of internet websites, per their 2024 report.

croemer

8 hours ago

Coinbase down as well

killingtime74

8 hours ago

Signal is down for me

miduil

8 hours ago

Yes. https://status.signal.org/

    >  Signal is experiencing technical difficulties. We are working hard to restore service as quickly as possible.
Edit: Up and running again.

BaudouinVH

8 hours ago

canva.com was down until a few minutes ago.

hubertzhang

7 hours ago

I cannot pull images from docker hub.

tosh

8 hours ago

SES and signal seem to work again

ivape

an hour ago

There are entire apps like Reddit that are still not working. What the fuck is going on?

empressplay

9 hours ago

Can't check out on Amazon.com.au, gives error page

kondro

9 hours ago

This link works fine from Australia for me.

DataDaemon

9 hours ago

But but this is a cloud, it should exist in the cloud.

martinheidegger

8 hours ago

"Designed to provide 99.999% durability and 99.999% availability." Still designed, not implemented.

xodice

7 hours ago

Major us-east-1 outages happened in 2011, 2015, 2017, 2020, 2021, 2023, and now again. I understand that us-east-1, N. VA, was the first DC, but for fuck's sake, they've had HOW LONG to finish AWS and make us-east-1 not be tied to keeping AWS up?

hvb2

7 hours ago

First, not all outages are created equal, so you cannot compare them like that.

I believe the 2021 one was especially horrific because it affected their DNS service (Route 53) and made writes to that service impossible. That broke failovers and so on, so their prescribed multi-region setups didn't work.

But in the end, some things will have to synchronize their writes somewhere, right? So for DNS I can see how that ends up in a single region.

AWS is bound by the same rules as everyone else in the end... The only thing they have going for them is that they have a lot of money to make certain services resilient, but I'm not aware of a single system that's resilient to everything.

xodice

7 hours ago

If AWS fully decentralized its control planes, they'd essentially be duplicating the cost structure of running multiple independent clouds, and I understand that is why they don't. However, as long as AWS is reliant upon us-east-1 to function, they have not achieved what they claim, in my view. A single point of failure for IAM? Nah, no thanks.

Every AWS “global” service, be it IAM, STS, CloudFormation, CloudFront, Route 53, or Organizations, has deep ties to control systems originally built only in us-east-1/N. Virginia.

That's poor design, after all these years. They've had time to fix this.

Until AWS fully decouples the control plane from us-east-1, the entire platform has a global dependency. Even if your data plane is fine, you still rely on IAM and STS for authentication, maybe Route 53 for DNS or failover, and CloudFormation or ECS for orchestration...

If any of those choke because us-east-1’s internal control systems are degraded, you’re fucked. That’s not true regional independence.

hvb2

7 hours ago

You can only decentralize your control plane if you don't have conflicting requirements.

Assuming you cannot alter requirements or SLAs, I can see how their technical options are limited. It's possible, just not without breaking their promises. At that point it's no longer a technical problem.

xodice

7 hours ago

In the narrow distributed-systems sense? Yes, however those requirements are self-imposed. AWS chose strong global consistency for IAM and billing... they could loosen it at enormous expense.

The control plane must know the truth about your account and that truth must be globally consistent. That’s where the trouble starts I guess.

I think my old-school sysadmin ethos is just different from theirs. It's not about who's wrong or right, just a difference in opinions on how it should be done, I guess.

The ISP I work for requires us to design so that no single DC is a point of failure. It's just a difference in design methods, and I have to remember that the DC I work in is used completely differently than AWS's.

In the end, however, I know solutions for this exist (federated ledgers, CRDT-based control planes, regional autonomy), but they're expensive and they don't look good on quarterly slides. It just takes the almighty dollar to implement, and that goes against big business; if it "works", it works, I guess.

AWS's model scales to millions of accounts because it hides complexity, sure, but the same philosophy that enables that scale prevents true decentralization. That is shit. I guess people can architect as if us-east-1 can disappear so that things can continue on, but then that's AWS causing complexity in your code. They are just shifting who shoulders that little-known issue.

nemo44x

5 hours ago

Someone’s got a case of the Monday’s.

avi_vallarapu

3 hours ago

This is the reason why it is important to plan disaster recovery and multi-cloud architectures.

Our applications and databases must have ultra-high availability. It can be achieved with applications and data platforms hosted in different regions for failover.

Critical businesses should also plan for replication across multiple cloud platforms. You may use some of the existing solutions out there that can help with such implementations for data platforms.

- Qlik Replicate
- HexaRocket

and some more.

Or rather implement native replication solutions available with data platforms.