AWS multiple services outage in us-east-1

2246 points, posted 4 months ago
by kondro

826 Comments

time0ut

4 months ago

Interesting day. I've been on an incident bridge since 3AM. Our systems have mostly recovered now with a few back office stragglers fighting for compute.

The biggest miss on our side is that, although we designed a multi-region capable application, we could not run the failover process because our security org migrated us to Identity Center and only put it in us-east-1, hard locking the entire company out of the AWS control plane. By the time we'd gotten the root credentials out of the vault, things were coming back up.

Good reminder that you are only as strong as your weakest link.

SOLAR_FIELDS

4 months ago

This reminds me of the time that Google’s Paris data center flooded and caught fire a few years ago. We weren’t actually hosting compute there, but we were hosting compute in a nearby AWS EU datacenter, and it just so happened that the DNS resolver for our Google services elsewhere was hosted in Paris (or more accurately it routed to Paris first because it was the closest). The temp fix was pretty fun: that was the day I found out that the /etc/hosts of deployments can easily be modified globally in Kubernetes, AND that it was compelling enough to actually want to do that. Normally you would never want an /etc/hosts entry controlling routing in kube like this, but this temporary kludge shim was the perfect level of abstraction for the problem at hand.
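For reference, a minimal sketch of what that kind of kludge can look like: Kubernetes exposes the override as `hostAliases` on the pod spec, which it writes into each pod's /etc/hosts. This assumes the official Kubernetes Python client; the deployment name, namespace, hostname, and IP below are placeholders, not the ones from the incident.

```python
# Hypothetical sketch: pin a hostname to a known-good IP across a Deployment's pods
# by adding a hostAliases entry (Kubernetes writes it into each pod's /etc/hosts).
# Deployment name, namespace, hostname, and IP are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "hostAliases": [
                    {"ip": "203.0.113.10", "hostnames": ["api.example.com"]}
                ]
            }
        }
    }
}

# Patching the pod template triggers a rolling restart, so every new pod
# comes up with the override already baked into /etc/hosts.
apps.patch_namespaced_deployment(name="my-service", namespace="default", body=patch)
```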

citizenpaul

4 months ago

> temporary kludge shim was the perfect level of abstraction for the problem at hand.

That's some nice manager deactivating jargon.

LPisGood

4 months ago

Manager deactivating jargon is a great phrase - it’s broadly applicable and also specific.

SOLAR_FIELDS

4 months ago

Yeah, that sentence betrays my BigCorp experience; it's pulling from the corporate bullshit generator for sure.

jordanb

4 months ago

Couldn't you just patch your coredns deployment to specify different forwarders?

SOLAR_FIELDS

4 months ago

Probably. This was years ago so the details have faded but I do recall that we did weigh about 6 different valid approaches of varying complexity in the war room before deciding this /etc/hosts hack was the right approach for our situation
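For anyone curious, the CoreDNS route mentioned above would look roughly like the sketch below, assuming the stock kube-system/coredns ConfigMap with the usual "forward . /etc/resolv.conf" line; the resolver IPs are placeholders.

```python
# Rough sketch of the CoreDNS alternative: point the cluster's upstream resolvers
# somewhere else by editing the Corefile in the kube-system/coredns ConfigMap.
# Resolver IPs are placeholders; the exact Corefile layout varies by distribution.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

cm = core.read_namespaced_config_map(name="coredns", namespace="kube-system")
corefile = cm.data["Corefile"]

# Swap the default upstream (the node's /etc/resolv.conf) for explicit resolvers.
corefile = corefile.replace(
    "forward . /etc/resolv.conf",
    "forward . 198.51.100.53 198.51.100.54",
)

core.patch_namespaced_config_map(
    name="coredns",
    namespace="kube-system",
    body={"data": {"Corefile": corefile}},
)
# CoreDNS typically watches and reloads its ConfigMap, though a rollout restart
# of the coredns Deployment forces the change immediately.
```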

nahumba

4 months ago

This is the end of the thread for the first comment. The second comment can be found below.

1970-01-01

4 months ago

I remember Facebook had a similar story when they botched their BGP update and couldn't even access the vault. If you have circular auth, you don't have anything when somebody breaks DNS.

crote

4 months ago

Wasn't there an issue where they required physical access to the data center to fix the network, which meant having to tap in with a keycard to get in, which didn't work because the keycard server was down, due to the network being down?

kylecazar

4 months ago

Wishful thinking, but I hope an engineer somewhere got to ram a door down to fix a global outage. For the stories.

jedberg

4 months ago

Way back when I worked at eBay, we once had a major outage and needed datacenter access. The datacenter process normally took about 5 minutes per person to verify identity and employment, and then scan past the biometric scanners.

On that day, the VP showed up and told the security staff, "just open all the doors!". So they did. If you knew where the datacenter was, you could just walk in and mess with eBay servers. But since we were still a small ops team, we pretty much knew everyone who was supposed to be there. So security was basically "does someone else recognize you?".

terminalshort

4 months ago

> So security was basically "does someone else recognize you?"

I actually can't think of a more secure protocol. Doesn't scale, though.

goatking

4 months ago

Well, you put a lot of trust in the individuals in this case. A disgruntled employee can just let the bad guys in on purpose, saying "Yes they belong here".

terminalshort

4 months ago

That works until they run into a second person. In a big corp where people don't recognize each other you can also let the bad guys in, and once they're in nobody thinks twice about it.

0x457

4 months ago

Vulnerable to byzantine fault.

ermis

4 months ago

Or it could be some Trojan horse situation, maybe.

peterbecich

4 months ago

I would imagine this is how it works for the President and Cabinet

chasd00

4 months ago

Way back when DCs were secure but not _that secure_, I social engineered my way close enough to our rack without ID to hit a reset button before getting thrown out.

/those were the days

bandrami

4 months ago

Oh I've definitely done that. They had remote hands but we were over our rack limit and we didn't want them to see inside.

The early oughts were a different time.

jedberg

4 months ago

Just to test the security, or...?

chasd00

4 months ago

Late reply, but no, I really needed to hit the button but didn't have valid ID at the time. My driver's license was expired and I couldn't get it renewed because of outstanding tickets, IIRC. I was able to talk my way in, and I had been there many times before, so I knew my way around and what words to say. I was able to do what I needed before another admin came up and told me that without valid ID they had no choice but to ask me to leave (probably an insurance thing). I was being a bit dramatic when I said "getting thrown out"; the datacenter guys were very nice and almost apologetic about asking me to leave.

UltraSane

4 months ago

I was in a datacenter when the fire alarm went off and all door locks were automatically disabled.

anthonyeden

4 months ago

Most modern commercial buildings in Australia unlock doors when the fire alarm goes off.

user

4 months ago

[deleted]

johnisgood

4 months ago

Lmao, so unauthorized access on demand by pulling the fire alarm?

chasd00

4 months ago

There's some computer lore out there about someone tripping a fire alarm by accident, or some other event, that triggered a gas system that puts out fires without water but isn't exactly compatible with life. The story goes that some poor sysadmin had to stand there with their finger on something like a pause button until the fire department showed up to disarm the system. If they released the button, the gas would flood the whole DC.

UltraSane

4 months ago

Essentially yes. They should really divide data centers into zones and only unlock doors inside a zone where smoke is detected.

chasd00

4 months ago

> They should really divide data centers into zones and only unlock doors inside a zone where smoke is detected.

just make sure the zone based door lock/unlock system isn't on AWS ;)

ectospheno

4 months ago

Because surely every smoke detector will work while the building is burning down…

UltraSane

4 months ago

most data centers are made out of concrete and isolate fires.

ectospheno

4 months ago

My point is that while the failure rate may be low the failure method is dude burns to death in a locked server room. Even classified room protocols place safety of personnel over safety of data in an emergency.

UltraSane

4 months ago

Being in a server room with FM200 fire control is the safest place to be in a fire.

folmar

4 months ago

Don't ask about the fire power switch.

user

4 months ago

[deleted]

E39M5S62

4 months ago

That sounds like an Equinix datacenter. They were painfully slow at 350 E. Cermak.

jedberg

4 months ago

It wasn't Equinix, but I think the vendor was acquired by them. I don't actually blame them, I appreciated their security procedures. The five minutes usually didn't matter.

wolpoli

4 months ago

The story was that they had to use an angle grinder to get in.

jonbiggums22

4 months ago

I remember hearing that Google, early in its history, had some sort of emergency backup codes that they encased in concrete to prevent them from becoming a casual part of the process, and that they needed a jackhammer and a couple of hours when the supposedly impossible happened after only a couple of years.

dgl

4 months ago

brazzy

4 months ago

> To their great dismay, the engineer in Australia could not open the safe because the combination was stored in the now-offline password manager.

Classic.

In my first job I worked on ATM software, and we had a big basement room full of ATMs for test purposes. The part the money is stored in is a modified safe, usually with a traditional dial lock. On the inside of one of them I saw the instructions on how to change the combination. The final instruction was: "Write down the combination and store it safely", then printed in bold: "Not inside the safe!"

gofreddygo

3 months ago

> It took an additional hour for the team to realize that the green light on the smart card reader did not, in fact, indicate that the card had been inserted correctly. When the engineers flipped the card over, the service restarted and the outage ended.

awesome !

paranoidrobot

4 months ago

That's a wonderful read, thanks for that.

prepend

4 months ago

This is how John Wick did it. He buried his gold and weapons in his garage and poured concrete over it.

selcuka

4 months ago

It only worked for Wick because he is a man of focus, commitment, and sheer will.

jamiek88

4 months ago

He’s not the bogey man. He’s the one you send to kill the fucking bogeyman.

Hooked from that moment! The series got progressively more ridiculous but what a start!

philipallstar

4 months ago

The bulletproof suits were very stylish though! So much fun.

6510

4 months ago

This is the way.

There is a video from the LockPickingLawyer where he receives a padlock in the mail wrapped in so much tape that it takes him whole minutes to unpack it.

Concrete is nice, other options are piles of soil or brick in front of the door. There probably is a sweet spot where enough concrete slows down an excavator and enough bricks mixed in the soil slows down the shovel. Extra points if there is no place nearby to dump the rubble.

jasonwatkinspdx

4 months ago

Probably one of those lost in translation or gradual exaggeration stories.

If you just wanted recovery keys that were secure from being used in an ordinary way, you could use Shamir secret sharing to split the key over a couple of hard copies stored in safety deposit boxes at a couple of different locations.
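For anyone curious what that looks like mechanically, here is a toy sketch of Shamir's (k, n) scheme over a prime field. Illustration only — for real recovery keys use a vetted implementation rather than hand-rolled crypto.

```python
# Toy sketch of Shamir's (k, n) secret sharing over a prime field: any k of the
# n shares reconstruct the secret; fewer reveal nothing. Illustration only.
import secrets

PRIME = 2**127 - 1  # a Mersenne prime, large enough for a 16-byte secret


def split(secret: int, k: int, n: int):
    """Split `secret` into n shares, any k of which reconstruct it."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]

    def poly(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME

    return [(x, poly(x)) for x in range(1, n + 1)]


def combine(shares):
    """Lagrange interpolation at x=0 recovers the constant term (the secret)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


if __name__ == "__main__":
    key = secrets.randbelow(PRIME)   # stand-in for a recovery key
    shares = split(key, k=3, n=5)    # e.g. 5 safety deposit boxes, any 3 suffice
    assert combine(shares[:3]) == key
    assert combine(shares[1:4]) == key
```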

hshdhdhehd

4 months ago

Louvre gang decides they can make more money contracting to AWS.

SoftTalker

4 months ago

The Data center I’m familiar with uses cards and biometrics but every door also has a standard key override. Not sure who opens the safe with the keys but that’s the fallback in case the electronic locks fail.

bombcar

4 months ago

I prefer to use a sawzall and just go through the wall.

adrianmonk

4 months ago

The memory is hazy since it was 15+ years ago, but I'm fairly sure I knew someone who worked at a company whose servers were stolen this way.

The thieves had access to the office building but not the server room. They realized the server room shared a wall with a room that they did have access to, so they just used a sawzall to make an additional entrance.

chasd00

4 months ago

My across-the-street neighbor had some expensive bikes stolen this way. The thieves just cut a hole in the side of their garage from the alley; the security cameras were facing the driveway, with nothing on the alley side. We (the neighborhood) think they were targeted specifically for the bikes, as nothing else was stolen and your average crackhead isn't going to make that level of effort.

oblio

4 months ago

That would be a sawswall, in that case.

bluGill

4 months ago

I assume they needed their own air supply because the automatic poison gas system was activating. Then they had to dodge lasers to get to the one button that would stop the nuclear missile launch.

Add a bunch of other pointless sci-fi and evil villain lair tropes in as well...

donalhunt

4 months ago

Most datacenters are fairly boring to be honest. The most exciting thing likely to happen is some sheet metal ripping your hand open because you didn't wear gloves.

Still have my "my other datacenter is made of razorblades and hate" sticker. \o/

formerly_proven

4 months ago

They do commonly have poisonous gas though.

christkv

4 months ago

I had a summer job at a hospital one year, in the data center, when an electrician managed to trigger the halon system and we all had to evacuate and wait for the process to finish and the gas to vent. The four firetrucks and the station master who showed up were both annoyed and relieved that it was not real.

maaaaattttt

4 months ago

Not sure if you’re joking but a relatively small datacenter I’m familiar with has reduced oxygen in it to prevent fires. If you were to break in unannounced you would faint or maybe worse (?).

bob778

4 months ago

Not quite - while you can reduce oxygen levels, they have to be kept within 4 percentage points, so at worst it will make you light-headed. Many athletes train at the same levels, though, so it's easy to overcome.

waste_monk

4 months ago

That'd make for a decent heist comedy - a bunch of former professional athletes get hired to break in to a low-oxygen data center, but the plan goes wrong and they have to use their sports skills in improbable ways to pull it off.

mrgoldenbrown

4 months ago

Halon was used back in the day for fire suppression but I thought it was only dangerous at high enough concentrations to suffocate you by displacing oxygen.

UltraSane

4 months ago

No, FM200 isn't poisonous.

bdangubic

4 months ago

tell that to my dead uncle jack :)

ArnoVW

4 months ago

And lasers come to think of it

tacticus

4 months ago

there are datacentres not made of razorblades and hate?

lazide

4 months ago

Not an active datacenter, but I did get to use a fire extinguisher to knock out a metal-mesh-reinforced window in a secure building once because no one knew where the keys were for an important room.

Management was not happy, but I didn’t get in trouble for it. And yes, it was awesome. Surprisingly easy, especially since the fire extinguisher was literally right next to it.

geephroh

4 months ago

Sometimes a little good old fashioned mayhem is good for employee morale

lazide

4 months ago

Every good firefighter knows this feeling.

Nothing says ‘go ahead, destroy that shit’ like money going up in smoke if you don’t.

P.S. Don't park in front of fire hydrants, because they will have a shit-eating grin on their face when they destroy your car - ahem - clear the obstacle - when they need to use it to stop a fire.

lenerdenator

4 months ago

Not to speak for the other poster, but yes, they had people experiencing difficulties getting into the data centers to fix the problems.

I remember seeing a meme for a cover of "Meta Data Center Simulator 2021" where hands were holding an angle grinder with rows of server racks in the background.

"Meta Data Center Simulator 2021: As Real As It Gets (TM)"

UltraSane

4 months ago

Yes, for some insane reason Facebook had EVERYTHING on a single network. The door access not working when you lose BGP routes is especially bad, because normal door access systems cache access rules on the local door controllers and thus still work when they lose connectivity to the central server.

holowoodman

4 months ago

Depends. Some have a paranoid mode without caching, because then a physical attacker can't snip a cable and then use a stolen keycard as easily, or something. We had an audit force us to disable caching, which promptly went south at a power outage 2 months later when the electricians couldn't get into the switch room anymore. The door was easy to overcome, however - just a little fiddling with a credit card, no heroic hydraulic press story ;)

jordanb

4 months ago

Auditors made you disable credential caching but missed the door that could be shimmed open..

AbstractH24

4 months ago

Sounds like they earned their fee!

UltraSane

4 months ago

If you aren't going to cache locally, then you need redundant access to the server, like LTE access, and a plan for unlocking the doors if you lose access to the server.

avidphantasm

4 months ago

This sounds similar to AWS services depending on DynamoDB, which sounds like what happened here. Even if, under the hood, parts of AWS depend on Dynamo, it should be a walled-off instance separate from the Dynamo available via us-east-1.

UltraSane

4 months ago

There should be many more smaller instances with smaller blast radius.

junon

4 months ago

Yep. And their internal comms were on the same server if memory serves. They were also down.

simplyluke

4 months ago

I was there at the time, for anyone outside of the core networking teams it was functionally a snow day. I had my manager's phone number, and basically established that everyone was in the same boat and went to the park.

Core services teams had backup communication systems in place prior to that though. IIRC it was a private IRC on separate infra specifically for that type of scenario.

prmoustache

4 months ago

I remember working for a company that insisted all teams had to use whatever corp instant messaging/chat app, but our sysadmin + network team maintained a Jabber server, plus a bunch of core documentation synchronized to a VPS on totally different infrastructure, just in case. Sure enough, there came a day when it was handy.

DevelopingElk

4 months ago

AWS, for the ultimate backup, relies on a phone call bridge on the public phone network.

gregw2

4 months ago

Ah, but have they verified how far down the turtles go, and has that changed since they verified it?

In the mid-2000s most of the conference call traffic started leaving copper T1s and going onto fiber and/or SIP switches managed by Level3, Global Crossing, Qwest, etc. Those companies combined over time into CenturyLink, which was then rebranded as Lumen.

As of last October, Lumen is now starting to integrate more closely with AWS, managing their network with AWS's AI: https://convergedigest.com/lumen-expands-fiber-network-to-su...

"Oh what a tangled web we weave..."

wbl

4 months ago

I once suggested at work that we keep a list of diesel distributors near our datacenters whose payment infra doesn't run on us.

junon

4 months ago

Thanks for the correction, that sounds right. I thought I had remembered IRC but wasn't sure.

bcrl

4 months ago

That's similar to the total outage of all Rogers services in Canada back on July 7th 2022. It was compounded by the fact that the outage took out all Rogers cell phone service, making it impossible for Rogers employees to communicate with each other during the outage. A unified network means a unified failure mode.

Thankfully none of my 10 Gbps wavelengths were impacted. Oh did I appreciate my aversion to >= layer 2 services in my transport network!

YokoZar

4 months ago

That's kind of a weird ops story, since SRE 101 for oncall is to not rely on the system you're oncall for to resolve outages in it. This means if you're oncall for communications of some kind, you must have some other independent means of reaching each other (even if it's a competitor's phone network).

bcrl

4 months ago

That is heavily contingent on the assumption that the dependencies between services are well documented and understood by the people building the systems.

YokoZar

4 months ago

Are you asserting that Rogers employees needed documentation to know that Rogers Wireless runs on Rogers systems?

bcrl

4 months ago

Rogers is perhaps best described as a confederacy of independent acquisitions. In working with their sales team, I have had to tell them where their facilities are, as the sales engineers don't always know about all of the assets that Rogers owns.

There's also the insistence that Rogers employees should use Rogers services. Paying for every Rogers employee to have a Bell cell phone would not sit well with their executives.

The fact that the risk assessments of the changes being made to the router configuration were incorrect also contributed to the outage.

ttul

4 months ago

There is always that point you reach where someone has to get on a plane with their hardware token and fly to another data centre to reset the thing that maintains the thing that gives keys to the thing that makes the whole world go round.

beefnugs

4 months ago

So sick of billion dollar companies not hiring that one more guy

ttul

4 months ago

That is perhaps why they are billion dollar companies and why my company is not very successful.

vladvasiliu

4 months ago

> Identity Center and only put it in us-east-1

Is it possible to have it in multiple regions? Last I checked, it only accepted one region. You needed to remove it first if you wanted to move it.

raverbashing

4 months ago

Security people and ignoring resiliency and failure modes: a tale as old as time

AndrewKemendo

4 months ago

Correct. That does make it a centralized failure mode and everyone is in the same boat on that.

I’m unaware of any common and popular distributed IDAM that is reliable

fheisler

4 months ago

Not sure if this counts fully as 'distributed' here, but we (Authentik Security) help many companies self-host authentik multi-region or in (private cloud + on-prem) to allow for quick IAM failover and more reliability than IAMaaS.

There's also "identity orchestration" tools like Strata that let you use multiple IdPs in multiple clouds, but then your new weakest link is the orchestration platform.

mooreds

3 months ago

Disclosure: I work for FusionAuth, a competitor of Authentik.

Curious. Is your solution active-active or active-passive? We've implemented multi-region active-passive CIAM/IAM in our hosted solution[0]. We've found that meets needs of many of our clients.

I'm only aware of one CIAM solution that seems to have active-active: Ory. And even then I think they shard the user data[1].

0: https://fusionauth.io/docs/get-started/run-in-the-cloud/disa...

1: https://www.ory.com/blog/global-identity-and-access-manageme... is the only doc I've found and it's a bit vague, tbh.

vinckr

3 months ago

Hey Dan, appreciate the discussion!

Ory’s setup is indeed true multi-region active-active; not just sharded or active-passive failover. Each region runs a full stack capable of handling both read and write operations, with global data consistency and locality guarantees.

We’ll soon publish a case study with a customer that uses this setup that goes deeper into how Ory handles multi-region deployments in production (latency, data residency, and HA patterns). It’ll include some of the technical details missing from that earlier blog post you linked. Keep an eye out!

There are also some details mentioned here: https://www.ory.com/blog/personal-data-storage

bravetraveler

4 months ago

> I’m unaware of any common and popular distributed IDAM that is reliable

Other clouds, lmao. Same requirements, not the same mistakes. Source: worked for several, one a direct competitor.

barbazoo

4 months ago

Wow, you really *have* to exercise the region failover to know if it works, eh? And that confidence gets weaker the longer it’s been since the last failover I imagine too. Thanks for sharing what you learned.

shdjhdfh

4 months ago

You should assume it will not work unless you test it regularly. That's a big part of why having active/active multi-region is attractive, even though it's much more complex.

shawabawa3

4 months ago

For what it's worth, we were unable to log in with root credentials anyway.

I don't think any method of auth was working for accessing the AWS console.

reenorap

4 months ago

It's a good reminder actually that if you don't test the failover process, you have no failover process. The CTO or VP of Engineering should be held accountable for not making sure that the failover process is tested multiple times a month and should be seamless.

hinkley

4 months ago

Too much armor makes you immobile. Will your security org be held to task for this? This should permanently slow down all of their future initiatives because it’s clear they have been running “faster than possible” for some time.

Who watches the watchers.

ej_campbell

4 months ago

Totally ridiculous that AWS wouldn't by default make it multi-region and warn you heavily that your multi-region service is tied to a single region for identity.

The usability of AWS is so poor.

ct520

4 months ago

I always find it interesting how many large enterprises have all these DR guidelines but fail to ever test. Glad to hear that everything came back alright

ransom1538

4 months ago

People will continue to purchase Multi-AZ and multi-region even though you have proved what a scam it is. If the east region goes down, ALL of Amazon goes down; feel free to change my mind. STOP paying double rates for multi-region.

ozim

4 months ago

Sounds like a lot of companies need to update their BCP after this incident.

michaelcampbell

4 months ago

"If you're able to do your job, InfoSec isn't doing theirs"

0x5345414e

4 months ago

This is having a direct impact on my wellbeing. I was at Whole Foods in Hudson Yards NYC and I couldn’t get the prime discount on my chocolate bar because the system isn’t working. Decided not to get the chocolate bar. Now my chocolate levels are way too low.

tonymet

4 months ago

"alexa turn on coffee pot" stopped working this morning, and I'm going bonkers.

jdlyga

4 months ago

Alexa is super buggy now anyway. I switched my Echo Dot to Alexa+, and it fails turning on and off my Samsung TV all the time now. You usually have to do it twice.

jaggederest

4 months ago

This has been my impetus to do Home Assistant things and I already can tell you that I'm going to spend far more time setting it up and tweaking it than I actually save, but false economy is a tinkerer's best friend. It's pretty impressive what a local LLM setup can do though, and I'm learning that all my existing smart devices are trivially available if anyone gets physical access to my network I guess!

manmal

4 months ago

This is the kind of thing Claude Code (bypassing permissions) shines at. I'm about to set up HA myself and intend to not write a single line of config myself.

thesh4d0w

4 months ago

Most of HA is configured in the gui these days, you won't need to write any config anyways.

Ey7NFZ3P0nzAe

4 months ago

Something I love about HA is that everything in the GUI can always be directly edited using YAML. So you can ask Claude for a v1, then tweak it a bit, then finish with the GUI. And all of this directly from the GUI.

klabetron

4 months ago

Ugh. Reminds me that some time ago Siri stopped responding to “turn off my TV.” Now I have to remember to say “turn off my Apple TV.” (Which with the magic of HDMI CEC turns off my entire system.) Given how groggy I am when I want to turn off the TV, I often forget.

rstupek

4 months ago

I just use an "Alexa, goodnight" to trigger turning off the TV and lights.

GiorgioG

4 months ago

I upgraded to the old Alexa. Alexa+ is a hot pile of crap.

tonymet

4 months ago

I agree. The new LLM is better for dialog and Q&A, but they haven't properly tested intents and IoT integration at all.

clbrmbr

4 months ago

How can this be? I had great luck with GPT-3 way back when… and I didn't have function calling or chat… had to parse the JSON myself, extracting "action" and "response-text" fields… How has this been so hard for AMZN? Is it a matter of token cost and trying to use small models?

tonymet

4 months ago

that's a reasonable theory. they've likely delayed the launch this long due to the inference cost compared to the more basic Alexa engine.

I would also guess the testing is incomplete. Alexa+ is a slow roll out so they can improve precision/recall on the intents with actual customers. Alexa+ is less deterministic than the previous model was wrt intents

drnick1

4 months ago

Someone posting on HN should know better than using Alexa and Samsung TVs. These devices are a unique combination of malware and spyware.

pewpew_

4 months ago

I was attempting to use self checkout for some lunch I grabbed from the hotbar and couldn’t understand why my whole foods barcode was failing. It took me a full 20 seconds to realize the reason for the failure.

dewarrn1

4 months ago

This is a fun example, but now you've got me wondering: has anyone checked on folks who might have been in an Amazon Go store during the outage?

nxpnsv

4 months ago

Life indeed is a struggle

burnt-resistor

4 months ago

First World treatlerite problems. /s What's going to suck is that years from now, after too many SREs/SWEs have long since been fired, like the Morlocks and Eloi or Idiocracy, there won't be anyone left who can figure out that plants need water. There will be a few trillionaires surrounded by aristocratic, unimaginable opulence while most of humanity toils in favelas surrounded by unfixable technology that seems like magic. One cargo cult will worship 5.25" floppy disks and their arch enemies will worship CD-Rs.

https://xkcd.com/2347/

kokanee

4 months ago

We're getting awfully close to that scenario. Like frogs in a warming kettle.

busyant

4 months ago

0th world problems

colechristensen

4 months ago

I had to buy a donut at the gas station with cash, like a peasant.

TZubiri

4 months ago

That's it, internet centralization has gone too far, call your congress(wo)man

JCM9

4 months ago

Have a meeting today with our AWS account team about how we’re no longer going to be “All in on AWS” as we diversify workloads away. Was mostly about the pace of innovation on core services slowing and AWS being too far behind on AI services so we’re buying those from elsewhere.

The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud. Should be a fun meeting!

radium3d

4 months ago

Once you've had an outage on AWS, Cloudflare, Google Cloud, Akismet... what are you going to do? Host in house? None of them seem to be immune from an outage at some point. Get your refund and carry on. It's less work for the same outcome.

CobrastanJorji

4 months ago

Multi-cloud. It's fairly unlikely that AWS and Google Cloud are going to fail at the same time.

radium3d

4 months ago

Yeah, just double++ the cost to have a clone of all your systems. Worth it if you need to guarantee uptime. Although, it also doubles your exposure to potential data breaches as well.

jancsika

4 months ago

> double++

I'd suggest to ++double the cost. Compare:

++double: spoken as "triple" -> team says that double++ was a joke, we can obviously only double the cost -> embarrassingly you quickly agree -> team laughs -> team approves doubling -> you double the cost -> team goes out for beers -> everyone is happy

double++: spoken as "double" -> team quickly agrees and signs off -> you consequently triple the cost per c precedence rules -> manager goes ballistic -> you blithely recount the history of c precedence in a long monotone style -> job returns EINVAL -> beers = 0

dexterdog

4 months ago

And likely far more than double the cost since you have to use the criminally-priced outbound bandwidth to keep everything in sync.

unethical_ban

4 months ago

Shouldn't be double in the long term. Think of the second cloud as a cold standby. Depends on the system. Periodic replication of data layer (object storage/database) and CICD configured to be able to build services and VMs on multiple clouds. Have automatic tests weekly/monthly that represent end-to-end functionality, have scaled tests semi-annually.

This is all very, very hand-wavey. And if one says "golly gee, all our config is too cloud specific to do multi-cloud" then you've figured out why cloud blows and that there is no inherent reason not to have API standards for certain mature cloud services like serverless functions, VMs and networks.

Edit to add: I know how grossly simplified this is, and that most places have massively complex systems.

jimbob45

4 months ago

And data egress fees just to get the clone set up, right? This doesn’t seem feasible as a macrostrategy. Maybe for a small number of critical services.

yeswecatan

4 months ago

How do you handle replication lag for databases?

zacmps

4 months ago

If you use something like cockroachdb you can have a multi-master cluster and use regional-by-row tables to locate data close to users. It'll fail over fine to other regions if needed.

Breza

4 months ago

Why not host in house? If you have an application with stable resource needs, it can often be the cheaper and more stable option. At a certain scale, you can buy the servers, hire a sysadmin, and still spend less money than relying on AWS.

If you have an app that experiences 1000x demand spikes at unpredictable times then sure, go with the cloud. But there are a lot of companies that would be better off if they seriously considered their options before choosing the cloud for everything.

grogers

4 months ago

Certainly if you aren't even multi-region, then multi-cloud is a pipe dream

bean469

4 months ago

> What are you going to do? Host in house?

Yep. Although it's just anecdata, it's what we do where I work - we haven't had the slightest issue in years.

nxpnsv

4 months ago

Cheaper, faster, and in-house people understand what's going on. It should be a given for many services, but somehow it's not.

Breza

4 months ago

I totally agree with you. Where I work, we self-host almost everything. Exceptions are we use a CDN for one area where we want lower latency, and we use BigQuery when we need to parse a few billion datapoints into something usable.

It's amazing how few problems we have. Honestly, I don't think we have to worry about configuration issues as often as people who rely on the cloud.

erikpukinskis

4 months ago

On premise? Or do you build servers in a data center? Or do you lease dedicated servers?

bean469

3 months ago

We have our own data center with servers. The upfront costs are high, but it was worth it in our use-case

Breza

4 months ago

Not GP, but my company also self-hosts. We rent rackspace in a colo. We used to keep my team's research server in the back closet before we went full-remote.

cakeday

4 months ago

> Host in house?

Yes, mostly.

cmiles8

4 months ago

This. When Andy Jassy got challenged by analysts on the last earnings call on why AWS has fallen so far behind on innovation in some areas, his answer was a hand-wavy response that diverted attention by saying AWS is durable, stable, and reliable and that customers care more about that. Oops.

judahmeek

4 months ago

behind on innovation how exactly?

sharpy

4 months ago

The culture changed. When I first worked there, I was encouraged to take calculated risks. When I did my second tour of duty, people were deathly afraid of bringing down services. It has been a while since my second tour of duty, but I don't think it's back to "Amazon is a place where builders can build".

everfrustrated

4 months ago

Somewhat inevitable for any company as they get larger. Easy to move fast and break things when you have 1 user and no revenue. Very different story when much of US commerce runs on you.

AbstractH24

4 months ago

For folks who came of age in the late 00's, seeing companies once thought of as disruptors and innovators become the old stalwarts post-pandemic/ZIRP has been quite an experience.

Maybe those who have been around longer have seen this before, but it's the first time for me.

oblio

4 months ago

It's easy to be a hero when the going is easy.

llmslave

4 months ago

If you bring something down in a real way, you can forget about someone trusting you with a big project in the future. You basically need to switch orgs

chaostheory

4 months ago

Curious. When did AWS hit “Day Two”, or what year was your 2nd tour of duty?

sharpy

4 months ago

When they added the CM bar raiser, I felt like it hit day 2. When was that? 2014ish?

RedShift1

4 months ago

I've never heard tour of duty being used outside of the military, is it really that bad over at AWS it has to be called that?

sharpy

4 months ago

Nah, I used to work for defense contractors, and worked with ex-military people, so...

Anyway, I actually loved my first time at AWS. Which is why I went back. My second stint wasn't too bad, but I probably wouldn't go back, unless they offered me a lot more than what I get paid, but that is unlikely.

JCM9

4 months ago

I listened to the earnings call. I believe the question was mostly focused on why AWS has been so behind on AI. Jassy did flub the question quite badly and rambled on for a while. The press has mentioned the botched answer in a few articles recently.

etothet

4 months ago

They have been pushing me and my company extremely hard to vet their various AI-related offerings. When we decide to look into whatever service it is, we come away underwhelmed. It seems like their biggest selling point so far is "we'll give it to you free for several months". Not great.

nijave

4 months ago

>we come away underwhelmed

In fairness, that's been my experience with everyone except OpenAI and Anthropic where I only occasionally come out underwhelmed

Really I think AWS does a fairly poor job bringing new services to market and it takes a while for them to mature. They excel much more in the stability of their core/old services--especially the "serverless" variety like S3, SQS, Lambda, EC2-ish, RDS-ish (well, today notwithstanding)

JCM9

4 months ago

I honestly feel bad for the folks at AWS whose job it is to sell this slop. I get AWS is in panic mode trying to catch up, but it’s just awful and frankly becoming quite exhausting and annoying for customers.

enjo

4 months ago

AWS was gutted by layoffs over the last couple of years. None of this is surprising.

mcmcmc

4 months ago

Why feel bad for them when they don’t? The paychecks and stock options keep them plenty warm at night.

JCM9

4 months ago

The comp might be decent but most folks I know that are still there say they’re pretty miserable and the environment is becoming toxic. A bit more pay only goes so far.

throw-the-towel

4 months ago

Sorry, "becoming" toxic? Amazon has been famous for being toxic since forever.

hansvm

4 months ago

It's a perspective issue. Amazon designs the first year to not "feel" toxic to most people. Thereafter, any semblance of propriety disappears.

worik

4 months ago

> stock options

Timing.

If Amazon has peaked then they will not be worth much. Shares go down. Even in rising markets shares of failing companies go down...

Mind tho, Amazon has so much mind share they will need to fail harder to fail totally...

ifwinterco

4 months ago

Everything except us-east-1 is generally pretty reliable. At $work we have a lot of stuff that's only on eu-west-1 (yes not the best practice) and we haven't had any issues, touch wood

ttul

4 months ago

My impression is that `us-east-1` has the worst reliability track record of any region. We've always run our stuff in `us-west-2` and there has never been an outage that took us down in that region. By contrast, a few things that we had in `us-east-1` have gone down repeatedly.

hnfong

4 months ago

Just curious, what's special about us-east-1?

stego-tech

4 months ago

It’s the “original” AWS region. It has the most legacy baggage, the most customer demand (at least in the USA), and it’s also the region that hosts the management layer of most “global” services. Its availability has also been dogshit, but because companies only care about costs today and not harms tomorrow, they usually hire or contract out to talent that similarly only cares about the bottom line today and throws stuff into us-east-1 rather than figure out AZs and regions.

The best advice I can give to any org in AWS is to get out of us-east-1. If you use a service whose management layer is based there, make sure you have break-glass processes in place or, better yet, diversify to other services entirely to reduce/eliminate single points of failure.

dijit

4 months ago

I have a joke from 15 years ago, where I compared my friend who flaked out all the time as "having less availability than US-EAST-1".

This is not a new issue caused by improper investment, it's always been this way.

riknos314

4 months ago

Former AWS employee here. There's a number of reasons but it mostly boils down to:

It's both the oldest and largest (most ec2 hosts, most objects in s3, etc) AWS region, and due to those things it's the region most likely to encounter an edge case in prod.

mvkel

4 months ago

It's closest to "geographical center" so traffic from Europe feels faster than us-west

tete

4 months ago

> The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud. Should be a fun meeting!

This isn't true and never was. I've done setups in the past where monitoring happened "multi cloud", with multiple dedicated servers as well. It was broad enough that you could actually see where things broke.

Was quite some time ago so I don't have the data, but AWS never came out on top.

It actually matched largely with what netcraft.com put out. Not sure if they still do that and release those things to the public.

testplzignore

4 months ago

Netcraft confirmed it? I haven't heard that name since the Slashdot era :)

dotancohen

3 months ago

Get off my lawn you insensitive clod!

chaostheory

4 months ago

This makes sense given all the open source projects coming out of Netflix like chaos monkey.

eric-hu

4 months ago

Which cloud provider came out on top?

llmslave

4 months ago

AWS has been in long-term decline; most of the platform is just in keeping-the-lights-on mode. It's also why they are behind on AI: a lot of would-be innovative employees get crushed under red tape and performance management.

nextworddev

4 months ago

Good thing they are the biggest investor into Anthropic

GoblinSlayer

4 months ago

But then you will be affected by outages of every dependency you use.

caymanjim

4 months ago

This is the real problem. Even if you don't run anything in AWS directly, something you integrate with will. And when us-east-1 is down, it doesn't matter if those services are in other availability zones. AWS's own internal services rely heavily on us-east-1, and most third-party services live in us-east-1.

It really is a single point of failure for the majority of the Internet.

dexterdog

4 months ago

This becomes the reason to run in us-east-1 if you're going to be single region. When it's down nobody is surprised that your service is affected. If you're all-in on some other region and it goes down you look like you don't know what you're doing.

kelseydh

4 months ago

This whole incident has been pretty uneventful down in Australia where everything AWS is on ap-southeast-2.

parliament32

4 months ago

> Even if you don't run anything in AWS directly, something you integrate with will.

Why would a third-party be in your product's critical path? It's like the old business school thing about "don't build your business on the back of another"

caymanjim

4 months ago

It's easy to say this, but in the real world, most of the critical path is heavily-dependent on third party integrations. User auth, storage, logging, etc. Even if you're somewhat-resilient against failures (i.e. you can live without logging and your app doesn't hard fail), it's still potentially going to cripple your service. And even if your entire app is resilient and doesn't fail, there are still bound to be tons of integrations that will limit functionality, or make the app appear broken in some way to users.

The reason third-party things are in the critical path is because most of the time, they are still more reliable than self-hosting everything; because they're cheaper than anything you can engineer in-house; because no app is an island.

It's been decades since I worked on something that was completely isolated from external integrations. We do the best we can with redundancy, fault tolerance, auto-recovery, and balance that with cost and engineering time.

If you think this is bad, take a look at the uptime of complicated systems that are 100% self-hosted. Without a Fortune 500 level IT staff, you can't beat AWS's uptime.

fauigerzigerk

4 months ago

Clearly these are non-trivial trade-offs, but I think using third parties is not an either or question. Depending on the app and the type of third-party service, you may be able to make design choices that allow your systems to survive a third-party outage for a while.

E.g., a hospital could keep recent patient data on-site and sync it up with the central cloud service as and when that service becomes available. Not all systems need to be linked in real time. Sometimes it makes sense to create buffers.

But the downside is that syncing things asynchronously creates complexity that itself can be the cause of outages or worse data corruption.

I guess it's a decision that can only be made on a case by case basis.

jen20

4 months ago

With the exception of Amazon, anyone in this situation already has a third-party product in their critical path - AWS itself.

chasd00

4 months ago

> Why would a third-party be in your product's critical path?

i bet only 1-2% of AI startups are running their own models and the rest are just bouncing off OpenAI, Azure, or some other API.

thinkindie

4 months ago

Not necessarily our critical path, but today CircleCI was affected greatly, which also affected our capacity to deploy. Luckily it was a Monday morning, so we didn't even have to deploy a hot fix.

pcdevils

4 months ago

That's nearly every ai start-up done for

macintux

4 months ago

No man is an island, entire of itself

user

4 months ago

[deleted]

unethical_ban

4 months ago

* IAM / Okta
* Cloud VPN services
* Cloud Office (GSuite, Office365)

Good luck naming a large company, bank, even utility that doesn't have some kind of dependency like this somewhere, even if they have mostly on-prem services.

mlavrent

4 months ago

The only ones I can really think of are the cloud providers themselves- I was at Microsoft, and absolutely everything was in-house (often to our detriment).

parliament32

4 months ago

I think you missed the "critical path" part. Why would your product stop functioning if your admins can't log in with IAM / VPN in, do you really need hands-on maintenance constantly? Why would your product stop functioning if Office is down, are you managing your ops in Excel or something?

"Some kind of dependency" is fine and unavoidable, but well-architected systems don't have hard downtime just because someone somewhere you have no control over fucked up.

unethical_ban

4 months ago

Since 2020, for some reason, a lot of companies have a fully remote workforce. If the VPN or auth goes down and workers can't log in, that's a problem. Think banks, call center work, customer service.

1-6

4 months ago

Glad that you're taking the first step toward resiliency. At times, big outages like these are necessary to give a good reason why the company should Multicloud. When things are working without problems, no one cares to listen to the squeaky wheel.

morshu9001

4 months ago

This was a single region outage, right? If you aren't cross-region, cross-cloud is the same but harder

jen20

4 months ago

I would be interested in a follow up in 2-3 years as to whether you've had fewer issues with a multi-cloud setup than just AWS. My suspicion is that will not be the case.

lootgraft

4 months ago

> The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud.

If an internal "AWS team" then this translates to "I am comfortable using this tool, and am uninterested in having to learn an entirely new stack."

If you have to diversify your cloud workloads give your devops team more money to do so.

ej_campbell

4 months ago

Aren't you deployed in multiple regions?

BoredPositron

4 months ago

Still no serverless inference for models or inference pipelines that aren't available on Bedrock, and still no auto-scaling GPU workers. We started bothering them in 2022... crickets.

wrasee

4 months ago

Please tell me there was a mixup and for some reason they didn’t show up.

indoordin0saur

4 months ago

Seems like major issues are still ongoing. If anything it seems worse than it did ~4 hours ago. For reference I'm a data engineer and it's Redshift and Airflow (AWS managed) that is FUBAR for me.

markus_zhang

4 months ago

It has been quite a while, wondering how many 9s are dropped.

365 day * 24 * 0.0001 is roughly 8 hours, so it already lost the 99.99% status.

rdtsc

4 months ago

9s don’t have to drop if you increase the time period! “We still guarantee the same 9s just over 3450 years now”.

rvba

4 months ago

At a company where I worked, the tool measuring downtime was on the same server, so even if the server was down it still showed 100% up.

If the server didn't work, the tool to measure it didn't work either! Genius.

bityard

4 months ago

This happened to AWS too.

February 28, 2017. S3 went down and took down a good portion of AWS and the Internet in general. For almost the entire time that it was down, the AWS status page showed green because the up/down metrics were hosted on... you guessed it... S3.

https://aws.amazon.com/message/41926/

CaptainOfCoit

4 months ago

hinkley

4 months ago

Five times is no longer a couple. You can use stronger words there.

bapak

4 months ago

It happened a murder of times.

hinkley

4 months ago

Ha! Shall I bookmark this for the eventual wiki page?

Scoundreller

4 months ago

Have we ever figured out what “red” means? I understand they’ve only ever gone to yellow.

kokanee

4 months ago

If it goes red, we aren't alive to see it

Cthulhu_

4 months ago

I'm sure we need to go to Blackwatch Plaid first.

decimalenough

4 months ago

I used to work at a company where the SLA was measured as the percentage of successful requests on the server. If the load balancer (or DNS or anything else network) was dropping everything on the floor, you'd have no 500s and 100% SLA compliance.

conductr

4 months ago

Similar to hosting your support ticketing system with same infra. "What problem? Nobody's complaining"

hinkley

4 months ago

I’ve been customer for at least four separate products where this was true.

I can’t explain why Saucelabs was the most grating one, but it was. I think it’s because they routinely experienced 100% down for 1% of customers, and we were in that one percent about twice a year. <long string of swears omitted>

bigiain

4 months ago

I spent enough time ~15 years back to find an external monitoring service that did not run on AWS and looked like a sustainable business instead of a VC fueled acquisition target - for our belts-n-braces secondary monitoring tool since it's not smart to trust CloudWatch to be able to send notifications when it's AWS's shit that's down.

Sadly while I still use that tool a couple of jobs/companies later - I no longer recommend it because it migrated to AWS a few years back.

(For now, my out-of-AWS monitoring tool is a bunch of cron jobs running on a collection of various inexpensive VPSes and on my and other devs' home machines.)
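A probe of that sort can be almost trivially small. The sketch below is a hypothetical stdlib-only version of such a cron job; the URL and the alerting path are placeholders.

```python
#!/usr/bin/env python3
# Hypothetical stdlib-only probe for a cron job on a cheap VPS: fetch an endpoint
# from outside the cloud provider and exit non-zero if it's down or slow.
# The URL is a placeholder; ideally the alert path (cron MAILTO, SMS gateway, ...)
# also avoids the provider being monitored.
import sys
import time
import urllib.request
from urllib.error import URLError

URL = "https://status-probe.example.com/healthz"  # placeholder endpoint
TIMEOUT_S = 10
SLOW_S = 3.0


def probe(url):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_S) as resp:
            resp.read()  # make sure the body actually arrives
        elapsed = time.monotonic() - start
        if elapsed > SLOW_S:
            return False, f"slow response: {elapsed:.1f}s"
        return True, f"ok in {elapsed:.1f}s"
    except (URLError, OSError) as exc:
        # urlopen raises HTTPError (a URLError subclass) for 4xx/5xx responses,
        # and URLError/OSError for DNS failures, timeouts, resets, etc.
        return False, str(exc)


if __name__ == "__main__":
    ok, detail = probe(URL)
    print(f"{URL}: {detail}")
    if not ok:
        sys.exit(1)  # non-zero exit lets cron's MAILTO (or a wrapper) do the alerting
```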

6031769

4 months ago

Nagios is still a thing and you can host it wherever you like.

bigiain

4 months ago

Interestingly, the reason I originally looked for and started using it was an unapproved "shadow IT" response to an in-house Nagios setup that was configured and managed so badly it had _way_ more downtime than any of the services I'd get shouted about at if customers noticed them down before we did...

(No disrespect to Nagios, I'm sure a competently managed installation is capable of being way better than what I had to put up with.)

AbstractH24

4 months ago

If its not on the dashboard, it didn't happen

echelon

4 months ago

Common SLA windows are hour, day, week, month, quarter, and year. They're out of SLA for all of those now.

When your SLA holds within a joke SLA window, you know you goofed.

"Five nines, but you didn't say which nines. 89.9999...", etc.

SlightlyLeftPad

4 months ago

These are typically calculated system-wide, so if you include all regions, technically only a fraction of customers are impacted.

alkhimey

4 months ago

Customers in all regions were affected…

prmoustache

4 months ago

Indirectly yes but not directly.

Our only impact was some atlassian tools.

captainkrtek

4 months ago

I shoot for 9 fives of availability.

dare944

4 months ago

5555.55555% Really stupendous availableness!!!

Veserv

4 months ago

You added a zero. There are ~8760 hours per year, so 8 hours is ~1 in 1000, 99.9%.

nine_k

4 months ago

An outage like this does not happen every year. The last big outage happened in December 2021, roughly 3 years 10 months = 46 months ago.

The duration of the outage in relation to that uptime is (8 h / 33602 h) * 100% = 0.024%, so the uptime is 99.976%, slightly worse than 99.99%, but clearly better than 99.90%.

They used to be five nines, and people used to say that it's not worth the while to prepare for an outage. With less than four nines, the perception might shift, but likely not enough to induce a mass migration to outage-resistant designs.
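Working the same arithmetic as the two comments above (the ~8-hour outage duration and ~46-month window are the thread's rough figures, not official numbers):

```python
# Quick back-of-the-envelope: allowed downtime per year at each "nines" level,
# and the availability implied by ~8 hours of outage over one year vs. ~46 months.
HOURS_PER_YEAR = 365 * 24  # 8760

for nines, avail in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    allowed_h = HOURS_PER_YEAR * (1 - avail)
    print(f"{nines}: {allowed_h:.2f} hours of downtime allowed per year")

outage_h = 8           # rough duration of this event, per the thread
window_h = 46 * 730    # ~46 months at ~730 h/month, per the comment above
print(f"over 1 year:    {100 * (1 - outage_h / HOURS_PER_YEAR):.3f}%")
print(f"over 46 months: {100 * (1 - outage_h / window_h):.3f}%")
```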

hinkley

4 months ago

Won’t the end result be people keeping more servers warm in other AWS regions which means Amazon profits from their own fuckups?

pinkgolem

4 months ago

There was a pretty big outage in 2023.

codeduck

4 months ago

I'm sure they'll find some way to weasel out of this.

DevelopingElk

4 months ago

For DynamoDB, I'm not sure but I think it's covered. https://aws.amazon.com/dynamodb/sla/. "An "Error" is any Request that returns a 500 or 503 error code, as described in DynamoDB". There were tons of 5XX errors. In addition, this calculation uses the percentage of successful requests, so even partial degradation counts against the SLA.

From reading the EC2 SLA I don't think this is covered. https://aws.amazon.com/compute/sla/

The reason is the SLA says "For the Instance-Level SLA, your Single EC2 Instance has no external connectivity.". Instances that were already created kept working, so this isn't covered. The SLA doesn't cover creation of new instances.
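As a rough illustration of how an error-rate SLA like DynamoDB's gets evaluated — availability measured as the fraction of non-5xx requests over the billing month — here's a sketch with made-up request counts and an example 99.99% commitment (not AWS's actual figures for this event):

```python
# Illustrative only: error-rate SLA math. Availability = successful requests /
# total requests over the billing month, where "errors" are 5xx responses.
# Request counts and the committed availability are made-up numbers.
def monthly_availability(total_requests: int, errors_5xx: int) -> float:
    return 100.0 * (total_requests - errors_5xx) / total_requests


total = 2_000_000_000   # hypothetical requests this billing month
errors = 9_500_000      # hypothetical 500/503 responses during the outage

availability = monthly_availability(total, errors)
committed = 99.99       # example SLA commitment

print(f"measured availability: {availability:.3f}%")
if availability < committed:
    print("below commitment -> eligible to file for service credits")
```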

alex_young

4 months ago

It's not down time, it's degradation. No outage, just degradation of a fraction[0] of the resources.

[0] Fraction is ~ 1

indoordin0saur

4 months ago

This 100% seems to be what they're saying. I have not been able to get a single Airflow task to run since 7 hours ago. Being able to query Redshift only recently came back online. Despite this all their messaging is that the downtime was limited to some brief period early this morning and things have been "coming back online". Total lie, it's been completely down for the entire business day here on the east coast.

randomname11

4 months ago

We continue to see early signs of progress!

Keyframe

4 months ago

It doesn't count. It's not downtime, it's unscheduled maintenance event.

8organicbits

4 months ago

Check the terms of your contract. The public terms often only offer partial service credit refunds, if you ask for it, via a support request.

hinkley

4 months ago

If you aren’t making $10 for every dollar you pay Amazon you need to look at your business model.

The refund they give you isn’t going to dent lost revenue.

hinkley

4 months ago

Where were you guys the other day when someone was calling me crazy for trying to make this same sort of argument?

abraae

4 months ago

I haven't done any RFP responses for a while, but this question always used to make me furious. Our competitors (some of whom had had major incidents in the past) claimed 99.99% availability or more, knowing they would never have to prove it, and knowing they were actually 100% until the day they weren't.

We were more honest, and it probably cost us at least once in not getting business.

d1sxeyes

4 months ago

An SLA is a commitment, and an RFP is a business document, not a technical one. As an MSP, you don’t think in terms of “what’s our performance”, you think of “what’s the business value”.

If you as a customer ask for 5 9s per month, with service credit of 10% of at-risk fees for missing on a deal where my GM is 30%, I can just amortise that cost and bake it into my fee.

procaryote

4 months ago

it's a matter of perspective... 9.9999% is real easy

dgoldstein0

4 months ago

Only if you remember to spend your unavailability budget

hvb2

4 months ago

It's a single region?

I don't think anyone would quote availability as availability across every region they're in?

While this is their most important region, there are a lot of clients that are probably unaffected if they're not in us-east-1.

They COULD be affected even if they don't have anything there, because of the AWS services relying on it. I'm just saying that most customers that are multi-region should have their east region failed over and just be humming along.

reactordev

4 months ago

It’s THE region. All of AWS operates out of it. All other regions bow before it. Even the government is there.

idontwantthis

4 months ago

"The Cloud" is just a computer that you don't own that's located in Reston, VA.

hinkley

4 months ago

The Rot Starts at the Head.

derektank

4 months ago

AWS GovCloud East is actually located in Ohio IIRC. Haven't had any issues with GovCloud West today; I'm pretty sure they're logically separated from the commercial cloud.

vasco

4 months ago

> All of AWS operates out of it.

I don't think this is true anymore. In the early days, bad enough outages in us-east-1 would bring down everything because some metadata / control plane stuff was there (I remember getting affected while in other regions), but it's been many years since that has happened.

Today, for example, no issues. I just avoid us-east-1 and everyone else should too. It's their worst region by far in terms of reliability because they launch all the new stuff there and are always messing it up.

Root_Denied

4 months ago

A secondary problem is that a lot of the internal tools are still on US East, so likely the response work is also being impacted by the outage. Been a while since there was a true Sev1 LSE (Large Scale Event).

phinnaeus

4 months ago

What the heck? Most internal tools were in Oregon when I worked in BT pre 2021.

Root_Denied

4 months ago

The primary ticketing system was up and down apparently, so tcorp/SIM must still have critical components there.

reactordev

4 months ago

tell me it isn't true while telling me there isn't an outage across AWS because us-east-1 is down...

vasco

4 months ago

I help run quite a big operation in a different region and had zero issues. And this has happened many times before.

reactordev

4 months ago

If that were true, you’d be seeing the same issues we are in us-west-1 as well. Cheers.

alkhimey

4 months ago

Global services such as STS have regional endpoints, but is it really that common to hit a specific regional endpoint rather than use the default?

hamburglar

4 months ago

The regions are independent, so you measure availability for each on its own.

logifail

4 months ago

Except if they aren't quite as independent as people thought

hamburglar

4 months ago

Well that’s the default pattern anyway. When I worked in cloud there were always some services that needed cross-regional dependencies for some reason or other and this was always supposed to be called out as extra risk, and usually was. But as things change in a complex system, it’s possible for long-held assumptions about independence to change and cause subtle circular dependencies that are hard to break out of. Elsewhere in this thread I saw someone mentioning being migrated to auth that had global dependencies against their will, and I groaned knowingly. Sometimes management does not accept “this is delicate and we need to think carefully” in the midst of a mandate.

I do not envy anyone working on this problem today.

oxfordmale

4 months ago

But it is only a partial outage, so it doesn't count. If you retry a million times everything still works /s

outworlder

4 months ago

I'm wondering why your and other companies haven't just evicted themselves from us-east-1. It's the worst region for outages and it's not even close.

Our company decided years ago to use any region other than us-east-1.

Of course, that doesn't help with services that are 'global', which usually means us-east-1.

andrewl-hn

4 months ago

Several reasons, really:

1. The main one: it's the cheapest region, so when people select where to run their services they pick it because "why pay more?"

2. It's the default. Many tutorials and articles online show it in the examples, many deployment and other devops tools use it as a default value.

3. Related to n.2. AI models generate cloud configs and code examples with it unless asked otherwise.

4. Its location makes it Europe-friendly, too. If you have a small service and you'd like to capture a European and North American audience from a single location, us-east-1 is a very good choice.

5. Many Amazon features are available in that region first and then spread out to other locations.

6. It's also a region where other cloud providers and hosting companies offer their services. Often there's space available in a data center not far from AWS-running racks. In hybrid cloud scenarios where you want to connect bits of your infrastructure running on AWS and on some physical hardware by a set of dedicated fiber optic lines us-east-1 is the place to do it.

7. Yes, for AWS deployments it's an experimental location that has higher risks of downtime compared to other regions, but in practice when a sizable part of us-east-1 is down other AWS services across the world tend to go down, too (along with half of the internet). So, is it really that risky to run over there, relatively speaking?

It's the world's default hosting location, and today's outages show it.

derefr

4 months ago

> it's the cheapest region

In every SKU I've ever looked at / priced out, all of the AWS NA regions have ~equal pricing. What's cheaper specifically in us-east-1?

> Europe-friendly

Why not us-east-2?

> Many Amazon features are available in that region first and then spread out to other locations.

Well, yeah, that's why it breaks. Using not-us-east-1 is like using an LTS OS release: you don't get the newest hotness, but it's much more stable as a "build it and leave it alone" target.

> It's also a region where other cloud providers and hosting companies offer their services. Often there's space available in a data center not far from AWS-running racks.

This is a better argument, but in practice, it's very niche — 2-5ms of speed-of-light delay doesn't matter to anyone but HFT folks; anyone else can be in a DC one state away with a pre-arranged tier1-bypassing direct interconnect, and do fine. (This is why OVH is listed on https://www.cloudinfrastructuremap.com/ despite being a smaller provider: their DCs have such interconnects.)

For that matter, if you want "low-latency to North America and Europe, and high-throughput lowish-latency peering to many other providers" — why not Montreal [ca-central-1]? Quebec might sound "too far north", but from the fiber-path perspective of anywhere else in NA or Europe, it's essentially interchangeable with Virginia.

leptons

4 months ago

Lots of stuff is priced differently.

Just go to the EC2 pricing page and change from us-east-1 to us-west-1

https://aws.amazon.com/ec2/pricing/on-demand/

luhn

4 months ago

us-west-1 is the one outlier. us-east-1, us-east-2, and us-west-2 are all priced the same.

leptons

3 months ago

There are many other AWS regions than the ones you listed, and many different prices.

AbstractH24

4 months ago

This seems like a flaw Amazon needs to fix.

Incentivize the best behaviors.

Or is there a perspective I don't see?

leptons

4 months ago

How is it a flaw!? Building datacenters in different regions comes with very different costs, and different costs to run. Power doesn't cost exactly the same in different regions. Local construction services are not priced exactly the same everywhere. Insurance, staff salaries, etc, etc... it all adds up, and the costs are not the same everywhere. It only makes sense that services would cost different amounts in different regions. Not sure how you're missing these easy-to-realize facts of life.

AbstractH24

4 months ago

I think the cost of a day like Monday due to over relying on a single location outweighs that

leptons

4 months ago

What happened on Monday has nothing to do with why services cost different prices in different regions.

AbstractH24

4 months ago

No, but it does reflect the dangers of incentivizing everyone to use a single region.

Most people (myself included) only choose it because it's the cheapest. If multiple regions were the same price then there'd be less impact if one goes down.

leptons

4 months ago

The problems with us-east-1 have been apparent for a long time, many years. I started using us-east-1 long ago, and after seeing the problems there, I moved everything to us-west-1 and stopped having those problems. EC2 instances were completely unreliable in us-east-1 (we were running hundreds to thousands at a time), not so in us-west-1. The error rates we were seeing in us-east-1 were awful.

A negligible cost difference shouldn't matter when your apps are unstable due to the region being problematic.

AbstractH24

4 months ago

> A negligible cost difference shouldn't matter when your apps are unstable due to the region being problematic.

Agreed, but a sizable cohort of people don't have the foresight or incentives to think past their nose, and they just click the cheapest option.

So it's on Amazon to incentivize what's best.

leptons

3 months ago

People's lack of curiosity, enough to not even explore the other options, is not Amazon's problem.

dclowd9901

4 months ago

> 5. Many Amazon features are available in that region first and then spread out to other locations.

This is the biggest one, isn't it? I thought Route 53 isn't even available in any other region.

jedberg

4 months ago

Some AWS services are only available in us-east-1. Also a lot of people have not built their infra to be portable and the occasional outage isn't worth the cost and effort of moving out.

bartread

4 months ago

> the occasional outage isn't worth the cost and effort of moving out.

And looked at from the perspective of an individual company, as a customer of AWS, the occasional outage is usually an acceptable part of doing business.

However, today we’ve seen a failure that has wiped out a huge number of companies used by hundreds of millions - maybe billions - of people, and obviously a huge number of companies globally all at the same time. AWS has something like 30% of the infra market so you can imagine, and most people reading this will to some extent have experienced, the scale of disruption.

And the reality is that whilst bigger companies, like Zoom, are getting a lot of the attention here, we have no idea what other critical and/or life and death services might have been impacted. As an example that many of us would be familiar with, how many houses have been successfully burgled today because Ring has been down for around 8 out of the last 15 hours (at least as I measure it)?

I don’t think that’s OK, and I question the wisdom of companies choosing AWS as their default infra and hosting provider. It simply doesn’t seem to be very responsible to be in the same pond as so many others.

Were I a legislator I would now be casting a somewhat baleful eye at AWS as a potentially dangerous monopoly, and see what I might be able to do to force organisations to choose from amongst a much larger pool of potential infra providers and platforms, and I would be doing that because these kinds of incidents will only become more serious as time goes on.

jedberg

4 months ago

You're suffering from survivorship bias. You know the old adage about the bullet holes in the planes: someone pointed out that you should reinforce the parts without bullet holes, because the planes you're looking at are only the ones that came back.

It's the same thing here. Do you think other providers are better? If people moved to other providers, things would still go down, more likely than not it would be more downtime in aggregate, just spread out so you wouldn't notice as much.

At least this way, everyone knows why it's down, our industry has developed best practices for dealing with these kinds of outages, and AWS can apply their expertise to keeping all their customers running as long as possible.

Perseids

4 months ago

> If people moved to other providers, things would still go down, more likely than not it would be more downtime in aggregate, just spread out so you wouldn't notice as much.

That is the point, though: Correlated outages are worse than uncorrelated outages. If one payment provider has an outage, chose another card or another store and you can still buy your goods. If all are down, no one can shop anything[1]. If a small region has a power blackout, all surrounding regions can provide emergency support. If the whole country has a blackout, all emergency responders are bound locally.

[1] Except with cash – might be worth keeping a stash handy for such purposes.
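
To put rough numbers on the correlation point (a toy model only, assuming fully independent failures, which real providers only approximate):

    # Toy illustration: two services that are each unavailable 0.1% of the time.
    # Independent failures rarely overlap; a shared dependency makes them overlap
    # every time that dependency fails.
    p = 0.001                          # per-service unavailability (0.1%)

    p_both_down_independent = p * p    # ~0.0001% of the time
    p_both_down_shared = p             # same provider: they go down together

    print(f"independent providers: {p_both_down_independent:.6%}")
    print(f"shared provider:       {p_both_down_shared:.3%}")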

bartread

4 months ago

Yeah, exactly this. I don’t know why the person who responded to me is talking about survivorship bias… and I suppose I don’t really care because there’s a bigger point.

The internet was originally intended to be decentralised. That decentralisation begets resilience.

That’s exactly the opposite of what we saw with this outage. AWS has give or take 30% of the infra market, including many nationally or globally well known companies… which meant the outage caused huge global disruption of services that many, many people and organisations use on a day to day basis.

Choosing AWS, squinted at through a somewhat particular pair of operational and financial spectacles, can often make sense. Certainly it’s a default cloud option in many orgs, and always in contention to be considered by everyone else.

But my contention is that at a higher level than individual orgs - at a societal level - that does not make sense. And it’s just not OK for government and business to be disrupted on a global scale because one provider had a problem. Hence my comment on legislators.

It is super weird to me that, apparently, that’s an unorthodox and unreasonable viewpoint.

But you’ve described it very elegantly: 99.99% (or pick the number of 9s you want) uptime with uncorrelated outages is way better than that same uptime with correlated, and particularly heavily correlated, outages.

bartread

4 months ago

That’s a pretty bold claim. Where’s your data to back it up?

More importantly you appear to have misunderstood the scenario I’m trying to avoid, which is the precise situation we’ve seen in the past 24 hours where a very large proportion of internet services go down all at the same time precisely because they’re all using the same provider.

And then finally the usual outcome of increased competition is to improve the quality of products and services.

I am very aware of the WWII bomber story, because it’s very heavily cited in corporate circles nowadays, but I don’t see that it has anything to do with what I was talking about.

AWS is chosen because it’s an acceptable default that’s unlikely to be heavily challenged either by corporate leadership or by those on the production side because it’s good CV fodder. It’s the “nobody gets fired for buying IBM” of the early mid-21st century. That doesn’t make it the best choice though: just the easiest.

And viewed at a level above the individual organisation - or, perhaps from the view of users who were faced with failures across multiple or many products and services from diverse companies and organisations - as with today (yesterday!) we can see it’s not the best choice.

mk89

4 months ago

This is an assumption.

Reality is, though, that you shouldn't put all your eggs in the same basket. And it was indeed the case before the cloud. One service going down would have never had this cascade effect.

I am not even saying "build your own DC", but we barely have resiliency if we all rely on the same DC. That's just dumb.

ytpete

4 months ago

From the standpoint of nearly every individual company, it's still better to go with a well-known high-9s service like AWS than smaller competitors though. The fact that it means your outages will happen at the same time as many others is almost like a bonus to that decision — your customers probably won't fault you for an outage if everyone else is down too.

That homogeneity is a systemic risk that we all bear, of course. It feels like systemic risks often arise that way, as an emergent result from many individual decisions each choosing a path that truly is in their own best interests.

bartread

4 months ago

Yeah, but this is exactly not what the internet is supposed to be. It’s supposed to be decentralised. It’s supposed to be resilient.

And at this point I’m looking at the problem and thinking, “how do we do that other than by legislating?”

Because left to their own devices a concerningly large number of people across many, many organisations simply follow the herd.

In the midst of a degrading global security situation I would have thought it would be obvious why that’s a bad idea.

twistedpair

4 months ago

Services like SES Inbound are only available in 2x US regions. AWS isn't great about making all services available in all regions :/

zamalek

4 months ago

We're on Azure and they are worse in every aspect: bad deployment of services, and status pages that are more about PR than engineering.

At this point, is there any cloud provider that doesn't have these problems? (GCP is a non-starter because a false-positive YouTube TOS violation get you locked out of GCP[1]).

[1]: https://9to5google.com/2021/02/26/stadia-port-of-terraria-ca...

tapoxi

4 months ago

Don't worry there was a global GCP outage a few months ago

ecshafer

4 months ago

Global auth is and has been a terrible idea.

derefr

4 months ago

[flagged]

hshdhdhj4444

4 months ago

That’s an incredibly long comment that does nothing to explain why a YouTube ToS violation should lead to someone’s GCP services being cut off.

Also, Steve Jobs already wrote your comment better. You should have just stolen it. “You’re holding it wrong”.

derefr

4 months ago

[flagged]

zamalek

4 months ago

Are you warned about the risks in an active war zone? Yes.

Does Google warn you about this when you sign up? No.

And PayPal having the same problem in no way identifies Google. It just means that PayPal has the same problem and they are also incompetent (and they also demonstrate their incompetence in many other ways).

WesolyKubeczek

4 months ago

s/in no way identifies Google/in no way indemnifies Google/

Sorry

zamalek

4 months ago

> Sorry

No, thank you.

derefr

4 months ago

> It just means that PayPal has the same problem and they are also incompetent

Do you consider regular brick-and-mortar savings banks to be incompetent when they freeze someone's personal account for receiving business amounts of money into it? Because they all do, every last one. Because, again, they expect you to open a business account if you're going to do business; and they look at anything resembling "business transactions" happening in a personal account through the lens of fraud rather than the lens of "I just didn't realize I should open a business account."

And nobody thinks this is odd, or out-of-the-ordinary.

Do you consider municipal governments to be incompetent when they tell people that they have to get their single-family dwelling rezoned as mixed-use, before they can conduct business out of it? Or for assuming that anyone who is conducting business (having a constant stream of visitors at all hours) out of a residentially-zoned property, is likely engaging in some kind of illegal business (drug sales, prostitution, etc) rather than just being a cafe who didn't realize you can't run a cafe on residential zoning?

If so, I don't think many people would agree with you. (Most would argue that municipal governments suppress real, good businesses by not issuing the required rezoning permits, but that's a separate issue.)

There being an automatic level of hair-trigger suspicion against you on the part of powerful bureaucracies — unless and until you proactively provide those bureaucracies enough information about yourself and your activities for the bureaucracies to form a mental model of your motivations that makes your actions predictable to them — is just part of living in a society.

Heck, it's just a part of dealing with people who don't know you. Anthropologists suggest that the whole reason we developed greeting gestures like shaking hands (esp. the full version where you pull each-other in and use your other arms to pat one-another on the back) is to force both parties to prove to the other that they're not holding a readied weapon behind their backs.

---

> Are you warned about the risks in an active war zone? Yes. Does Google warn you about this when you sign up? No.

As a neutral third party to a conflict, do you expect the parties in the conflict to warn you about the risks upon attempting to step into the war zone? Do you expect them to put up the equivalent of police tape saying "war zone past this point, do not cross"?

This is not what happens. There is no such tape. The first warning you get from the belligerents themselves of getting near either side's trenches in an active war zone, is running face-first into the guarded outpost/checkpoint put there to prevent flanking/supply-chain attacks. And at that point, you're already in the "having to talk yourself out of being shot" point in the flowchart.

It has always been the expectation that civilian settlements outside of the conflict zone will act of their own volition to inform you of the danger, and stop you from going anywhere near the front lines of the conflict. By word-of-mouth; by media reporting in newspapers and on the radio; by municipal governments putting up barriers preventing civilians from even heading down roads that would lead to the war zone. Heck, if a conflict just started "up the road", and you're going that way while everyone's headed back the other way, you'll almost always eventually be flagged to pull over by some kind stranger who realizes you might not know, and so wants to warn you that the only thing you'll get by going that way is shot.

---

Of course, this is all just a metaphor; the "war" between infrastructure companies and malicious actors is not the same kind of hot war with two legible "sides." (To be pedantic, it's more like the "war" between an incumbent state and a constant stream of unaffiliated domestic terrorists, such as happens during the ongoing only-partially-successful suppression of a populist revolution.)

But the metaphor holds: just like it's not a military's job to teach you that military forces will suspect that you're a spy if you approach a war zone in plainclothes; and just like it's not a bank's job to teach you that banks will suspect that you're a money launderer if you start regularly receiving $100k deposits into your personal account; and just like it's not a city government's job to teach you that they'll suspect you're running a bordello out of your home if you have people visiting your residentially-zoned property 24hrs a day... it's not Google's job to teach you that the world is full of people that try to abuse Internet infrastructure to illegal ends for profit; and that they'll suspect you're one of those people, if you just show up with your personal Google account and start doing some of the things those people do.

Rather, in all of these cases, it is the job of the people who teach you about life — parents, teachers, business mentors, etc — to explain to you the dangers of living in society. Knowing to not use your personal account for business, is as much a component of "web safety" as knowing to not give out details of your personal identity is. It's "Internet literacy", just like understanding that all news has some kind of bias due to its source is "media literacy."

zamalek

4 months ago

You may not be aware of this, but Paypal is unregulated. They can, and have, overreached. This is very different from a bank who has regulations to follow, some of which protect the consumer from the whims of the bank.

hosh

4 months ago

I appreciate this long comment.

I am in the middle of convincing the company I just joined to consider building on GCP instead of AWS (at the very least, not to default to AWS).

fragmede

4 months ago

If you can't figure out how to use a different Google account for YouTube from the GCP billing account, I don't know what to say. Google's in the wrong here, but spanner's good shit! (If you can afford it. and you actually need it. you probably don't.)

zamalek

4 months ago

The problem isn't specifically getting locked out of GCP (though it is likely to happen for those out of the loop on what happened). It is that Google themselves can't figure out that a social media ban shouldn't affect your business continuity (and access to email or what-have-you).

It is an extremely fundamental level of incompetence at Google. One should "figure out" the viability of placing all of one's eggs in the basket of such an incompetent partner. They screwed the authentication issue up and, this is no slippery slope argument, that means they could be screwing other things up (such as being able to contact a human for support, which is what the Terraria developer also had issues with).

kondro

4 months ago

One of those still isn’t us-east-1 though and email isn’t latency-bound.

hvb2

4 months ago

Except for OTP codes when doing 2fa in auth

kondro

4 months ago

100ms isn’t going to make a difference to email-based OTP.

Also, who’s using email-based OTP?

shermantanktop

4 months ago

Same calculation everyone makes but that doesn’t stop them from whining about AWS being less than perfect.

indoordin0saur

4 months ago

We have discussions coming up to evict ourselves from AWS entirely. Didn't seem like there was much of an appetite for it before this but now things might have changed. We're still small enough of a company to where the task isn't as daunting as it might otherwise be.

sleepybrett

4 months ago

So did a previous company I worked at. All our stuff was in us-west-2... then us-east-1 went down, and some global backend services that AWS depended on also went down and affected us-west-2.

I'm not sure a lot of companies are really looking at the costs of multi-region resiliency and hot failovers vs being down for 6 hours every year or so and writing that check.

DrBenCarson

4 months ago

Yep. Many, many companies are fine saying “we’re going to be no more available than AWS is.”

frankchn

4 months ago

Customers are generally a lot more understanding if half the internet goes down at the same time as you.

chrisweekly

4 months ago

Yes, and that's a major reason so many just use us-east-1.

lordnacho

4 months ago

Is there some reason why "global" services aren't replicated across regions?

I would think a lot of clients would want that.

JoshTriplett

4 months ago

> Is there some reason why "global" services aren't replicated across regions?

On AWS's side, I think us-east-1 is legacy infrastructure because it was the first region, and things have to be made replicable.

For others on AWS who aren't AWS themselves: because AWS outbound data transfer is exorbitantly expensive. I'm building on AWS, and AWS's outbound data transfer costs are a primary design consideration for potential distribution/replication of services.

me551ah

4 months ago

It is absolutely crazy how much AWS charges for data. Internet access in general has become much cheaper, and Hetzner gives unlimited traffic. I don't recall AWS ever decreasing prices for outbound data transfer.

Sanzig

4 months ago

I think there's two reasons: one, it makes them gobs of money. Two, it discourages customers from building architectures which integrate non-AWS services, because you have to pay the data transfer tax. This locks everyone in.

And yes, AWS' rates are highway robbery. If you assume $1500/mo for a 10 Gbps port from a transit provider, you're looking at $0.0005/GB with a saturated link. At a 25% utilization factor, still only $0.002/GB. AWS is almost 50 times that. And I guarantee AWS gets a far better rate for transit than list price, so their profit margin must be through the roof.
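
Back-of-the-envelope check of those figures (assuming a 30-day month and AWS's ~$0.09/GB list price for internet egress):

    # Rough $/GB for a flat-rate transit port vs. AWS egress list price.
    port_cost_per_month = 1500.0            # assumed $/month for a 10 Gbps port
    gb_per_second = 10 / 8                  # 10 Gbps = 1.25 GB/s
    seconds_per_month = 30 * 24 * 3600

    gb_at_full_util = gb_per_second * seconds_per_month   # ~3.24 million GB
    print(port_cost_per_month / gb_at_full_util)           # ~$0.00046/GB saturated
    print(port_cost_per_month / (0.25 * gb_at_full_util))  # ~$0.0019/GB at 25% util
    print(0.09 / (port_cost_per_month / (0.25 * gb_at_full_util)))  # AWS ~48x that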

JoshTriplett

4 months ago

> I think there's two reasons: one, it makes them gobs of money. Two, it discourages customers from building architectures which integrate non-AWS services, because you have to pay the data transfer tax. This locks everyone in.

Which makes sense, but even their rates for traffic between AWS regions are still exorbitant. $0.10/GB for transfer to the rest of the Internet somewhat discourages integration of non-Amazon services (though you can still easily integrate with any service where most of your bandwidth is inbound to AWS), but their rates for bandwidth between regions are still in the $0.01-0.02/GB range, which discourages replication and cross-region services.

If their inter-region bandwidth pricing was substantially lower, it'd be much easier to build replicated, highly available services atop AWS. As it is, the current pricing encourages keeping everything within a region, which works for some kinds of services but not others.

mnutt

4 months ago

Even their transfer rates between AZs _in the same region_ are expensive, given they presumably own the fiber?

This aligns with their “you should be in multiple AZs” sales strategy, because self-hosted and third-party services can’t replicate data between AZs without expensive bandwidth costs, while their own managed services (ElastiCache, RDS, etc) can offer replication between zones for free.

immibis

4 months ago

Hetzner is "unlimited fair use" for 1Gbps dedicated servers, which means their average cost is low enough to not be worth metering, but if you saturate your 1Gbps for a month they will force you to move to metered. Also 10Gbps is always metered. Metered traffic is about $1.50 per TB outbound - 60 times cheaper than AWS - and completely free within one of their networks, including between different European DCs.

In general it seems like Europe has the most internet of anywhere - other places generally pay to connect to Europe, Europe doesn't pay to connect to them.

zikduruqe

4 months ago

"Is there some reason why "global" services aren't replicated across regions?"

us-east-1 is so the government to slurp up all the data. /tin-foil hat

rhplus

4 months ago

Data residency laws may be a factor in some global/regional architectures.

lordnacho

4 months ago

So provide a way to check/uncheck which regions you want replication in. Most people aren't going to need more than a couple of alternatives, and they'll know which ones will work for them legally.

DevelopingElk

4 months ago

My guess is that for IAM it has to do with consistency and security. You don't want regions disagreeing on what operations are authorized. I'm sure the data store could be distributed, but there might be some bad latency tradeoffs.

The other concerns could have to do with the impact of failover to the backup regions.

belter

4 months ago

Regions disagree on what operations are authorized. :-) IAM uses eventual consistency. As it should...

"Changes that I make are not always immediately visible": - "...As a service that is accessed through computers in data centers around the world, IAM uses a distributed computing model called eventual consistency. Any changes that you make in IAM (or other AWS services), including attribute-based access control (ABAC) tags, take time to become visible from all possible endpoints. Some delay results from the time it takes to send data from server to server, replication zone to replication zone, and Region to Region. IAM also uses caching to improve performance, but in some cases this can add time. The change might not be visible until the previously cached data times out...

...You must design your global applications to account for these potential delays. Ensure that they work as expected, even when a change made in one location is not instantly visible at another. Such changes include creating or updating users, groups, roles, or policies. We recommend that you do not include such IAM changes in the critical, high availability code paths of your application. Instead, make IAM changes in a separate initialization or setup routine that you run less frequently. Also, be sure to verify that the changes have been propagated before production workflows depend on them..."

https://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoo...
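
A minimal sketch of that "verify before depending on it" advice, assuming boto3 and a hypothetical role name; a real setup routine would use backoff or the SDK's built-in waiters where available, and seeing the role at one endpoint still doesn't guarantee it's visible everywhere yet:

    import time
    import boto3
    from botocore.exceptions import ClientError

    iam = boto3.client("iam")

    def wait_for_role(role_name, attempts=30, delay=2.0):
        """Poll until a newly created IAM role is visible from this endpoint."""
        for _ in range(attempts):
            try:
                iam.get_role(RoleName=role_name)
                return True
            except ClientError as e:
                if e.response["Error"]["Code"] != "NoSuchEntity":
                    raise
                time.sleep(delay)
        return False

    # Hypothetical usage in a setup routine (not a hot path):
    # iam.create_role(RoleName="my-app-role", AssumeRolePolicyDocument=trust_json)
    # assert wait_for_role("my-app-role")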

bcrosby95

4 months ago

Global replication is hard, and if they weren't designed with that in mind it's probably a whole lot of work.

ineedasername

4 months ago

I thought part of the point of using AWS was that such things were pretty much turnkey?

DevelopingElk

4 months ago

Mostly AWS relies on each region being its own isolated copy of each service. It gets tricky when you have globalized services like IAM. AWS tries to keep those to a minimum.

oofbey

4 months ago

One advantage to being in the biggest region: when it goes down the headlines all blame AWS, not you. Sure you’re down too, but absolutely everybody knows why and few think it’s your fault.

nijave

4 months ago

For us, we had some minor impacts but most stuff was stable. Our bigger issue was 3rd-party SaaS also hosted in us-east-1 (Snowflake and CircleCI), which broke CI and our data pipeline.

Eridrus

4 months ago

This was a major issue, but it wasn't a total failure of the region.

Our stuff is all in us-east-1. Ops was a total shitshow today (mostly because many 3rd-party services besides AWS were down/slow), but our prod service was largely "ok": fewer than 5% of customers were significantly impacted, because existing instances got to keep running.

I think we got a bit lucky, but no actual SLAs were violated. I tagged the postmortem as Low impact despite the stress this caused internally.

We definitely learnt something here about both our software and our 3rd party dependencies.

throwaway-aws9

4 months ago

You have to remember that health status dashboards at most (all?) cloud providers require VP approval to switch status. This stuff is not your startup's automated status dashboard. It's politics, contracts, money.

hinkley

4 months ago

Which makes them a flat out lie since it ceases to be a dashboard if it’s not live. It’s just a status page.

PeterCorless

4 months ago

Downdetector had 5,755 reports of AWS problems at 12:52 AM Pacific (3:53 AM Eastern).

That number had dropped to 1,190 by 4:22 AM Pacific (7:22 AM Eastern).

However, that number is back up with a vengeance: 9,230 reports as of 9:32 AM Pacific (12:32 PM Eastern).

Part of that could be explained by more people making reports as the U.S. west coast awoke. But I also have a feeling that they aren't yet on top of the problem.

rogerrogerr

4 months ago

Where do they source those reports from? Always wondered if it was just analysis of how many people are looking at the page, or if humans somewhere are actually submitting reports.

SteveNuts

4 months ago

It turns out that a bunch of people checking if "XYZ is down" is a pretty good heuristic for it actually being down. It's pretty clever I think.

jedberg

4 months ago

It's both. They count a hit from Google as a report of that site being down. They also count the actual reports people make.

hunter2_

4 months ago

So if my browser auto-completes their domain name and I accept that (causing me to navigate directly to their site and then I click AWS) it's not a report; but if my browser doesn't or I don't accept it (because I appended "AWS" after their site name) causing me to perform a Google search and then follow the result to the AWS page on their site, it's a report? That seems too arbitrary... they should just count the fact that I went to their AWS page regardless of how I got to it.

jedberg

4 months ago

I don't know the exact details, but I know that hits to their website do count as reports, even if you don't click "report". I assume they weight it differently based on how you got there (direct might actually be more heavily weighted, at least it would be if I was in charge).

mjrpes

4 months ago

Down detector agrees: https://downdetector.com/status/amazon/

Amazon says service is now just "degraded" and recovering, but searching for products on Amazon.com still does not work for me. https://health.aws.amazon.com/health/status

ilamont

4 months ago

Search, Seller Central, Amazon Advertising not working properly for me. Attempting to access from New York.

When this is fixed, I am very interested in seeing recorded spend for Sunday and Monday.

belter

4 months ago

This looks like one of their worst outages in 15 years and us-east-1 still shows as degraded, but I had no outages, as I don't use us-east-1. Are you seeing issues in other regions?

https://health.aws.amazon.com/health/status?path=open-issues

The closest to their identification of a root cause seems to be this one:

"Oct 20 8:43 AM PDT We have narrowed down the source of the network connectivity issues that impacted AWS Services. The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers. We are throttling requests for new EC2 instance launches to aid recovery and actively working on mitigations."

hinkley

4 months ago

I wonder how many people discovered their autoscaling settings went batshit when services went offline, either scaling way down or way up, or went metastable and started fishtailing.

jread

4 months ago

Lambda create-function control plane operations are still failing with InternalError for us - other services have recovered (Lambda, SNS, SQS, EFS, EBS, and CloudFront). Cloud availability is the subject of my CS grad research, I wrote a quick post summarizing the event timeline and blast radius as I've observed it from testing in multiple AWS test accounts: https://www.linkedin.com/pulse/analyzing-aws-us-east-1-outag...

Forricide

4 months ago

Definitely seems to be getting worse, outside of AWS itself, more websites seem to be having sporadic or serious issues. Concerning considering how long the outage has been going.

busymom0

4 months ago

That's probably why Reddit has been down too

whaleofatw2022

4 months ago

Dangerous curiosity ask: is the number of folks off for Diwali a factor or not?

I.e. lots of folks that weren't expected to work today and/or trying to round them up to work the problem.

loudmax

4 months ago

Northern Virginia's Fairfax County public schools have the day off for Diwali, so that's not an unreasonable question.

In my experience, the teams at AWS are pretty diverse, reflecting the diversity in the area. Even if a lot of the Indian employees are taking the day off, there should be plenty of other employees to back them up. A culturally diverse employee base should mitigate against this sort of problem.

If it does turn out that the outage was prolonged due to one or two key engineers being unreachable for the holiday, that's an indictment of AWS for allowing these single points of failure to occur, not for hiring Indians.

dd_xplore

4 months ago

It's even worse if it was caused by American engineers, who aren't on holiday.

hinkley

4 months ago

Seems like a lot of people are missing that this post was made around midnight PST, and thus it would be more reasonable to ping people at lunch in IST before waking up people in EST or PST.

hinkley

4 months ago

More info is claiming the problem started around 9:15 the previous day, but brewed for a while. But that’s still after breakfast in IST.

hinkley

4 months ago

Sometimes I miss my phone buzzing when doing yard work. Diwali has to be worse for that.

junon

4 months ago

Seeing as how this is us-east-1, probably not a lot.

redeux

4 months ago

I believe the implication is that a lot of critical AWS engineers are of Indian descent and are off celebrating today.

herewulf

4 months ago

junon's implication may be that AWS engineers of Indian descent would tend to be located on the West Coast.

jmedefind

4 months ago

North Virginia has a very large Indian community.

All the schools in the area have days off for Indian Holidays since so many would be out of school otherwise.

hinkley

4 months ago

This broke in the middle of the day IST, did it not? Why would you start waking up people in VA at 3 in the morning if you don't have to?

AutoDunkGPT

4 months ago

I bet you haven't gotten an email back from AWS support during twilight hours before.

There are 153k Amazon employees based in India according to LinkedIn.

junon

4 months ago

Missing my point entirely.

hinkley

4 months ago

Then I missed it too, because I let my Indian coworkers handle production issues after 9 or 10pm unless the problem sounds an awful lot like the feature toggle I flipped on in production is setting servers on fire.

My main beef with that team was that we worked on too many stories in parallel, so information on brand-new work was siloed. Everyone caught up after a bit, but coverage was spotty for stuff we had just demoed or hadn’t demoed yet.

If I was up at 1 am it was because I had insomnia and figured out exactly what the problem was and it was faster to fix it than to explain. Or if I wake up really early and the problem is still not fixed.

napolux

4 months ago

Worst of all: the Ring alarm siren won't stop because the app is down, and the keypad was removed by my parents and put "somewhere in the basement".

bartread

4 months ago

Is it hard wired? If so, and if the alarm module doesn’t have an internal battery, can you go to the breaker box and turn off the circuit it’s on? You should be able to switch off each breaker in turn until it stops if you don’t know which circuit it’s on.

If it doesn’t stop, that means it has a battery backup. But you can still make life more bearable. Switch off all your breakers (you probably have a master breaker for this), then open up the alarm box and either pull the battery or - if it’s non-removable - take the box off the wall, put it in a sealed container, and put the sealed container somewhere… else. Somewhere you can’t hear it or can barely hear it until the battery runs down.

Meanwhile you can turn the power back on but make sure you’ve taped the bare ends of the alarm power cable, or otherwise electrically insulated them, until you’re able to reinstall it.

napolux

4 months ago

I'll keep it in mind, thx. I was lucky to find the keypad in the "this is the place where we put electronic shit" in the basement.

bartread

4 months ago

Nice. Well, whatever, I’m glad you managed to stop it from driving you up the wall.

Liftyee

4 months ago

I have a Ring alarm. It has a battery backup and is powered by AC adaptor, so no need to turn off entire circuits (but no easy silence). All the sensors I have are wireless (not sure if they offer wired).

I would honestly do your box option. Stuff it in there with some pillows and leave it in the shed for a while.

bartread

4 months ago

Yeah, we’ve got a bunch of Ring stuff but not the interior alarm so I wasn’t sure how it worked. I suspected it might have a battery backup and, in that case, desperate times -> desperate measures.

autophagian

4 months ago

Yeah. We had a brief window where everything resolved and worked and now we're running into really mysterious flakey networking issues where pods in our EKS clusters timeout talking to the k8s API.

cj00

4 months ago

Yeah, networking issues cleared up for a few hours but now seem to be as bad as before.

mvdtnz

4 months ago

The problems now seem mostly related to starting new instances. Our capacity is slowly decaying as existing services spin down and new EC2 workloads fail to start.

baubino

4 months ago

Basic services at my worksite have been offline for almost 8 hours now (things were just glitchy for about 4 hours before that). This is nuts.

indoordin0saur

4 months ago

Have not gotten a data pipeline to run to success since 9AM this morning when there was a brief window of functioning systems. Been incredibly frustrating seeing AWS tell the press that things are "effectively back to normal". They absolutely are not! It's still a full outage as far as we are concerned.

assholesRppl2

4 months ago

Yep, confirmed worse - DynamoDB now returning "ServiceUnavailableException"

claudiug

4 months ago

ServiceUnavailableException hello java :)

dutzi

4 months ago

Here as well…

JCM9

4 months ago

Agree… still seeing major issues. Briefly looked like it was getting better but things falling apart again.

tlogan

4 months ago

I noticed the same thing and it seems to have gotten much worse around 8:55 a.m. Pacific Time.

By the way, Twilio is also down, so all those login SMS verification codes aren’t being delivered right now.

wavemode

4 months ago

SEV-0 for my company this morning. We can't connect to RDS anymore.

jmuguy

4 months ago

Yeah, we were fine until about 10:30 Eastern and have been completely down since then. Heroku customer.

davedx

4 months ago

Andy Jassy is the Tim Cook of Amazon

Rest and vest CEOs

hinkley

4 months ago

Don’t insult Tim Cook like that.

He got a lot of impossible shit done as COO.

They do need a more product minded person though. If Jobs was still around we’d have smart jewelry by now. And the Apple Watch would be thin af.

perching_aix

4 months ago

In addition to those, Sagemaker also fails for me with an internal auth error specifically in Virginia. Fun times. Hope they recover by tomorrow.

steveBK123

4 months ago

Agreed, every time the impacted services list internally gets shorter, the next update it starts growing again.

A lot of these are second-order dependencies like Astronomer, Atlassian, Confluent, Snowflake, Datadog, etc... the joys of using hosted solutions for everything.

hinkley

4 months ago

Before my old company spun off, we didn’t know the old ops team had put on-prem production and our Atlassian instances in the same NAS.

When the NAS shit the bed, we lost half of production and all our run books. And we didn’t have autoscaling yet. Wouldn’t for another 2 years.

Our group is a bunch of people that has no problem getting angry and raising voices. The whole team was so volcanically angry that it got real quiet for several days. Like everyone knew if anyone unclenched that there would be assault charges.

jonplackett

4 months ago

The problem now is: what’s anyone going to do? Leave?

I remember a meme years ago about Nestle. It was something like: GO ON, BOYCOTT US - I BET YOU CAN’T - WE MAKE EVERYTHING.

Same meme would work for AWS today.

MaKey

4 months ago

> Same meme would work for AWS today.

Not really, there are enough alternatives.

jonplackett

4 months ago

How many of them just run on AWS underneath, though?

And it’s not like there aren’t other brands of chocolate either…

hinkley

4 months ago

It’s amazing how much you can avoid them by eating food that still looks like what it started as, though. They own a lot of processed food.

ljdtt

4 months ago

First time I've seen "fubar". Is that a common expression in the industry? Just curious (English is not my native language).

sorentwo

4 months ago

It is an old US military term that means “F*cked Up Beyond All Recognition”

dingnuts

4 months ago

FUBAR being a bit worse than SNAFU: "situation normal: all fucked up" which is the usual state of us-east-1

D-Coder

4 months ago

My favorite is JANFU: Joint Army-Navy Fuck-Up.

joeyphoen

4 months ago

But you probably have seen the standard example variable names "foo" and "bar", which (together at least) come from `fubar`.

sunnybeetroot

4 months ago

Which are in fact unrelated.

jameshart

4 months ago

Unclear. ‘Foo’ has a life and origin of its own and is well attested in MIT culture going back to the 1930s for sure, but it seems pretty likely that its counterpart ‘bar’ appears in connection with it as a comical allusion to FUBAR.

worik

4 months ago

Foobar == "Fucked up beyond all recognition "

Even the acronym is fucked.

My favorite by a large margin...

gregw2

4 months ago

Interestingly, it was "Fouled Up Beyond All Recognition" when it first appeared in print back towards the end of World War 2.

https://en.wikipedia.org/wiki/List_of_military_slang_terms#F...

Not to be confused with "Foobar" which apparently originated at MIT: https://en.wikipedia.org/wiki/Foobar

TIL, an interesting footnote about "foo" there:

'During the United States v. Microsoft Corp. trial, evidence was presented that Microsoft had tried to use the Web Services Interoperability organization (WS-I) as a means to stifle competition, including e-mails in which top executives including Bill Gates and Steve Ballmer referred to the WS-I using the codename "foo".[13]'

jameshart

4 months ago

What people would print and what soldiers would say in the 1940s were likely somewhat divergent.

vishnugupta

4 months ago

It used to be quite common but has fallen out of usage.

loudmax

4 months ago

"FUBAR" comes up in the movie Saving Private Ryan. It's not a plot point, but it's used to illustrate the disconnect between one of the soldiers dragged from a rear position to the front line, and the combat veterans in his squad. If you haven't seen the movie, you should. The opening 20 minutes contains one of the most terrifying and intense combat sequences ever put to film.

ibejoeb

4 months ago

Honestly not sure if this is a joke I'm not in on.

There are documented uses of FUBAR back into the '40s.

awalsh128

4 months ago

What do you mean? The movie's storyline takes place in '44 at the Battle of Normandy.

ibejoeb

4 months ago

I must've misread. I thought you said that it comes from the movie rather than comes up in the movie.

user

4 months ago

[deleted]

strictnein

4 months ago

FUBAR: Fucked Up Beyond All Recognition

Somewhat common. Comes from the US military in WW2.

nikolay

4 months ago

Choosing us-east-1 as your primary region is good, because when you're down, everybody's down, too. You don't get this luxury with other US regions!

rdhatt

4 months ago

One unexpected upside of moving from a DC to AWS is that when a region is down, customers are far more understanding. Instead of being upset, they often shrug it off since nothing else they needed/wanted was up either.

ta1243

4 months ago

It took me so long to realise this is what's important in enterprise. Uptime isn't important, being able to blame someone else is what's important.

If you're down for 5 minutes a year because one of your employees broke something, that's your fault, and the blame passes down through the CTO.

If you're down for 5 hours a year but this affected other companies too, it's not your fault.

From AWS to Crowdstrike - system resilience and uptime isn't the goal. Risk mitigation isn't the goal. Affordability isn't the goal.

When the CEO's buddies all suffer at the same time as he does, it's just an "act of god" and nothing can be done, it's such a complex outcome that even the amazing boffins at aws/google/microsoft/cloudflare/etc can't cope.

If the CEO is down at a different time than the CEO's buddies then it's that Dave/Charlie/Bertie/Alice can't cope and it's the CTO's fault for not outsourcing it.

As someone who likes to see things working, it pisses me off no end, but it's the way of the world, and likely has been whenever the owner and CTO are separate.

shermantanktop

4 months ago

A slightly less cynical view: execs have a hard filter for “things I can do something about” and “things I can’t influence at all.” The bad ones are constantly pushing problems into the second bucket, but there are legitimately gray area cases. When an exec smells the possibility that their team could have somehow avoided a problem, that’s category 1 and the hammer comes down hard.

After that process comes the BS and PR step, where reality is spun into a cotton candy that makes the leader look good no matter what.

sam1r

4 months ago

Sometimes we all need a tech shutdown.

Gigachad

4 months ago

I was once told that our company went with Azure because when you tell the boomer client that our service is down because Microsoft had an outage, they go from being mad at you, to accepting that the outage was an act of god that couldn’t be avoided.

Sparkyte

4 months ago

I am down with that, let's all build in us-east-1.

kelseydh

4 months ago

Is us-east-1 equally as unstable as the other regions? My impression was that Amazon deploys changes to us-east-1 first, so it's the most unstable region.

ej_campbell

4 months ago

And all your dependencies are co-located.

tokioyoyo

4 months ago

Doing pretty well up here in Tokyo region for now! Just can't log into console and some other stuff.

stepri

4 months ago

“Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1. We are working on multiple parallel paths to accelerate recovery.”

It’s always DNS.

Nextgrid

4 months ago

I wonder how much of this is "DNS resolution" vs "underlying config/datastore of the DNS server is broken". I'd expect the latter.

koliber

4 months ago

It's always US-EAST-1 :)

shamil0xff

4 months ago

Might just be BGP dressed as DNS

bayindirh

4 months ago

Even when it's not DNS, it's DNS.

nijave

4 months ago

I don't think that's necessarily true. The outage updates later identified failing network load balancers as the cause--I think DNS was just a symptom of the root cause

I suppose it's possible DNS broke health checks but it seems more likely to be the other way around imo

commandersaki

4 months ago

Someone probably failed to lint the zone file.

us0r

4 months ago

Or expired domains which I suppose is related?

dexterdog

4 months ago

That's why they wrote the haiku

indycliff

4 months ago

the answer is always DNS

mlhpdx

4 months ago

Cool, building in resilience seems to have worked. Our static site has origins in multiple regions via CloudFront and didn’t seem to be impacted (not sure if it would have been anyway).

My control plane is native multi-region, so while it depends on many impacted services it stayed available. Each region runs in isolation. There is data replication at play but failing to replicate to us-east-1 had no impact on other regions.

The service itself is also native multi-region and has multiple layers where failover happens (DNS, routing, destination selection).

Nothing’s perfect and there are many ways this setup could fail. It’s just cool that it worked this time - great to see.

Nothing I’ve done is rocket science or expensive, but it does require doing things differently. Happy to answer questions about it.
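
For anyone curious what the DNS layer of that failover can look like, here's a minimal sketch using Route 53 failover record sets (all names and IDs below are hypothetical). Worth noting that while the Route 53 control plane lives in us-east-1, the query-answering data plane is globally distributed, which is what you're leaning on during an event like this:

    import boto3

    route53 = boto3.client("route53")

    # The PRIMARY answer requires a passing health check; Route 53 serves the
    # SECONDARY automatically when the primary is reported unhealthy.
    route53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",
        ChangeBatch={"Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
                "SetIdentifier": "primary-us-east-1",
                "Failover": "PRIMARY",
                "HealthCheckId": "11111111-2222-3333-4444-555555555555",
                "ResourceRecords": [{"Value": "api-use1.example.com"}]}},
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
                "SetIdentifier": "secondary-us-west-2",
                "Failover": "SECONDARY",
                "ResourceRecords": [{"Value": "api-usw2.example.com"}]}},
        ]},
    )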

SteveNuts

4 months ago

> Our static site has origins in multiple regions via CloudFront and didn’t seem to be impacted

This seems like such a low bar for 2025, but here we are.

AndrewKemendo

4 months ago

How did you do resilient auth for keys and certs?

zild3d

4 months ago

Active/active? Curious what the data stack looks like, as that tends to be the hard part.

chibea

4 months ago

One main problem that we observed was that big parts of their IAM / auth setup were overloaded or down, which led to all kinds of cascading problems. It sounds as if DynamoDB was reported to be a root cause, so is IAM dependent on DynamoDB internally?

Of course, such a large control plane system has all kinds of complex dependency chains. Auth/IAM seems like such a potential (global) SPOF that you'd like to reduce its dependencies to an absolute minimum. On the other hand, it's also the place that needs really good scalability, consistency, etc., so you'd probably like to use the battle-proven DB infrastructure you already have in place. Does that mean you end up with a complex cyclic dependency that needs complex bootstrapping when it goes down? Or how is that handled?

julianozen

4 months ago

There was a very large outage back in ~2017 that was caused by DynamoDB going down. Because EC2 stored its list of servers in DynamoDB, EC2 went down too. Because DynamoDB ran its compute on EC2, it was suddenly no longer able to spin up new instances to recover.

It took several days to manually spin up DynamoDB/EC2 instances so that both services could recover slowly together. Since then, there was a big push to remove dependencies between the “tier one” systems (S3, DynamoDB, EC2, etc.) so that one system couldn’t bring down another one. Of course, it’s never foolproof.

cyberax

4 months ago

When I worked at AWS several years ago, IAM was not dependent on Dynamo. It might have changed, but I highly doubt this. Maybe some kind of network issue with high-traffic services?

> Auth/IAM seems like such a potentially (global) SPOF that you'd like to reduce dependencies to an absolute minimum.

IAM is replicated, so each region has its own read-only IAM cache. AWS SigV4 is also designed to be regionalized, if you ever wondered why the signature key derivation has many steps, that's exactly why ( https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_s... ).
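
For the curious, the regionalization shows up directly in the documented SigV4 signing-key derivation: the date, region, and service are folded into the derived key, so a derived key is only useful for that one region/service/day. A quick sketch with a placeholder secret:

    import hashlib
    import hmac

    def _hmac(key, msg):
        return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

    def sigv4_signing_key(secret_key, date_stamp, region, service):
        """Standard SigV4 key derivation chain (see the linked AWS docs)."""
        k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
        k_region = _hmac(k_date, region)
        k_service = _hmac(k_region, service)
        return _hmac(k_service, "aws4_request")

    # Placeholder credentials, just to show the shape of the call:
    key = sigv4_signing_key("EXAMPLE_SECRET_KEY", "20251020", "us-east-1", "dynamodb")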

cowsandmilk

4 months ago

Many AWS customers have bad retry policies that will overload other systems as part of their retries. DynamoDB being down will cause them to overload IAM.

wwdmaxwell

4 months ago

I think Amazon uses an internal platform called Dynamo as a KV store; it's different from DynamoDB, so I'm thinking the outage could be either a DNS routing issue or some kind of node deployment problem.

Both of which seem to crop up in post mortems for these widespread outages.

devmor

4 months ago

I find it very interesting that this is the same issue that took down GCP recently.

sammy2255

4 months ago

Can't resolve any records for dynamodb.us-east-1.amazonaws.com

However, if you desperately need to access it you can force resolve it to 3.218.182.212. Seems to work for me. DNS through HN

curl -v --resolve "dynamodb.us-east-1.amazonaws.com:443:3.218.182.212" https://dynamodb.us-east-1.amazonaws.com/

XCSme

4 months ago

It's always DNS

planckscnst

4 months ago

There's also dynamodb-fips.us-east-1.amazonaws.com if the main endpoint is having trouble. I'm not sure if this record was affected the same way during this event.

rstupek

4 months ago

thank you for that info!!!!!

sam1r

4 months ago

Dude!! Life saver.

emrodre

4 months ago

Their status page (https://health.aws.amazon.com/health/status) says the only disrupted service is DynamoDB, but it's impacting 37 other services. It is amazing to see how big a blast radius a single service can have.

jamesbelchamber

4 months ago

It's not surprising that it's impacting other services in the region because DynamoDB is one of those things that lots of other services build on top of. It is a little bit surprising that the blast radius seems to extend beyond us-east-1, mind.

In the coming hours/days we'll find out if AWS still have significant single points of failure in that region, or if _so many companies_ are just not bothering to build in redundancy to mitigate regional outages.

I'm looking forward to the RCA!

thmpp

4 months ago

AWS engineers are trained to use their internal services for each new system. They seem to like using DynamoDB. Dependencies like this should be made transparent.

nevada_scout

4 months ago

It's now listing 58 impacted services, so the blast radius is growing it seems

littlecranky67

4 months ago

The same page now says 58 services - just 23 minutes after your post. Seems this is becoming a larger issue.

Aachen

4 months ago

Signal is down from several vantage points and accounts in Europe, I'd guess because of this dependence on Amazon overseas

We're having fun figuring out how to communicate amongst colleagues now! It's only when it's gone that you realise your dependence.

JCM9

4 months ago

At 3:03 AM PT AWS posted that things are recovering and sounded like issue was resolved.

Then things got worse. At 9:13 AM PT it sounds like they’re back to troubleshooting.

Honestly sounds like AWS doesn’t even really know what’s going on. Not good.

melozo

4 months ago

Even internal Amazon tooling is impacted greatly - including the internal ticketing platform which is making collaboration impossible during the outage. Amazon is incapable of building multi-region services internally. The Amazon retail site seems available, but I’m curious if it’s even using native AWS or is still on the old internal compute platform. Makes me wonder how much juice this company has left.

fairity

4 months ago

As this incident unfolds, what’s the best way to estimate how many additional hours it’s likely to last? My intuition is that the expected remaining duration increases the longer the outage persists, but that would ultimately depend on the historical distribution of similar incidents. Is that kind of data available anywhere?

kuon

4 months ago

I realize that my basement servers have better uptime than AWS this year!

I think most sysadmins don't plan for an AWS outage. And economically that makes sense.

But it makes me wonder, is sysadmin a lost art?

kalleboo

4 months ago

It's fun watching their list of "Affected Services" grow literally in front of your eyes as they figure out how many things have this dependency.

It's still missing the one that earned me a phone call from a client.

0x002A

4 months ago

As Amazon moves from the day-1 company it once claimed to be toward a sales company like Oracle focused on raking in money, expect more outages to come, and longer times to resolution.

Amazon is burning out and driving away technical talent and knowledge, knowing the vendor lock-in will keep bringing in the sweet money. You will see more salespeople hovering around your C-suites and executives, while you face ever worse technical support that doesn't seem to know what it's talking about, let alone fix the support issue you expect to be fixed easily.

Mark my words: if you are putting your eggs in one basket, that basket is now too complex and too interdependent, and the people who built and knew those intricacies have been driven away by RTO and moves to hubs. Eventually the services that everything else (including AWS services themselves) depends on may turn out to be more fragile than the public knows.

rwky

4 months ago

To everyone that got paged (like me), grab a coffee and ride it out, the week can only get better!

ivad

4 months ago

Seems to have taken down my router "smart wifi" login page, and there's no backup router-only login option! Brilliant work, linksys....

ibejoeb

4 months ago

This is just a silly anecdote, but every time a cloud provider blips, I'm reminded. The worst architecture I've ever encountered was a system that was distributed across AWS, Azure, and GCP. Whenever any one of them had a problem, the system went down. It also cost 3x more than it should.

abujazar

4 months ago

I find it interesting that AWS services appear to be so tightly integrated that when there's an issue in a region, it affects most or all services. Kind of defeats the purported resiliency of cloud services.

rirze

4 months ago

We just had a power outage in Ashburn starting at 10 pm Sunday night. It restored at 3:40am ish, and I know datacenters have redundant power sources but the timing is very suspicious. The AWS outage supposedly started at midnight

tonypapousek

4 months ago

Looks like they’re nearly done fixing it.

> Oct 20 3:35 AM PDT

> The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution. Additionally, some services are continuing to work through a backlog of events such as Cloudtrail and Lambda. While most operations are recovered, requests to launch new EC2 instances (or services that launch EC2 instances such as ECS) in the US-EAST-1 Region are still experiencing increased error rates. We continue to work toward full resolution. If you are still experiencing an issue resolving the DynamoDB service endpoints in US-EAST-1, we recommend flushing your DNS caches. We will provide an update by 4:15 AM, or sooner if we have additional information to share.

Waterluvian

4 months ago

I know there's a lot of anecdotal evidence and some fairly clear explanations for why `us-east-1` can be less reliable. But are there any empirical studies that demonstrate this? Like if I wanted to back up this assumption/claim with data, is there a good link for that, showing that us-east-1 is down a lot more often?

mittermayr

4 months ago

Careful: NPM _says_ they're up (https://status.npmjs.org/) but I am seeing a lot of packages not updating and npm install taking forever or never finishing. So hold off deploying now if you're dependent on that.

rsanheim

4 months ago

I wonder what kind of outage or incident or economic change will be required to cause a rejection of the big commercial clouds as the default deployment model.

The costs, performance overhead, and complexity of a modern AWS deployment are insane and so out of line with what most companies should be taking on. But hype + microservices + sunk cost, and here we are.

weberer

4 months ago

Llama-5-beelzebub has escaped containment. A special task force has been deployed to the Virginia data center to pacify it.

JCM9

4 months ago

US-East-1 is more than just a normal region. It also provides the backbone for other services, including those in other regions. Thus simply being in another region doesn’t protect you from the consistent us-east-1 shenanigans.

AWS doesn’t talk about that much publicly, but if you press them they will admit in private that there are some pretty nasty single points of failure in the design of AWS that can materialize if us-east-1 has an issue. Most people would say that means AWS isn’t truly multi-region in some areas.

Not entirely clear yet if those single points of failure were at play here, but risk mitigation isn’t as simple as just “don’t use us-east-1” or “deploy in multiple regions with load balancing failover.”

JPKab

4 months ago

The length and breadth of this outage has caused me to lose so much faith in AWS. I knew from colleagues who used to work there how understaffed and inefficient the team is due to bad management, but this just really concerns me.

rose-knuckle17

4 months ago

AWS had an outage. Many companies were impacted. Headlines around the world blame AWS. The real news is how easy it is to identify companies that have put cost management ahead of service resiliency.

Lots of orgs operating wholly in AWS, and sometimes only within us-east-1, had no operational problems last night. Some of that is design (not using the impacted services). Some of that is good resiliency in design. And some of that was dumb luck (accidentally good design).

Overall, the companies that had operational problems likely wouldn't have invested in resiliency in any other deployment strategy either. It could have happened to them in Azure, GCP or even a home-rolled datacenter.

pjmlp

4 months ago

It just goes to show the difference between best practices in cloud computing, and what everyone ends up doing in reality, including well known industry names.

Aldipower

4 months ago

My minor 2000-user web app hosted on Hetzner works, FYI. :-P

runako

4 months ago

Even though us-east-1 is the region geographically closest to me, I always choose another region as default due to us-east-1 (seemingly) being more prone to these outages.

Obviously, some services are only available in us-east-1, but many applications can gain some resiliency just by making a primary home in any other region.

me551ah

4 months ago

We created a single point of failure on the Internet, so that companies could avoid single points of failure in their data centers.

jacquesm

4 months ago

Every week or so we interview a company and ask them if they have a fall-back plan in case AWS goes down or their cloud account disappears. They always have this deer-in-the-headlights look. 'That can't happen, right?'

Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.

esskay

4 months ago

Er...They appear to have just gone down again.

1970-01-01

4 months ago

Someone, somewhere, had to report that doorbells went down because the very big cloud did not stay up.

I think we're doing the 21st century wrong.

o1o1o1

4 months ago

I'm so happy we chose Hetzner instead but unfortunately we also use Supabase (dashboard affected) and Resend (dashboard and email sending affected).

Probably makes sense to add "relies on AWS" to the criteria we're using to evaluate 3rd-party services.

amadeoeoeo

4 months ago

Oh no... may be LaLiga found out pirates hosting on AWS?

philipp-gayret

4 months ago

Our Alexas stopped responding and my girl couldn't log in to MyFitnessPal anymore... Let me check HN for a major outage, and here we are :^)

At least when us-east is down, everything is down.

tedk-42

4 months ago

Internet, out.

Very big day for an engineering team indeed. Can't vibe code your way out of this issue...

stavros

4 months ago

AWS truly does stand for "All Web Sites".

jjice

4 months ago

We got off pretty easy (so far). Had some networking issues at 3am-ish EDT, but nothing that we couldn't retry. Having a pretty heavily asynchronous workflow really benefits here.

One strange one was metrics capturing for Elasticache was dead for us (I assume Cloudwatch is the actual service responsible for this), so we were getting no data alerts in Datadog. Took a sec to hunt that down and realize everything was fine, we just don't have the metrics there.

I had minor protests against us-east-1 about 2.5 years ago, but it's a bit much to deal with now... Guess I should protest a bit louder next time.

igleria

4 months ago

funny that even if we have our app running fine in AWS europe, we are affected as developers because of npm/docker/etc being down. oh well.

ksajadi

4 months ago

A lot of status pages hosted by Atlassian Statuspage are down! The irony…

Isuckatcode

4 months ago

Man, I just wanted to enjoy celebrating Diwali with my family, but I've been up since 3am trying to recover our services. There goes some quality time.

colesantiago

4 months ago

It seems that all the sites that ask about distributed systems in their interviews and have their website down wouldn't even pass their own interview.

This is why distributed systems is an extremely important discipline.

comrade1234

4 months ago

I like that we can advertise to our customers that over the last X years we have better uptime than Amazon, Google, etc.

bob1029

4 months ago

One thing has become quite clear to me over the years. Much of the thinking around uptime of information systems has become hyperbolic and self-serving.

There are very few businesses that genuinely cannot handle an outage like this. The only examples I've personally experienced are payment processing and semiconductor manufacturing. A severe IT outage in either of these businesses is an actual crisis. Contrast with the South Korean government who seems largely unaffected by the recent loss of an entire building full of machines with no backups.

I've worked in a retail store that had a total electricity outage and saw virtually no reduction in sales numbers for the day. I have seen a bank operate with a broken core system for weeks. I have never heard of someone actually cancelling a subscription over a transient outage in YouTube, Spotify, Netflix, Steam, etc.

The takeaway I always have from these events is that you should engineer your business to be resilient to the real tradeoff that AWS offers. If you don't overreact to the occasional outage and have reasonable measures to work around for a day or 2, it's almost certainly easier and cheaper than building a multi cloud complexity hellscape or dragging it all back on prem.

Thinking in terms of competition and game theory, you'll probably win even if your competitor has a perfect failover strategy. The cost of maintaining a flawless eject button for an entire cloud is like an anvil around your neck. Every IT decision has to be filtered through this axis. When you can just slap another EC2 on the pile, you can run laps around your peers.

d_burfoot

4 months ago

I think AWS should use, and provide as an offering to big customers, a Chaos Monkey tool that randomly brings down specific services in specific AZs. Example: DynamoDB is down in us-east-1b. IAM is down in us-west-2a.

Other AWS services should be able to survive this kind of interruption by rerouting requests to other AZs. Big company clients might also want to test against these kinds of scenarios.
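
(AWS does ship a managed fault-injection service these days, but the game-day idea is easy to sketch even without it; the service and AZ names below are purely illustrative, not a real API:)

    import random

    # Hypothetical inventory for a staging account; names are illustrative only.
    SERVICES = ["dynamodb", "iam", "sqs", "lambda"]
    AZS = ["us-east-1a", "us-east-1b", "us-east-1c"]

    def pick_fault(rng=random):
        """Choose one service in one AZ to 'break' for a game day."""
        return {"service": rng.choice(SERVICES), "az": rng.choice(AZS)}

    def inject_fault(fault):
        # In practice this would flip a feature flag, blackhole a subnet,
        # or drive a managed fault-injection tool; stubbed out here.
        print(f"Simulating outage of {fault['service']} in {fault['az']}")

    if __name__ == "__main__":
        inject_fault(pick_fault())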

padjo

4 months ago

Friends don’t let friends use us-east-1

polaris64

4 months ago

It looks like DNS has been restored: dynamodb.us-east-1.amazonaws.com. 5 IN A 3.218.182.189

__alexs

4 months ago

Is there any data on which AWS regions are most reliable? I feel like every time I hear about an AWS outage it's in us-east-1.

cmiles8

4 months ago

US-East-1 and its consistent problems are literally the Achilles Heel of the Internet.

goinggetthem

4 months ago

This is from Amazon's latest earnings call, when Andy Jassy was asked why they aren't growing as much as their competitors:

"I think if you look at what matters to customers, what they care they care a lot about what the operational performance is, you know, what the availability is, what the durability is, what the latency and throughput is of of the various services. And I think we have a pretty significant advantage in that area." also "And, yeah, you could just you just look at what's happened the last couple months. You can just see kind of adventures at some of these players almost every month. And so very big difference, I think, in security."

chermi

4 months ago

Stupid question: why isn't the stock down? Couldn't this lead to people jumping to other providers, and at the very least incur some pretty big fees for so dramatically breaking SLAs? Is it just not a big enough fraction of revenue to matter?

helsinkiandrew

4 months ago

> The incident underscores the risks associated with the heavy reliance on a few major cloud service providers.

Perhaps for the internet as a whole, but for each individual service it underscores the risk of not hosting your service in multiple zones or having a backup

shakesbeard

4 months ago

Slack (canvas and huddles), Circle CI and Bitbucket are also reporting issues due to this.

shinycode

4 months ago

It’s that period of the year when we discover AWS clients that don’t have fallback plans

artyom

4 months ago

Amazon has spent most of its HR post-pandemic efforts in:

• Laying off top US engineering earners.

• Aggressively mandating RTO so the senior technical personnel would be pushed to leave.

• Other political ways ("Focus", "Below Expectations") to push engineering leadership (principal engineers, etc) to leave, without it counting as a layoff of course.

• Terminating highly skilled engineering contractors everywhere else.

• Migrating serious, complex workloads to entry-level employees in cheap office locations (India, Spain, etc).

This push was slow but mostly completed by Q1 this year. Correlation doesn't imply causation? I find that hard to believe in this case. AWS had outages before, but none like this "apparently nobody knows what to do" one.

Source: I was there.

czhu12

4 months ago

Our entire data stack (Databricks and Omni) is down for us as well. The nice thing is that AWS is so big and widespread that our customers are much more understanding about outages, given that it's showing up on the news.

hobo_mark

4 months ago

When did Snapchat move out of GCP?

greatgib

4 months ago

When I follow the link, I arrive on a "You broke reddit" page :-o

ctbellmar

4 months ago

Various AI services (e.g. Perplexity) are down as well

bootsmann

4 months ago

Apparently hiring 1000s of software engineers every month was load bearing

renatovico

4 months ago

Docker Hub or GitHub's internal cache may be affected:

    Booting builder
    /usr/bin/docker buildx inspect --bootstrap --builder builder-1c223ad9-e21b-41c7-a28e-69eea59c8dac
    #1 [internal] booting buildkit
    #1 pulling image moby/buildkit:buildx-stable-1
    #1 pulling image moby/buildkit:buildx-stable-1 9.6s done
    #1 ERROR: received unexpected HTTP status: 500 Internal Server Error
    ------
    > [internal] booting buildkit:
    ------
    ERROR: received unexpected HTTP status: 500 Internal Server Error

amai

4 months ago

The internet was once designed to survive a nuclear war. Nowadays it cannot even survive until Tuesday.

saejox

4 months ago

AWS has been the backbone of the internet. It is a single point of failure for most websites.

Other hosting services like Vercel, package managers like npm, and even the Docker registries are down because of it.

mumber_typhoon

4 months ago

>Oct 20 12:51 AM PDT We can confirm increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. This issue may also be affecting Case Creation through the AWS Support Center or the Support API. We are actively engaged and working to both mitigate the issue and understand root cause. We will provide an update in 45 minutes, or sooner if we have additional information to share.

Weird that case creation runs in the same region as the one you'd like to open a case about.

amai

4 months ago

We are on Azure. But our CI/CD pipelines are failing, because Docker is on AWS.

mcphage

4 months ago

It shouldn’t, but it does. As a civilization, we’ve eliminated resilience wherever we could, because it’s more cost-effective. Resilience is expensive. So everything is resting on a giant pile of single point of failures.

Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.

tonymet

4 months ago

I don't think blaming AWS is fair, since they typically exceed their regional and AZ SLAs

AWS makes their SLAs & uptime rates very clear, along with explicit warnings about building failover / business continuity.

Most of the questions on the AWS CSA exam are related to resiliency.

Look, we've all gone the lazy route and done this before. As usual, the problem exists between the keyboard and the chair.

ta1243

4 months ago

Paying for resilience is expensive. Not as expensive as AWS, but it's not free.

Modern companies live life on the edge. Just in time, no resilience, no flexibility. We see the disaster this causes whenever something unexpected happens - the Ever Given blocking the Suez Canal for example, let alone something like Covid.

However, increasingly, what should be a minor loss of resilience, like an AWS outage or a Crowdstrike incident, turns into major failures.

This fragility is something government needs to legislate to prevent. When one supermarket is out that's fine - people can go elsewhere, the damage is contained. When all fail, that's a major problem.

On top of that, the attitude the entire sector has is also bad. People think IT should fail once or twice a year and it's not a problem. If that attitude reaches truly important systems it will lead to major civil problems. Any civilisation is 3 good meals away from anarchy.

There's no profit motive to avoid this: companies don't care about being offline for the day, as long as all their mates are also offline.

bigbuppo

4 months ago

Whose idea was it to make the whole world dependent on us-east-1?

Ekaros

4 months ago

Wasn't the point of AWS being such a premium that you would always get at least six nines of availability, if not more?

rwke

4 months ago

With more and more parts of our lives depending on often only one cloud infrastructure provider as a single point of failure, enabling companies to have built-in redundancy in their systems across the world could be a great business.

Humans have built-in redundancy for a reason.

JCM9

4 months ago

US-East-1 is literally the Achilles Heel of the Internet.

mrbluecoat

4 months ago

> due to an "operational issue" related to DNS

Always DNS..

t1234s

4 months ago

Do events like this stir conversations in small to medium size businesses to escape the cloud?

binsquare

4 months ago

The internal disruption reviews are going to be fun :)

thomas_witt

4 months ago

Seems to be really only in us-east-1, DynamoDB is performing fine in production on eu-central-1.

nodesocket

4 months ago

Affecting Coinbase[1] as well, which is ridiculous. Can't access the web UI at all. At their scale and importance they should be multi-region if not multi-cloud.

[1] https://status.coinbase.com

jug

4 months ago

Of course this happens when I take a day off from work lol

Came here after the Internet felt oddly "ill" and even got issues using Medium, and sure enough https://status.medium.com

sitzkrieg

4 months ago

it is very funny to me that us-east-1 going down nukes the internet. all those multiple region reliability best practices are for show

michaelcampbell

4 months ago

Anthem Health call center disconnected my wife numerous times yesterday with an ominous robo-message of "Emergency in our call center"; curious if that was this. Seems likely, but what a weird message.

neuroelectron

4 months ago

Sounds like a circular error: monitoring is flooding their network with metrics and logs, causing DNS to fail and produce more errors, which flood the network further. The likely root cause is something like DNS conflicts or hosts being recreated on the network. Generally this is a small amount of network traffic, but the LBs are dealing with host-address flux, causing hosts to keep colliding on addresses as they attempt to resolve to new ones that get lost to dropped packets; with so many hosts in one AZ, there's a good chance they end up with yet another conflicting address.

user

4 months ago

[deleted]

mentalgear

4 months ago

> Amazon Alexa: routines like pre-set alarms were not functioning.

It's ridiculous how everything is being stored in the cloud, even simple timers. It's past high time to move functionality back on-device, which would come with the advantage of making it easier to disconnect from big tech's capitalist surveillance state as well.

werdl

4 months ago

Looks like a DNS issue - dynamodb.us-east-1.amazonaws.com is failing to resolve.

geye1234

4 months ago

Potentially-ignoramus comment here, apologies in advance, but amazon.com itself appears to be fine right now. Perhaps slower to load pages, by about half a second. Are they not eating (much of) their own dog food?

CTDOCodebases

4 months ago

I'm getting rate limit issues on Reddit so it could be related.

menomatter

4 months ago

What are the design best practices and industry standards for building on-premise fallback capabilities for critical infrastructure? Say for healthcare, banking, etc.

renegade-otter

4 months ago

If we see more of this, it would not be crazy to assume that all this compelling of engineers to "use AI" and the flood of Looks Good To Me code is coming home to roost.

skywhopper

4 months ago

There are plenty of ways to address this risk. But the companies impacted would have to be willing to invest in the extra operational cost and complexity. They aren’t.

altbdoor

4 months ago

Had a meeting where developers were discussing the infrastructure for an application. A crucial part of the whole flow was completely dependant on an AWS service. I asked if it was a single point of failure. The whole room laughed, I rest my case.

seviu

4 months ago

I can't log in to my AWS account, in Germany, on top of that it is not possible to order anything or change payment options from amazon.de.

No landing page explaining services are down, just scary error pages. I thought account was compromised. Thanks HN for, as always, being the first to clarify what's happening.

Scary to see that in order to order from Amazon Germany, us-east1 must be up. Everything else works flawlessly but payments are a no go.

raw_anon_1111

4 months ago

From the great Corey Quinn

Ah yes, the great AWS us-east-1 outage.

Half the internet’s on fire, engineers haven’t slept in 18 hours, and every self-styled “resilience thought leader” is already posting:

“This is why you need multi-cloud, powered by our patented observability synergy platform™.”

Shut up, Greg.

Your SaaS product doesn’t fix DNS, you're simply adding another dashboard to watch the world burn in higher definition.

If your first reaction to a widespread outage is “time to drive engagement,” you're working in tragedy tourism. Bet your kids are super proud.

Meanwhile, the real heroes are the SREs duct-taping Route 53 with pure caffeine and spite.

https://www.linkedin.com/posts/coquinn_aws-useast1-cloudcomp...

DanHulton

4 months ago

I forget where I read it originally, but I strongly feel that AWS should offer a `us-chaos-1` region, where every 3-4 days, one or two services blow up. Host your staging stack there and you build real resiliency over time.

(The counter joke is, of course, "but that's `us-east-1` already!" But I mean deliberately and frequently.)

jamesbelchamber

4 months ago

This website just seems to be an auto-generated list of "things" with a catchy title:

> 5000 Reddit users reported a certain number of problems shortly after a specific time.

> 400000 A certain number of reports were made in the UK alone in two hours.

mannyv

4 months ago

This is why we use us-east-2.

munchlax

4 months ago

Nowadays when this happens it's always something. "Something went wrong."

Even the error message itself is wrong whenever that one appears.

comp_throw7

4 months ago

We're seeing issues with RDS proxy. Wouldn't be surprised if a DNS issue was the cause, but who knows, will wait for the postmortem.

twistedpair

4 months ago

Wow, about 9 hours later and 21 of 24 Atlassian services are still showing up as impacted on their status page.

Even @ 9:30am ET this morning, after this supposedly was clearing up, my doctor's office's practice management software was still hosed. Quite the long tail here.

https://status.atlassian.com/

0xbadcafebee

4 months ago

We never went down in us-east-1 during this incident. We have tons of high-traffic sites/services. Not multi-region, not multi-cloud.

You're gonna hear mostly complaints in this thread, but simple, resilient, single-region architecture is still reliable as hell in AWS, even in the worst region.

fujigawa

4 months ago

Appears to have also disabled that bot on HN that would be frantically posting [dupe] in all the other AWS outage threads right about now.

AtNightWeCode

4 months ago

Considering the history of east-1 it is fascinating that it still causes so many single point of failure incidents for large enterprises.

socalgal2

4 months ago

Amazon itself appears to be out for some products. I get a "Sorry, we couldn't find that page" when clicking on products.

the-chitmonger

4 months ago

I'm not sure if this is directly related, but I've noticed my Apple Music app has stopped working (getting connection error messages). Didn't realize the data for Music was also hosted on AWS, unless this is entirely unrelated? I've restarted my phone and rebooted the app to no avail, so I'm assuming this is the culprit.

ngruhn

4 months ago

Can't login to Jira/Confluence either.

aeon_ai

4 months ago

It's not DNS

There's no way it's DNS

It was DNS

moralestapia

4 months ago

Curious to know how much an outage like this costs others.

Lost data, revenue, etc.

I'm not talking about AWS but whoever's downstream.

Is it like 100M, like 1B?

itqwertz

4 months ago

Did they try asking Claude to fix these issues? If it turns out this problem is AI-related, I'd love to see the AAR.

nullorempty

4 months ago

It won't be over until long after AWS resolves it - the outage produces hours of inconsistent data. It especially sucks for financial services, things relying on eventual consistency, and other non-transactional processes. Some of the inconsistencies introduced today will linger and cause trouble for years.

karel-3d

4 months ago

Slack is down. Is that related? Probably is.

raspasov

4 months ago

02:34 Pacific: Things seem to be recovering.

rickette

4 months ago

Couple of years ago us-east was considered the least stable region here on HN due to its age. Is that still a thing?

lexandstuff

4 months ago

Yes, we're seeing issues with Dynamo, and potentially other AWS services.

Appears to have happened within the last 10-15 minutes.

wcchandler

4 months ago

This is usually something I see on Reddit first, within minutes. I’ve barely seen anything on my front page. While I understand it’s likely the subs I’m subscribed to, that was my only reason for using Reddit. I’ve noticed that for the past year - more and more tech heavy news events don’t bubble up as quickly anymore. I also didn’t see this post for a while for whatever reason. And Digg was hit and miss on availability for me, and I’m just now seeing it load with an item around this.

I think I might be ready to build out a replacement through vibe coding. I don’t like being dependent on user submissions though. I feel like that’s a challenge on its own.

webdoodle

4 months ago

I in-housed an EMR for a local clinic because of latency and other network issues taking the system offline several times a month (usually at least once a week). We had zero downtime the whole first year after bringing it all in house, and I got employee of the month for several months in a row.

whatsupdog

4 months ago

I cannot log in to my AWS account. And the "my account" page on the regular Amazon website is blank on Firefox, but opens in Chrome.

Edit: I can log in to one of the AWS accounts (I have a few different ones for different companies), but my personal one, which has a ".edu" email, is not logging in.

stego-tech

4 months ago

Not remotely surprised. Any competent engineer knows full well the risk of deploying into us-east-1 (or any “default” region for that matter), as well as the risks of relying on global services whose management or interaction layer only exists in said zone. Unfortunately, us-east-1 is the location most outsourcing firms throw stuff, because they don’t have to support it when it goes pear-shaped (that’s the client’s problem, not theirs).

My refusal to hoard every asset into AWS (let alone put anything of import in us-east-1) has saved me repeatedly in the past. Diversity is the foundation of resiliency, after all.

glemmaPaul

4 months ago

LOL: make one DB service a central point of failure, charge gold for small compute instances. Rage about needing multi-AZ, and push the costs onto the developer/organization. But now you fail at the region level, so are we going to need multi-country setups for simple small applications?

thundergolfer

4 months ago

This is widespread. ECR, EC2, Secrets Manager, Dynamo, IAM are what I've personally seen down.

rafa___

4 months ago

"Oct 20 2:01 AM PDT We have identified a potential root cause for error rates for the DynamoDB APIs in the US-EAST-1 Region. Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1..."

It's always DNS...

lsllc

4 months ago

The Ring (doorbell) app isn't working, nor are any of the MBTA (transit) status pages/apps.

bgwalter

4 months ago

Probably related:

https://www.nytimes.com/2025/05/25/business/amazon-ai-coders...

"Pushed to use artificial intelligence, software developers at the e-commerce giant say they must work faster and have less time to think."

Every bit of thinking time spent on a dysfunctional, lying "AI" agent could be spent on understanding the system. Even if you don't move your mouse all the time in order to please a dumb middle manager.

l33tnull

4 months ago

I can't do anything for school because Canvas by Instructure is down because of this.

rdm_blackhole

4 months ago

My app deployed on Vercel, and therefore indirectly on us-east-1, was down for about 2 hours today, then came back up, and then went down again 10 minutes ago for 2 or 3 minutes. It seems like there are still intermittent issues happening.

sam1r

4 months ago

Chime has completely been down for almost 12 hours.

Impacting all banking services with a red status error. Oddly enough, only their direct deposits are functioning without issues.

https://status.chime.com/

fsto

4 months ago

Ironically, the HTTP request to this article timed out twice before a successful response.

shawn_w

4 months ago

One of the radio stations I listen to is just dead air tonight. I assume this is the cause.

ssehpriest

4 months ago

Airtable is down as well.

A lot of businesses have all their workflows depending on their data in Airtable.

qrush

4 months ago

AWS's own management console sign-in isn't even working. This is a huge one. :(

moribvndvs

4 months ago

So, uh, over the weekend I decided to use the fact that my company needs a status checker/page to try out Elixir + Phoenix LiveView, and just now I found out my region is down while tinkering with it and watching Final Destination. That’s a little too on the nose for my comfort.

okr

4 months ago

Btw, we had a forced EKS restart last week on Thursday due to Kubernetes updates. And something was done with DNS there. We had problems with ndots. Caused some trouble here. Would not be surprised if it is related, heh.

assimpleaspossi

4 months ago

I'm thinking about that one guy who clicked on "OK" or hit return.

EbNar

4 months ago

May be because of this that trying to pay with PayPal on Lenovo's website has failed thrice for me today? Just asking... Knowing how everything is connected nowadays it wouldn't surprise me at all.

karel-3d

4 months ago

Slack was down, so I thought I will send message to my coworkers on Signal.

Signal was also down.

ronakjain90

4 months ago

We[1] operate out of `us-east-1` but chose not to use any of the cloud-based vendor lock-in (sorry Vercel, Supabase, Firebase, PlanetScale, etc.). Rather, a few droplets in DigitalOcean (us-east-1) and Hetzner (EU). We serve 100 million requests/mo and a few million pieces of user-generated content (images)/mo at a monthly cost of just about $1000.

It's not difficult, it's just that we engineers chose convenience and delegated uptime to someone else.

[1] - https://usetrmnl.com

littlecranky67

4 months ago

Maybe unrelated, but yesterday I went to pick up my package from an Amazon Locker in Germany, and the display said "Service unavailable". I'll wait until later today before I go and try again.

AbstractH24

4 months ago

Are there websites that do post-mortems for how the single points of failure impacted the entire internet?

Not just AWS, but Cloudflare and others too. Would be interesting to review them clinically.

dabinat

4 months ago

My site was down for a long time after they claimed it was fixed. Eventually I realized the problem lay with Network Load Balancers so I bypassed them for now and got everything back up and running.

mslm

4 months ago

Happened to be updating a bunch of NPM dependencies and then saw `npm i` freeze, and I'm like... ugh, what did I do. Then npm login wasn't working, so I started searching here for an outage, and voilà.

fogzen

4 months ago

Great. Hope they’re down for a few more days and we can get some time off.

00deadbeef

4 months ago

It's not DNS

There's no way it's DNS

It was DNS

pageandrew

4 months ago

Can't even get STS tokens. RDS Proxy is down, SQS, Managed Kafka.

codebolt

4 months ago

Atlassian cloud is also having issues. Closing in on the 3 hour mark.

bstsb

4 months ago

glad all my services are either Hetzner servers or EU region of AWS!

t0lo

4 months ago

It's weird that we're living in a time where this could be a taste of a prolonged future global internet blackout by adversarial nations. Get used to this feeling I guess :)

grk

4 months ago

Does anyone know if having Global Accelerator set up would help right now? It's in the list of affected services, I wonder if it's useful in scenarios like this one.

jrm4

4 months ago

Hey wait wasn't the internet supposed to route around...?

aaronbrethorst

4 months ago

My ISP's DNS servers were inaccessible this morning. Cloudflare and Google's DNS servers have all been working fine, though: 1.1.1.1, 1.0.0.1, and 8.8.8.8

roosgit

4 months ago

Can confirm. I was trying to send the newsletter (with SES) and it didn't work. I was thinking my local boto3 was old, but I figured I should check HN just in case.

disposable2020

4 months ago

I seem to recall other issues around this time in previous years. I wonder if this is some change getting shoe-horned in ahead of some re:Invent release deadline...

draxil

4 months ago

I was just about to post that it didn't affect us (heavy AWS users, in eu-west-1). Buut, I stopped myself because that was just massively tempting fate :)

toephu2

4 months ago

Half the internet goes down because part of AWS goes down... what happened to companies having redundant systems and not having a single point of failure?

vivzkestrel

4 months ago

stupid question: is buying a server rack and running it at home subject to more downtimes in a year than this? has anyone done an actual SLA analysis?

spwa4

4 months ago

Reddit seems to be having issues too:

"upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection timeout"

user

4 months ago

[deleted]

tomaytotomato

4 months ago

Slack, Jira and Zoom are all sluggish for me in the UK

twistedpair

4 months ago

I just saw services that were up since 545AM ET go down around 12:30PM ET. Seems AWS has broken Lambda again in their efforts to fix things.

megous

4 months ago

I didn't even notice anything was wrong today. :) Looks like we're well disconnected from the US internet infra quasi-hegemony.

hexbin010

4 months ago

Why after all these years is us-east-1 such a SPOF?

klon

4 months ago

Statuspage.io seems to load (but is slow) but what is the point if you can't post an incident because Atlassian ID service is down.

yuvadam

4 months ago

During the last us-east-1 apocalypse 14 years ago, I started awsdowntime.com - don't make me register it again and revive the page.

alvis

4 months ago

Why would us-east-1 take many UK banks and even UK gov websites down too!? Shouldn't they operate in the UK region due to GDPR?

BiraIgnacio

4 months ago

It's scary to think about how much power and perhaps influence the AWS platform has. (albeit it shouldn't be surprising)

montek01singh

4 months ago

I cannot create a support ticket with AWS as well.

rob

4 months ago

My Alexa is hit or miss at responding to queries right now at 5:30 AM EST. Was wondering why it wasn't answering when I woke up.

danielpetrica

4 months ago

In moments like this I think devs should invest in vendor independence if they can. While I'm not at that stage yet (Cloudflare dependence), using open technologies like Docker (or Kubernetes) and Traefik instead of managed services can help in disaster situations like this, by letting you switch to a different provider much faster than rebuilding from zero. As a disclosure, I'm still not at that point with my own infrastructure, but I'm trying to slowly get there.

hipratham

4 months ago

Strangely, some of our services are scaling up in us-east-1, and there is a downtick on downdetector.com, so the issue might be resolving.

donmb

4 months ago

Asana down. Postman workspaces don't load. Slack affected. And the worst: Heroku Scheduler just refused to trigger our jobs.

jpfromlondon

4 months ago

This will always be a risk when sharecropping.

Danborg

4 months ago

r/aws not found

There aren't any communities on Reddit with that name. Double-check the community name or start a new community.

user

4 months ago

[deleted]

world2vec

4 months ago

Slack and Zoom working intermittently for me

jcmeyrignac

4 months ago

Impossible to connect to JIRA here (France).

kedihacker

4 months ago

Only us-east-1 gets new services immediately; others might, but it's not guaranteed. Which regions are a good alternative?

user

4 months ago

[deleted]

pardner

4 months ago

Darn, on Heroku even the "maintenance mode" (redirects all routes to a static url) won't kick in.

user

4 months ago

[deleted]

antihero

4 months ago

My website on the cupboard laptop is fine.

cpfleming

4 months ago

Seems to be upsetting Slack a fair bit, messages taking an age to send and OIDC login doesn't want to play.

devttyeu

4 months ago

Can't update my self-hosted Home Assistant because HAOS depends on Docker Hub, which seems to still be down.

suralind

4 months ago

I wonder how their nines are going. Guess they'll have to stay pretty stable for the next 100 years.

user

4 months ago

[deleted]

magnio

4 months ago

npm and pnpm are badly affected as well. Many packages are returning 502 when fetched. Such a bad time...

kevinsundar

4 months ago

AWS pros know to never use us-east-1. Just don't do it. It is easily the least reliable region

homeonthemtn

4 months ago

"We should have a fail back to US-West."

"It's been on the dev teams list for a while"

"Welp....."

tdiff

4 months ago

That strange feeling of the world getting cleaner for a while without all these dependent services.

dude250711

4 months ago

They are amazing at LeetCode though.

nik736

4 months ago

Twilio seems to be affected as well

busymom0

4 months ago

For me Reddit is down and also the amazon home page isn't showing any items for me.

ares623

4 months ago

Did someone vibe code a DNS change

1970-01-01

4 months ago

Completely detached from reality, AMZN has been up all day and closed up 1.6%. Wild.

sinpor1

4 months ago

Its influence is so great that it caused half of the internet to stop working properly.

lawlessone

4 months ago

Am i imagining it or are more things like this happening in recent weeks than usual?

YouAreWRONGtoo

4 months ago

I don't get how you can be a trillion dollar company and still suck this much.

nivekney

4 months ago

Wait a second, Snapchat impacted AGAIN? It was impacted during the last GCP outage.

codegladiator

4 months ago

They haven't listed SES there yet in the affected services on their status page

TrackerFF

4 months ago

Lots of outage happening in Norway, too. So I'm guessing it is a global thing.

IOT_Apprentice

4 months ago

Apparently IMDb, an Amazon service, is impacted. LOL, no multi-region failover.

countWSS

4 months ago

Reddit itself is breaking down and errors appear. Does Reddit itself depend on this?

bpye

4 months ago

Amazon.ca is degraded, some product pages load but can't see prices. Amusing.

motbus3

4 months ago

Always a lovely Monday when you wake just in time to see everything going down

pmig

4 months ago

Thank god we built all our infra on top of EKS, so everything works smoothly =)

assimpleaspossi

4 months ago

As of 4:26am Central Time in the USA, it's back up for one of my services.

sph

4 months ago

10:30 on a Monday morning and already slacking off. Life is good. Time to touch grass, everybody!

kkfx

4 months ago

Honestly, anyone can have outages; that's nothing extraordinary. What's wrong is the number of impacted services. We chose (or at least mostly chose) to ditch mainframes for clusters partly for resilience. Now, with cheap desktop iron labeled "stable enough to be a serious server", we have seen mainframes re-created, sometimes as a cluster of VMs on top of a single server, sometimes with cloud services.

Ladies and gentlemen, it's about time to learn reshoring in the IT world as well. Owning nothing and renting everything means extreme fragility.

bilekas

4 months ago

These things happen when profits are the measure of everything. Change your provider, but if their number doesn't go up, they won't be reliable.

So your complaints matter nothing because "number go up".

I remember the good old days of everyone starting a hosting company. We never should have left.

ZeWaka

4 months ago

Alexa devices are also down.

Ygg2

4 months ago

Ironically enough I can't access Reddit due to no healthy upstream.

valdiorn

4 months ago

I missed a parcel delivery because a computer server in Virginia, USA went down, and now the doorbell on my house in England doesn't work. What. The. Fork.

How the hell did Ring/Amazon not include a radio-frequency transmitter for the doorbell and chime? This is absurd.

To top it off, I'm trying to do my quarterly VAT return, and Xero is still completely borked, nearly 20 hours after the initial outage.

jodrellblank

4 months ago

Another time to link The Machine Stops by E.M. Forster, 1909: https://web.cs.ucdavis.edu/~rogaway/classes/188/materials/th...

> “The Machine,” they exclaimed, “feeds us and clothes us and houses us; through it we speak to one another, through it we see one another, in it we have our being. The Machine is the friend of ideas and the enemy of superstition: the Machine is omnipotent, eternal; blessed is the Machine.”

..

> "she spoke with some petulance to the Committee of the Mending Apparatus. They replied, as before, that the defect would be set right shortly. “Shortly! At once!” she retorted"

..

> "there came a day when, without the slightest warning, without any previous hint of feebleness, the entire communication-system broke down, all over the world, and the world, as they understood it, ended."

gritzko

4 months ago

idiocracy_window_view.jpg

Liftyee

4 months ago

Damn. This is why Duolingo isn't working properly right now.

TrackerFF

4 months ago

Lots of outage in Norway, started approximately 1 hour ago for me.

ryanmcdonough

4 months ago

Now, I may well be naive - but isn't the point of these systems that you fail over gracefully to another data centre and no-one notices?

user

4 months ago

[deleted]

sineausr931

4 months ago

On a bright note, Alexa has stopped pushing me merchandise.

ta1243

4 months ago

Meanwhile my pair of 12-year-old Raspberry Pis handling my home services like DNS survive their 3rd AWS us-east-1 outage.

"But you can't do webscale uptime on your own"

Sure. I suspect even a single pi with auto-updates on has less downtime.

nokeya

4 months ago

Serverless is down because servers are down. What an irony.

j45

4 months ago

More and more I want to be cloud agnostic or multi-cloud.

redeux

4 months ago

It’s a good day to be a DR software company or consultant

arrty88

4 months ago

I expect gcp and azure to gain some customers after this

nla

4 months ago

I still don't know why anyone would use AWS hosting.

chistev

4 months ago

What is HN hosted on?

croemer

4 months ago

Coinbase down as well

ecommerceguy

4 months ago

Just tried to get into Seller Central, returned a 504.

empressplay

4 months ago

Can't check out on Amazon.com.au, gives error page

XorNot

4 months ago

Well that takes down Docker Hub as well it looks like.

hippo77

4 months ago

Finally an upside to running on Oracle Cloud!

testemailfordg2

4 months ago

Seems like we need more anti-trust cases against AWS, or need to break it up; it is becoming too big. Services used in the rest of the world get impacted by issues in one region.

user

4 months ago

[deleted]

a-dub

4 months ago

I am amused at how us-east-1 is basically in the same location as where AOL kept its datacenters back in the day.

thecopy

4 months ago

I did get a 500 error from their public ECR too.

jimrandomh

4 months ago

The RDS proxy for our postgres DB went down.

bicepjai

4 months ago

Is this the outage that took Medium down ?

BaudouinVH

4 months ago

canva.com was down until a few minutes ago.

codebolt

4 months ago

Atlassian cloud is having problems as well.

_pvzn

4 months ago

Can confirm, also getting hit with this.

al_james

4 months ago

Can't even log in via the AWS access portal.

chasd00

4 months ago

wow I think most of Mulesoft is down, that's pretty significant in my little corner of the tech world.

seanieb

4 months ago

Clearly this is all some sort of mass delusion event, the Amazon Ring status says everything is working.

https://status.ring.com/

(Useless service status pages are incredibly annoying)

mk89

4 months ago

It's fun to see SREs jumping left and right when they can do basically nothing at all.

"Do we enable DR? Yes/No". That's all you can do. If you do, it's a whole machinery starting, which might take longer than the outage itself.

They can't even use Slack to communicate - messages are being dropped/not sent.

And then we laugh at the South Koreans for not having backed up their hard drives (which got burnt by actual fire, a statistically far less frequent event than an AWS outage). OK, that's a huge screw-up, but hey, this is not insignificant either.

What will happen now? Nothing, like nothing happened after Crowdstrike's bug last year.

mmmlinux

4 months ago

Ohno, not Fortnite! oh, the humanity.

t0lo

4 months ago

Can't log into tidal for my music

hubertzhang

4 months ago

I cannot pull images from docker hub.

goodegg

4 months ago

Terraform Cloud is having problems too.

mpcoder

4 months ago

I can't even see my EKS clusters

zoklet-enjoyer

4 months ago

Is this why Wordle logged me out and my 2 guesses don't seem to have been recorded? I am worried about losing my streak.

tosh

4 months ago

SES and signal seem to work again

htrp

4 months ago

Thundering herd problems... every time they say they've fixed it, something else breaks.

8cvor6j844qw_d6

4 months ago

That's unusual.

I was under the impression that having multiple availability zones guarantees high availability.

It seems this is not the case.

fastball

4 months ago

One of my co-workers was woken up by his Eight Sleep going haywire. He couldn't turn it off because the app wouldn't work (presumably running on AWS).

teunlao

4 months ago

us-east-1 down again. We all know we should leave. None of us will.

president_zippy

4 months ago

I wonder how much better the uptime would be if they made a sincere effort to retain engineering staff.

Right now on levels.fyi, the highest-paying non-managerial engineering role is offered by Oracle. They might not pay the recent grads as well as Google or Microsoft, but they definitely value the principal engineers w/ 20 years of experience.

zwnow

4 months ago

I love this to be honest. Validates my anti cloud stance.

bdangubic

4 months ago

If you put your sh*t in us-east-1, you need to plan for this :)

askonomm

4 months ago

Docker is also down.

worik

4 months ago

This outage is a reminder:

Economic efficiency and technical complexity are both, separately and together, enemies of resilience

goodegg

4 months ago

Happy Monday People

ArcHound

4 months ago

Good luck to all on-callers today.

It might be an interesting exercise to map how many of our services depend on us-east-1 in one way or another. One can only hope that somebody would do something with the intel, even though it's not a feature that brings money in (at least from business perspective).

ktosobcy

4 months ago

Uhm... E(U)ropean sovereignty (and in general spreading the hosting as much as possible) needed ASAP…

tosh

4 months ago

seeing issues with SES in us-east-1 as well

kitd

4 months ago

O ffs. I can't even access the NYT puzzles in the meantime ... Seriously disrupted, man

grenran

4 months ago

seems like services are slowly recovering

ivape

4 months ago

There are entire apps like Reddit that are still not working. What the fuck is going on?

redwood

4 months ago

Surprising and sad to see how many folks are using DynamoDB. There are more full-featured multi-cloud options that don't lock you in and that don't have these single-point-of-failure problems.

And they give you a much better developer experience...

Sigh

pantulis

4 months ago

Now I know why the documents I was sending to my Kindle didn't go through.

iwontberude

4 months ago

worst outage since xmas time 2012

thinkindie

4 months ago

Today’s reminder: multi-region is so hard even AWS can’t get it right.

LightBug1

4 months ago

Remember when the "internet will just route around a network problem"?

FFS ...

solatic

4 months ago

And yet, AMZN is up for the day. The market doesn't care. Crazy.

JCharante

4 months ago

Ring is affected. Why doesn’t Ring have failover to another region?

Aldipower

4 months ago

altavista.com is also down!

chaidhat

4 months ago

is this why docker is down?

SergeAx

4 months ago

How much longer are we going to tolerate this marketing bullshit about "Designed to provide 99.999999999% durability and 99.99% availability"?

user

4 months ago

[deleted]

jumploops

4 months ago

"Never choose us-east-1"

aiiizzz

4 months ago

Slack was acting slower than usual, but did not go down. Color me impressed.

dangoodmanUT

4 months ago

Reminder that AZs don't go down

Entire regions go down

Don't pay for intra-az traffic friends

throw-10-13

4 months ago

imagine spending millions on devops and sre to still have your mission critical service go down because amazon still has baked in regional dependencies

jdlyga

4 months ago

Time to start calling BS on the 9's of reliability

DataDaemon

4 months ago

But but this is a cloud, it should exist in the cloud.

martinheidegger

4 months ago

"Designed to provide 99.999% durability and 99.999% availability." Still designed, not implemented.

xodice

4 months ago

Major us-east-1 outages happened in 2011, 2015, 2017, 2020, 2021, 2023, and now again. I understand that us-east-1, N. VA, was the first DC, but for fuck's sake, they've had HOW LONG to finish AWS and make it not depend on us-east-1 to stay up.

hvb2

4 months ago

First, not all outages are created equal, so you cannot compare them like that.

I believe the 2021 one was especially horrific because it affected their DNS service (Route 53) and made writes to that service impossible. That broke failovers and the like, so even their prescribed multi-region setups didn't work.

But in the end, some things will have to synchronize their writes somewhere, right? So for DNS I could see how that ends up in a single region.

AWS is bound by the same rules as everyone else in the end... The only thing they have going for them is that they have a lot of money to make certain services resilient, but I'm not aware of a single system that's resilient to everything.
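For what it's worth, the usual mitigation for that specific failure mode is to pre-provision health-check-based failover records, since Route 53's DNS resolution and health checking happen in the data plane and don't require a record write at failover time. A minimal boto3 sketch; the zone ID, health check ID, and IPs are placeholders:

```python
import boto3

route53 = boto3.client("route53")

# Placeholders: hosted zone ID, health check ID, and addresses are hypothetical.
route53.change_resource_record_sets(
    HostedZoneId="Z0000000000EXAMPLE",
    ChangeBatch={
        "Comment": "Pre-provisioned active/passive failover for api.example.com",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "A",
                    "SetIdentifier": "primary-us-east-1",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "198.51.100.10"}],
                    "HealthCheckId": "11111111-2222-3333-4444-555555555555",
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "A",
                    "SetIdentifier": "secondary-eu-west-1",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "203.0.113.20"}],
                },
            },
        ],
    },
)
```

The point is that traffic switches because the health check flips, not because anyone has to issue a Route 53 change during the incident.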

xodice

4 months ago

If AWS fully decentralized its control planes, they'd essentially be duplicating the cost structure of running multiple independent clouds, and I understand that's why they don't. But as long as AWS relies on us-east-1 to function, they haven't achieved what they claim, as far as I'm concerned. A single point of failure for IAM? Nah, no thanks.

Every AWS “global” service, be it IAM, STS, CloudFormation, CloudFront, Route 53, or Organizations, has deep ties to control systems originally built only in us-east-1 (N. Virginia).

That's poor design, after all these years. They've had time to fix this.

Until AWS fully decouples the control plane from us-east-1, the entire platform has a global dependency. Even if your data plane is fine, you still rely on IAM and STS for authentication, maybe Route 53 for DNS or failover, and CloudFormation or ECS for orchestration...

If any of those choke because us-east-1’s internal control systems are degraded, you’re fucked. That’s not true regional independence.
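One small thing application teams can do about the STS piece (it doesn't fix the control-plane coupling, it just keeps routine token calls out of us-east-1) is to use the regional STS endpoint instead of the legacy global one. A minimal boto3 sketch; the region is an example:

```python
import boto3

# The legacy global endpoint (sts.amazonaws.com) is served out of us-east-1;
# the regional endpoint keeps credential/token calls in your own region.
sts = boto3.client(
    "sts",
    region_name="eu-west-1",
    endpoint_url="https://sts.eu-west-1.amazonaws.com",
)

print(sts.get_caller_identity()["Account"])
```

Most current SDKs also honor AWS_STS_REGIONAL_ENDPOINTS=regional, which does the same thing without hard-coding the URL.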

hvb2

4 months ago

You can only decentralize your control plane if you don't have conflicting requirements, right?

Assuming you cannot alter requirements or SLAs, I could see how their technical solutions are limited. It's possible, just not without breaking their promises. At that point it's no longer a technical problem.

xodice

4 months ago

In the narrow distributed-systems sense? Yes, but those requirements are self-imposed. AWS chose strong global consistency for IAM and billing... they could loosen it, at enormous expense.

The control plane must know the truth about your account, and that truth must be globally consistent. That's where the trouble starts, I guess.

I think my old-school sysadmin ethos is just different from theirs. It's not about who's wrong or right, just a difference of opinion on how it should be done, I guess.

The ISP I work for requires us to design so that no single DC is a point of failure. It's just a difference in design methods, and I have to remember that the DC I work in is used completely differently than AWS.

In the end, however, I know solutions for this exist (federated ledgers, CRDT-based control planes, regional autonomy), but they're expensive and they don't look good on quarterly slides. It just takes the almighty dollar to implement, and that goes against big business: if it "works", it works, I guess.

AWS’s model scales to millions of accounts because it hides complexity, sure, but the same philosophy that enables that scale prevents true decentralization. That is shit. I guess people can architect as if us-east-1 can disappear so that things continue on, but then that's AWS pushing complexity into your code. They're just shifting who shoulders that little-known issue.

nemo44x

4 months ago

Someone’s got a case of the Mondays.

grebc

4 months ago

Good thing hyperscalers provide 100% uptime.

mcbain

4 months ago

When have they ever claimed that?

grebc

4 months ago

Plenty of people here last week were claiming hyperscalers are necessary while ignoring more normal hosting options.

BartjeD

4 months ago

So much for the peeps claiming amazing Cloud uptime ;)

JackSlateur

4 months ago

Could you give us that uptime, in numbers?

BartjeD

4 months ago

I'm afraid you missed the emoticon at the end of the sentence.

A `;)` is normally understood to mean the author isn't entirely serious and is making light of something or other.

Perhaps you American downvoters were on call and woke up with a fright, and perhaps too much time to browse Hacker News. ;)

dorongrinstein

4 months ago

Anyone needing multi-cloud WITH EASE, please get in touch. https://controlplane.com

I am the CEO of the company and started it because I wanted to give engineering teams an unbreakable cloud. You can mix and match services from ANY cloud provider, and workloads fail over seamlessly across clouds and on-prem environments.

Feel free to get in touch!

avi_vallarapu

4 months ago

This is why it is important to plan for disaster recovery and multi-cloud architectures.

Our applications and databases must have ultra-high availability. That can be achieved by hosting applications and data platforms in different regions for failover.

Critical businesses should also plan for replication across multiple cloud platforms. You can use some of the existing solutions out there that help with such implementations for data platforms, such as:

- Qlik Replicate
- HexaRocket

and some more.

Or rather, implement the native replication solutions available with your data platforms.
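As a concrete example of the native route: for DynamoDB (which featured heavily in this outage), the built-in option is global tables, where adding a replica region is a single call. A sketch assuming the table already meets the global-tables prerequisites; the table name and regions are hypothetical:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Add a replica in a second region; DynamoDB global tables then replicate
# writes between the two copies automatically.
dynamodb.update_table(
    TableName="orders",  # hypothetical table
    ReplicaUpdates=[
        {"Create": {"RegionName": "eu-west-1"}},
    ],
)
```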