swiftcoder
3 months ago
Because for most major sites, updating the status page requires (a significant number of) humans in the loop.
Back when I worked at a major cloud provider (admittedly >5 years ago), our alarms would go off after ~3-15 minutes of degraded functionality (depending on the sensitivity settings of that specific alarm). At that point the on-call got paged in to investigate and validate that the issue was real (and not trivially correctable). There was also automatic escalation if the on-call didn't acknowledge the issue within 15 minutes.
If so, a manager gets paged in to coordinate the response, and if the manager considers the outage serious (or affecting a key customer), a director or above gets paged in. The director/VP has the ultimate say about posting an outage, but in parallel they consult the PR/comms team on the wording/severity of the notification, any partnership managers for key affected clients, and legal regarding any contractual requirements the outage may be breaching...
So in a best-case scenario you'd have 3 minutes (for a fast alarm to fire), plus ~5 minutes for the on-call to engage, plus ~10 minutes of initial investigation, plus ~20 minutes of escalations and discussions... all before anyone with permission to edit the status page can actually do so.
2gremlin181
3 months ago
Copying my response over from another comment:
I totally get that, but how hard would it be to actually make calls to your own API from the status page? If it fails, display a vague message saying there might be issues and that you are looking into it. Clearly these metrics and alerts exist internally too. I'm not asking for an instant RCA or confirmation of the scope of the outage. Just stop gaslighting me.
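What this commenter describes could be sketched roughly like the following (the endpoint URLs and wording are hypothetical placeholders, not any real provider's API; a production page would also need caching and flap protection):

```python
import urllib.request

# Hypothetical health endpoints - placeholders for illustration only
PROBES = {
    "API": "https://api.example.com/health",
    "Dashboard": "https://app.example.com/health",
}

def probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with a 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

def status_lines(probes=PROBES) -> list[str]:
    """Render one status line per probed component, with a deliberately
    vague message on failure - no RCA, just an acknowledgement."""
    lines = []
    for name, url in probes.items():
        if probe(url):
            lines.append(f"{name}: operational")
        else:
            lines.append(f"{name}: we may be experiencing issues and are looking into it")
    return lines
```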
rozenmd
3 months ago
There are a growing number of status pages that update automatically based on uptime data (I built a service providing that - OnlineOrNot).
Early-stage startups typically have engineering own the status page, but as they grow, ownership usually transfers to customer support. Those teams optimize for controlling the message rather than for technical detail, which explains the shift toward vaguer/slower incident descriptions.
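Auto-updating status services like this typically debounce their checks, flipping the public status only after several consecutive failures so a single blown probe doesn't flap the page. A minimal sketch of that idea (the thresholds are illustrative assumptions, not OnlineOrNot's actual behavior):

```python
class StatusTracker:
    """Flip to 'degraded' only after N consecutive failed probes,
    and back to 'operational' after M consecutive successes,
    so one transient blip doesn't flap the public page."""

    def __init__(self, fail_threshold: int = 3, recover_threshold: int = 2):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.status = "operational"
        self._fails = 0
        self._oks = 0

    def record(self, probe_ok: bool) -> str:
        if probe_ok:
            self._oks += 1
            self._fails = 0  # streak broken
            if self.status == "degraded" and self._oks >= self.recover_threshold:
                self.status = "operational"
        else:
            self._fails += 1
            self._oks = 0  # streak broken
            if self._fails >= self.fail_threshold:
                self.status = "degraded"
        return self.status
```

A single failed probe (or two) leaves the page green; only a sustained failure streak changes the published status.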
Yeri
3 months ago
Because you'd have a ton of downtime and they'd rather hide it if they could. :)
I used to work at a very big cloud service provider, and as the initial comment mentioned, we'd get a ton of escalations/alerts in a day, but the majority didn't necessarily warrant a status page update (only affecting X% of users, or not 'major' enough, or not having any visible public impact).
I don't really agree with that, but that's how it was. A manager would decide whether or not to update the status page, the wording was reviewed before being posted, etc. All that takes a lot of time.
swiftcoder
3 months ago
Not hard at all (our internal dashboards did just that). But to have that data posted publicly was not in the best interests of the business.
And honestly, having been on a few customer escalations where they threatened legal action over outages, one kind of starts to see things the business way...
dvt
3 months ago
> Just stop gaslighting me.
I heard this years ago from someone, but there's material impact to a company's bottom line if those pages get updated, so that's why someone fairly senior has to usually "approve" it. Obviously it's technically trivial, but if they acknowledge downtime (for example, like in the AWS case), investors will have questions, it might make quarterly reports, and it might impact stock price.
So it's not just a "status page," it's an indicator that could affect market sentiment, so there's a lot of pressure to leave everything "green" until there's no way to avoid it.
FinnKuhn
3 months ago
I feel like there should at least be some sort of disclaimer that tells me the status page can take up to xx minutes to show an outage, rather than making it seem as if it updates instantaneously. That way I could wait those xx minutes before filing a ticket with support, instead of opening a case thinking it's an isolated problem on my end rather than a major outage.
mirekrusin
3 months ago
So there IS access to the "degraded functionality" data from the start (the "~3-15 minutes of degraded functionality" part) - people are asking why not share THAT then?
Nobody cares about internal escalations, or whether a manager is taking shit or not - that's not service status, that's the internal dealing-with-the-shit process - it can surface as extra timestamped comments next to the service STATUS.
swiftcoder
3 months ago
> why not share THAT then?
When you've guaranteed 4 or 5 nines worth of uptime to the customer, every acknowledged outage results in refunds (and potentially being sued over breach of contract)
xigoi
3 months ago
On the other hand, if they’re down but don’t report it, couldn’t they be sued for fraud?
chrismorgan
3 months ago
Meh, I’ve never seen an uptime (SLA) guarantee that was worth anything anyway. The publicly-offered ones are consistently toothless (can’t comment on privately-negotiated ones). I’ve written about it a few times, with a couple of specific examples: https://hn.algolia.com/?type=comment&query=sla+chrismorgan.
But not acknowledging actual outages, yeah, that would open you up to accusations of fraud, which is probably in theory much more serious.
jakevoytko
3 months ago
Because the systems are so complex and capable of emergent behavior that you need a human in the loop to truly interpret behavior and impact. Just because an alert is going off doesn't mean that the alert was written properly, or is measuring the correct thing, or the customer is interpreting its meaning correctly, etc.
mirekrusin
3 months ago
Health probes sit at the easy end of the software complexity spectrum. This has nothing to do with complexity and everything to do with managing reputational damage in a shady way.