swiftcoder
6 hours ago
Because for most major sites, updating the status page requires (a significant number of) humans in the loop.
Back when I worked at a major cloud provider (which admittedly was >5 years ago), our alarms would go off after ~3-15 minutes of degraded functionality (depending on the sensitivity settings of that specific alarm). At that point the on call gets paged in to investigate and validates that the issue is real (and not trivially correctable). There was also automatic escalation if the on call doesn't acknowledge the issue after 15 minutes.
If so, a manager gets paged in to coordinate the response, and if the manager considers the outage to be serious (or to affect a key customer), a director or above gets paged in. The director/VP has the ultimate say about posting an outage, but they in parallel consult the PR/comms team to consult on the wording/severity of the notification, any partnership managers for key affected clients, and legal re any contractual requirements the outage may be breaching...
So in a best-case scenario you'd have 3 minutes (for a fast alarm to raise) plus ~5 minutes for the on call to engage, plus ~10 minutes of initial investigation, plus ~20 minutes of escalations and discussions... all before anyone with permission to edit the status page can go ahead and do so
mirekrusin
4 hours ago
So there is access to "degraded functionality" from start (the "3-15" of "degraded functionality" one) - people are asking why not share THAT then?
Nobody cares about internal escalations, if manager is taking shit or not - that's not service status, that's internal dealing with the shit process - it can surface as extra timestamped comments next to service STATUS.
swiftcoder
4 hours ago
> why not share THAT then?
When you've guaranteed 4 or 5 nines worth of uptime to the customer, every acknowledged outage results in refunds (and potentially being sued over breach of contract)
xigoi
2 hours ago
On the other hand, if they’re down but don’t report it, couldn’t they be sued for fraud?
jakevoytko
4 hours ago
Because the systems are so complex and capable of emergent behavior that you need a human in the loop to truly interpret behavior and impact. Just because an alert is going off doesn't mean that the alert was written properly, or is measuring the correct thing, or the customer is interpreting its meaning correctly, etc.
2gremlin181
6 hours ago
Copying my response over from another comment:
I totally get that, but how hard would it be to actually make calls to your own API from the status page? If it fails, display a vague message saying there might be issues and that you are looking into it. Clearly these metrics and alerts exist internally too. I'm not asking for an instant RCA or confirmation of the scope of the outage. Just stop gaslighting me.
rozenmd
5 hours ago
There are increasingly more status pages that automatically update based on uptime data (I built a service providing that - OnlineOrNot)
But early-stage startups typically have engineering own the status page, but as they grow, ownership usually transfers to customer support. These teams optimize for controlling the message rather than technical detail, which explains the shift toward vaguer/slower incident descriptions.
Yeri
6 hours ago
Because you'd have a ton of downtime and they'd rather hide it if they could. :)
I used to work at a very big cloud service provider, and as the initial comment mentioned, we'd get a ton of escalations/alerts in a day, but the majority didn't necessarily warrant a status page update (only affecting X% of users, or not 'major' enough, or not having any visible public impact).
I don't really agree with that, but that was how it was. A manger would decide whether or not to update the status page, the wording was reviewed before being posted, etc. All that takes a lot of time.
swiftcoder
4 hours ago
Not hard at all (our internal dashboards did just that). But to have that data posted publicly was not in the best interests of the business.
And honestly, having been on a few customer escalations where they threatened legal action over outages, one kind of starts to see things the business way...
dvt
5 hours ago
> Just stop gaslighting me.
I heard this years ago from someone, but there's material impact to a company's bottom line if those pages get updated, so that's why someone fairly senior has to usually "approve" it. Obviously it's technically trivial, but if they acknowledge downtime (for example, like in the AWS case), investors will have questions, it might make quarterly reports, and it might impact stock price.
So it's not just a "status page," it's an indicator that could affect market sentiment, so there's a lot of pressure to leave everything "green" until there's no way to avoid it.
FinnKuhn
5 hours ago
I feel like there should at least be some sort of disclaimer then that tells me the status page can take up to xx minutes to show an outage and not make it seem as if it is updated instantaniously. That way I could way those xx minutes before I file a ticket with support and not have the case thinking it is an isolated problem for me instead of a major outage.