swiftcoder
3 months ago
Because for most major sites, updating the status page requires (a significant number of) humans in the loop.
Back when I worked at a major cloud provider (admittedly >5 years ago), our alarms would go off after ~3-15 minutes of degraded functionality (depending on the sensitivity settings of that specific alarm). At that point the on-call got paged in to investigate and validate that the issue was real (and not trivially correctable). There was also automatic escalation if the on-call didn't acknowledge the issue within 15 minutes.
If so, a manager gets paged in to coordinate the response, and if the manager considers the outage serious (or affecting a key customer), a director or above gets paged in. The director/VP has the ultimate say about posting an outage, but in parallel they consult the PR/comms team on the wording/severity of the notification, any partnership managers for key affected clients, and legal regarding any contractual requirements the outage may be breaching...
So in a best-case scenario you'd have 3 minutes (for a fast alarm to fire), plus ~5 minutes for the on-call to engage, plus ~10 minutes of initial investigation, plus ~20 minutes of escalations and discussions... all before anyone with permission to edit the status page can actually do so.
2gremlin181
3 months ago
Copying my response over from another comment:
I totally get that, but how hard would it be to actually make calls to your own API from the status page? If it fails, display a vague message saying there might be issues and that you are looking into it. Clearly these metrics and alerts exist internally too. I'm not asking for an instant RCA or confirmation of the scope of the outage. Just stop gaslighting me.
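What this commenter describes could be sketched roughly like the following (the endpoint URLs and wording are hypothetical placeholders, not any real provider's API; a production page would also need caching and flap protection):

```python
import urllib.request

# Hypothetical health endpoints - placeholders for illustration only
PROBES = {
    "API": "https://api.example.com/health",
    "Dashboard": "https://app.example.com/health",
}

def probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with a 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

def status_lines(probes=PROBES) -> list[str]:
    """Render one status line per probed component, with a deliberately
    vague message on failure - no RCA, just an acknowledgement."""
    lines = []
    for name, url in probes.items():
        if probe(url):
            lines.append(f"{name}: operational")
        else:
            lines.append(f"{name}: we may be experiencing issues and are looking into it")
    return lines
```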
rozenmd
3 months ago
There are a growing number of status pages that update automatically based on uptime data (I built a service providing that - OnlineOrNot).
Early-stage startups typically have engineering own the status page, but as they grow, ownership usually transfers to customer support. Those teams optimize for controlling the message rather than for technical detail, which explains the shift toward vaguer/slower incident descriptions.
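Auto-updating status services like this typically debounce their checks, flipping the public status only after several consecutive failures so a single blown probe doesn't flap the page. A minimal sketch of that idea (the thresholds are illustrative assumptions, not OnlineOrNot's actual behavior):

```python
class StatusTracker:
    """Flip to 'degraded' only after N consecutive failed probes,
    and back to 'operational' after M consecutive successes,
    so one transient blip doesn't flap the public page."""

    def __init__(self, fail_threshold: int = 3, recover_threshold: int = 2):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.status = "operational"
        self._fails = 0
        self._oks = 0

    def record(self, probe_ok: bool) -> str:
        if probe_ok:
            self._oks += 1
            self._fails = 0  # streak broken
            if self.status == "degraded" and self._oks >= self.recover_threshold:
                self.status = "operational"
        else:
            self._fails += 1
            self._oks = 0  # streak broken
            if self._fails >= self.fail_threshold:
                self.status = "degraded"
        return self.status
```

A single failed probe (or two) leaves the page green; only a sustained failure streak changes the published status.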
Yeri
3 months ago
Because you'd have a ton of downtime and they'd rather hide it if they could. :)
I used to work at a very big cloud service provider, and as the initial comment mentioned, we'd get a ton of escalations/alerts in a day, but the majority didn't necessarily warrant a status page update (only affecting X% of users, or not 'major' enough, or not having any visible public impact).
I don't really agree with that, but that's how it was. A manager would decide whether or not to update the status page, the wording was reviewed before being posted, etc. All that takes a lot of time.
swiftcoder
3 months ago
Not hard at all (our internal dashboards did just that). But to have that data posted publicly was not in the best interests of the business.
And honestly, having been on a few customer escalations where they threatened legal action over outages, one kind of starts to see things the business way...
dvt
3 months ago
> Just stop gaslighting me.
I heard this years ago from someone, but there's material impact to a company's bottom line if those pages get updated, so that's why someone fairly senior has to usually "approve" it. Obviously it's technically trivial, but if they acknowledge downtime (for example, like in the AWS case), investors will have questions, it might make quarterly reports, and it might impact stock price.
So it's not just a "status page," it's an indicator that could affect market sentiment, so there's a lot of pressure to leave everything "green" until there's no way to avoid it.
FinnKuhn
3 months ago
I feel like there should at least be some sort of disclaimer that tells me the status page can take up to xx minutes to show an outage, rather than making it seem as if it updates instantaneously. That way I could wait those xx minutes before filing a ticket with support, instead of opening a case thinking it's an isolated problem on my end rather than a major outage.
mirekrusin
3 months ago
So there IS access to the "degraded functionality" data from the start (the "~3-15 minutes of degraded functionality" part) - people are asking why not share THAT then?
Nobody cares about internal escalations, or whether a manager is taking shit or not - that's not service status, that's the internal dealing-with-the-shit process - it can surface as extra timestamped comments next to the service STATUS.
swiftcoder
3 months ago
> why not share THAT then?
When you've guaranteed 4 or 5 nines worth of uptime to the customer, every acknowledged outage results in refunds (and potentially being sued over breach of contract)
xigoi
3 months ago
On the other hand, if they’re down but don’t report it, couldn’t they be sued for fraud?
chrismorgan
3 months ago
Meh, I’ve never seen an uptime (SLA) guarantee that was worth anything anyway. The publicly-offered ones are consistently toothless (can’t comment on privately-negotiated ones). I’ve written about it a few times, with a couple of specific examples: https://hn.algolia.com/?type=comment&query=sla+chrismorgan.
But not acknowledging actual outages, yeah, that would open you up to accusations of fraud, which is probably in theory much more serious.
jakevoytko
3 months ago
Because the systems are so complex and capable of emergent behavior that you need a human in the loop to truly interpret behavior and impact. Just because an alert is going off doesn't mean that the alert was written properly, or is measuring the correct thing, or the customer is interpreting its meaning correctly, etc.
mirekrusin
3 months ago
Health probes sit at the easy end of the software complexity spectrum. This has nothing to do with complexity and everything to do with managing reputational damage in a shady way.