benjaminwootton
9 days ago
How can it take 3-4 months to get an eCommerce site back online? I assume you could redeploy everything from scratch in less time if you have the source code and release assets. With backups and failover sites, I can't think of any world where this would happen.
paxys
9 days ago
It isn't surprising at all. There's a reason why tech companies have insanely large engineering teams, even though it feels to an outsider (and to inept management) that nobody is doing anything. It takes a lot of manpower and hours to keep a complex system working and up to date. Who validates the backups? Who writes the wikis? Who trains new hires? Who staffs all the on-call rotations? Who organizes disaster recovery drills? Who runs red team exercises? After the company has had repeated layoffs and has fired, outsourced or otherwise pushed out all this "overhead", eventually there's no one remaining who actually understands how the system works. One small outage later, this is exactly the situation you end up in.
chatmasta
9 days ago
Sure, but for every efficiently run company, there’s another with 80% of its engineers working on a “new vision” with zero customers, while the revenue-generating software sits idle or attended by one or two developers…
And maybe this is intentional, rational strategy - why not reinvest profits in R&D? But just because an organization is large does not mean that it’s efficient.
coliveira
9 days ago
Agreed, and that is a wonderful punishment for these companies.
phatfish
9 days ago
Yup, it turns out all those Indian contractors/outsourced staff don't really give a shit.
gosub100
9 days ago
They're paid not to. Disagreement rocks the boat, they get fired and have 2 weeks to pack it up and fly home.
sunshinerag
8 days ago
What does race have to do with it?
nebula8804
9 days ago
And yet somehow Twitter plods along.
chownie
8 days ago
It turns out if you login-wall the site you can get away with a few catastrophic outages; people won't even remember them.
wiether
8 days ago
I never understood this Twitter thing.
- Everybody in their right mind agreed that, for what they were achieving, Twitter was completely over-staffed. Like most of the big tech cos in this period. And like most of those cos, they went through a leaning-out program with mass layoffs.
- If the service is running fine with only 10% of the staff, it doesn't necessarily mean that the 90% that got fired were useless. I can get a 6yo to heat their food using a microwave. Does it mean that the kid is a genius, or that the people who made the microwave designed it in a way that allows a kid to operate it, even though it's a complex system at its core?
- Comparing Twitter to an international eCom website is disingenuous. If "design Twitter" is a common system design interview question, it's not because the website is popular, it's because the basics are quite simple. Whereas, behind an eCom website, there are dozens of moving parts at any time, with hundreds of interoperability issues. You're not relying mainly on your own DB for your data; most of it comes from external systems.
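To make the "basics are quite simple" point concrete, here's a toy fan-out-on-write timeline in Python. Everything below (the in-memory dicts, the function names) is purely illustrative and nothing like Twitter's real stack; the hard parts at real scale are celebrity fan-out, search, ads and so on, not this core loop.

```python
from collections import defaultdict, deque

# Toy fan-out-on-write: each user's home timeline is precomputed at post time.
followers = defaultdict(set)                         # author_id -> follower ids
timelines = defaultdict(lambda: deque(maxlen=800))   # user_id -> recent tweet ids
tweets = {}                                          # tweet_id -> (author_id, text)

def follow(follower_id, author_id):
    followers[author_id].add(follower_id)

def post_tweet(tweet_id, author_id, text):
    tweets[tweet_id] = (author_id, text)
    # Push the new tweet onto every follower's precomputed timeline.
    for uid in followers[author_id]:
        timelines[uid].appendleft(tweet_id)

def home_timeline(user_id, limit=20):
    return [tweets[tid] for tid in list(timelines[user_id])[:limit]]

follow("alice", "bob")
post_tweet(1, "bob", "hello world")
print(home_timeline("alice"))   # [('bob', 'hello world')]
```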
cloudsec9
7 days ago
"The basics are simple" ... hrm. I think the concepts of how Twitter works is simple; but during their "fail whale" days, they had to reinvent things on the fly to achieve scale and reliability.
Twitter used to have lots of moving parts, and money flowed from various ads and placements, and that was much more complex then I think people appreciate. With their new head twit, they destroyed most of their ad revenue stream and are now hyping paying for APIs and posting privileges, and it shows.
My main irritation is that people say it "works fine", when lots of crap is broken all over, and it now has regular outages. They have shipped like ONE feature that was mostly done earlier since the takeover.
CobrastanJorji
9 days ago
Yep. It takes way fewer people to operate a working system than to build a new one. And the nature of capitalism is that you will pare down your numbers until you have the absolute minimum staffing you need to keep the lights on. Then when everything explodes, you completely lack the know-how to fix it. Then the CEO yells at the tech executive, who responds by demanding hourly updates from the two junior devs who operate the site, and nobody wants to admit that they aren't capable of fixing it, and nobody's gonna OK a really expensive "we're gonna spend a month emergency-building a new thing" plan, because a month is obviously way too much time when you need it fixed right now, and then three months go by and here you are.
chatmasta
9 days ago
A friend of a friend told me about an organization that has a steady income from existing products maintained by just enough engineers to keep the lights on, while the other 80% of the organization is building the “new version” that no customer asked for and that nobody is currently paying for. There’s one product, used by more than 80% of customers, that’s maintained by 2 developers and that the CEO isn’t even aware exists.
jemmyw
9 days ago
Ya I've been there. I even tried pitching to management that a small team of us wanted to move to the legacy product and iteratively improve it because it had customers and revenue and we could make an impact while the new product was under development. They said no. I left about 6 months after. 9 years later the legacy product is still running. I can't find any evidence that they launched a new one.
spacebanana7
9 days ago
I get the opposite impression. Stale software organisations with steady operating products seem to use massive headcounts, whereas startups building new products often get by with relatively few people.
esseph
9 days ago
Startups don't have to run a software stack for decades, or deal with hardware refreshes, SKU updates and replatforms, multiple kinds of turnover and reorgs, knowledge transfer, etc.
Plus system patching at least monthly, if not daily or even hourly.
Planting a garden is one thing.
Weeding it is another.
donnachangstein
9 days ago
> whereas startups building new products often get by with relatively few people
90% of startups fail within 5 years so probably not the best example of how to run things.
The few that do "succeed" often carry over mountains of cruft and garbage code into perpetuity (for example Reddit).
ksec
9 days ago
Which means there is an opportunity for most of these to run on SaaS rather than in-house. I wish Shopify could help them migrate onto its own platform.
nebula8804
9 days ago
[flagged]
levocardia
9 days ago
I'm sorry but if an enterprise team can't at least get a stopgap ecommerce site up and running in a week, what are you even doing? Literal amateurs can launch a WooCommerce site from nothing in a weekend; two Stanford grads in YC can do a hundred-fold better than that. Yes, a big site is more complicated, maybe there will be some frazzled manual data entry in Excel sheets while your team gets the "real" site back up, but this is total madness.
donnachangstein
9 days ago
> what are you even doing?
Forensics, among a hundred other things.
> Literal amateurs can launch a WooCommerce site from nothing in a weekend
Selling low-volume horseshit out of your garage is in no way comparable to running a major eCommerce site.
> two Stanford grads in YC can do a hundred-fold better than that.
No they literally can't.
> Yes, a big site is more complicated, maybe there will be some frazzled manual data entry in Excel sheets while your team gets the "real" site back up
Great idea, we'll have Chloe in Accounts manage all the orders in a million-row Excel sheet. Only problem might be they come in at 50 orders a minute, but don't worry I hear she's a fast typist.
cjs_ac
9 days ago
Your comment suggests that you're not familiar with the diversity in M&S' operation.
Marks and Spencers started as a department store; they still have this operation. They sell clothes, beauty products, cookware, homeware and furniture. All these things are sold in physical shops and online. Most of this is straightforward for an e-commerce operation, but the furniture will involve separate warehousing and delivery systems.
They also offer financial services (bank accounts, credit cards and insurance). These are white labelled products, but they are closely linked to their loyalty programme (the Sparks card).
Finally, they have their food operation: M&S is also a high-end supermarket. You can't do your food shop on the M&S website (although their food products are available from online-only supermarket Ocado), but you can order some food products (sandwich platters and party food) and fresh flowers from the website.
So M&S is a mid-tier department store and a high-end supermarket. These are very different styles of retail operation: supermarkets require a lot of data processing to get the right things to the right shops at the right time, so that food doesn't go to waste but shoppers also aren't annoyed by the unavailability of staples like bread and milk.
On top of that, M&S is traditionally fairly strong in customer service; it's not exactly Harrods or Fortnum & Mason, but their bra-fitting service, for example, has a legendary reputation. The internet isn't their natural home.
So all-in-all, you have a business doing complicated things online because they think they have to, not because they want to: a pretty clear recipe for disaster.
neepi
9 days ago
Their banking op is a fucking mess as well. Had no end of problems with their card services which were rebranded HSBC.
didroe
9 days ago
How do you know it's safe to redeploy? If your entire operation may be compromised, how can you trust the code hasn't been modified, that some information the attackers have doesn't present a further threat, or that flaws that allowed the attack aren't still present in your services? It's a large company so likely has a mess of microservices and outsourced development where no-one really understands parts of it. Also, if they get compromised again it would be a PR disaster.
They're probably having to audit everything, invest a lot of effort in additional hardening, and re-architect things to try and minimise the impact of any future attack. And via some bureaucratic organisational structure/outsourcing contract.
ajb
8 days ago
You literally have some of your team buy new laptops and hang out in a temporary WeWork to set it up on entirely new infra, air-gapped from your ongoing forensic exercise. You just need to make sure none of the people you send are dumb enough to reuse their passwords. You need to take the domain name, but they will be using one of the high-end domain companies so that can be handled.
Bear in mind that this is a company which still sells physically and has retail and warehouse staff. All the e-commerce side needs to do is issue orders of what SKUs to send to what addresses, and pause items that are out of stock. M&S is not Amazon and doesn't have that many SKUs; 5 people could probably walk round the store in a few days and photograph all of them for the new shopping site.
Sure, customers will need to make a new account or buy as a guest. But this stuff is not hard on the technical side. There is no interaction between customers like a social media site, so horizontal scaling is easy.
Now I get that there are loads of refinements that go into maximising profit, like analytics, price optimization, etc. But to get revenue in, these guys don't even need to set up advertising on day one, because they have customers that have been buying from them for decades. The time to set up all that stuff is when your revenue is nonzero.
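To make the "what SKUs go to what addresses" feed concrete, here's a toy sketch in Python; the CSV format and field names are invented for illustration, not anything M&S actually uses.

```python
import csv

# Toy stock levels and pending web orders; field names are made up for illustration.
stock = {"SKU-1001": 12, "SKU-2002": 0}
orders = [
    {"order_id": "W1", "sku": "SKU-1001", "qty": 2, "address": "1 High St, Leeds"},
    {"order_id": "W2", "sku": "SKU-2002", "qty": 1, "address": "9 Park Rd, Bath"},
]

# Pause anything out of stock; write the rest as a pick list warehouse staff can action.
with open("pick_list.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["order_id", "sku", "qty", "address"])
    writer.writeheader()
    for o in orders:
        if stock.get(o["sku"], 0) >= o["qty"]:
            writer.writerow(o)
            stock[o["sku"]] -= o["qty"]
        else:
            print(f"pausing {o['order_id']}: {o['sku']} out of stock")
```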
prmoustache
8 days ago
> M&S is not Amazon and doesn't have that many SKUs, 5 people could probably walk round the store in a few days and photograph all of them for the new shopping site.
I can't speak for M&S, but all the big physical retail brands that started selling online operate exactly like Amazon, with SKUs coming from various third-party entities. The offering is much bigger than what is sold in the physical shops.
ajb
8 days ago
I had the impression that M&S wasn't, but if that's the case then yeah, that would invalidate my analysis. Especially if even their retail stock goes through that route when bought online.
Oras
8 days ago
I don’t think you realise how complicated e-commerce is for a company like this. You are thinking of a garage sale.
With each order:
- You need warehouse integration to keep the physical and digital stores in sync. That has to happen fast or you’ll take orders with no stock.
- You need to sync the payment to whatever ancient accounting system they use, again while issuing invoices, consolidating customers … etc.
- Logistics management: where to pick the order from, issuing a label, using the right fleet, making sure it is dispatched on time and arrives on time.
- Customer support, refunds, partial refunds, adding items after order … etc.
So yeah, 5 people!
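Even a crude sketch of that flow shows how many systems have to agree on a single order. Everything below is an invented, in-memory stand-in (class and method names are assumptions for illustration); in a real retailer each piece is typically a separate vendor or legacy system, and refunds, partial refunds and customer support are left out entirely.

```python
from dataclasses import dataclass

# In-memory stand-ins for the separate systems a single order has to touch.
class Warehouse:
    def __init__(self, stock):            # SKU -> units on hand
        self.stock = stock
    def reserve(self, order_id, wanted):  # fail fast if any line is short
        if any(self.stock.get(sku, 0) < qty for sku, qty in wanted.items()):
            return False
        for sku, qty in wanted.items():
            self.stock[sku] -= qty
        return True

class Ledger:                              # the "ancient accounting system"
    def __init__(self):
        self.invoices = []
    def record_invoice(self, order_id, amount):
        self.invoices.append((order_id, amount))

class Logistics:                           # label, fleet, dispatch promise
    def dispatch(self, order_id, address):
        return {"order_id": order_id, "address": address, "label": f"LBL-{order_id}"}

@dataclass
class Order:
    order_id: str
    lines: dict                            # SKU -> quantity
    address: str
    total: float
    status: str = "received"

def process_order(order, warehouse, ledger, logistics):
    # 1. Keep physical and digital stock in sync before taking the money.
    if not warehouse.reserve(order.order_id, order.lines):
        order.status = "rejected_out_of_stock"
        return order
    # 2. Book the revenue into accounting.
    ledger.record_invoice(order.order_id, order.total)
    # 3. Hand over to logistics.
    logistics.dispatch(order.order_id, order.address)
    order.status = "dispatched"
    return order

order = process_order(Order("A1", {"SKU-42": 2}, "1 High St", 19.98),
                      Warehouse({"SKU-42": 10}), Ledger(), Logistics())
print(order.status)   # dispatched
```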
ajb
8 days ago
I didn't say 5 people in total
donnachangstein
9 days ago
HN posters love talking gangster shit when something goes offline, but have never walked a mile in those boots.
I most recently remember sifting through gloating that 4chan - a shoestring operation with basically no staff - was offline for a couple weeks after getting hacked.
I've worked at a shop that had DR procedures for EVERYTHING. The recovery time for non-critical infra was measured in months. There are only so many hands to go around, and stuff takes time to rebuild. And that's assuming you have procedures on file! Not to mention if there was a major compromise you need to perform forensics to make sure you kick the bad guys out and patch the hole so the same thing doesn't happen again a week after your magical recovery.
And if you don't know, you shut it down till it's deemed safe. How do you know the backups and failover sites aren't tainted? Nothing worse than running an e-commerce site processing customer payment card data when you know you're owned. That's a good way to get in deeper trouble.
kelnos
9 days ago
I'm not that surprised, though 3-4 months does feel like a long time.
When I was at early Twilio (2011? 2012? ish), we would completely tear down our dev and staging environments every month (quarter? can't remember), and build them back up from scratch. That was everything, including databases (which would get restored from backup during the re-bring-up) and even the deployment infrastructure itself.
At that point we were still pretty small and didn't have a ton of services. Just bringing my product (Twilio Client) back up, plus some of the underlying voice services, took about 24 hours (spread across a few days). And the bits I handled were a) a small part of the whole, and b) some of the easier parts to bring up.
We stopped doing those teardowns sometime later in 2012, or perhaps 2013, because they started taking way too much time away from doing Actual Work. People can't get things done when the staging environment is down for more than a week. Over the following 10 years or so, Twilio's backend exploded in complexity, number of services, and the dependencies between those services.
I left Twilio in early 2022, and I wouldn't have been surprised if it would have taken several months to bring up Twilio (prod) from scratch at that point, though in their case it would be a situation where some products and features would be available earlier than others, so it's not really the same as an e-commerce site. And that was when I left; I'm sure complexity has increased further in the past 3 years.
Also consider that institutional knowledge matters too. I would guess that for all the services running at Twilio, the people who first brought up many (most?) of them are long gone. So I wouldn't be surprised if the people at M&S right now just have no idea how to bring up an e-commerce site like theirs from scratch, and have to learn as they go.
wrs
9 days ago
“If you have source code and release assets.” And a build process that works from a clean code base. And a deploy process that works on fresh servers.
All of which assumes you even know what services exist, which in any company of this age and size you probably don’t.
pavel_lishin
9 days ago
> with backups and failover sites
What a fun pair of assumptions!
chatmasta
9 days ago
The Co-Op (grocery store chain) was hacked around the same time, likely in the same incident. It took three weeks for them to get food back on the shelves at my local store. I don’t understand how that’s even possible… what happened to all the meat and vegetables in the supply chain? They just stopped flowing? They rotted? Why couldn’t they use pen and paper? It’s unbelievable to me that a business would go three weeks without stocking inventory.
tonyhart7
9 days ago
You can only say this because you're ignorant of it; stock inventory is really hard, especially in a huge warehouse where many items come and go 24/7.
They can "move" it of course, but who can guarantee how much goes from where, and to whom????
Paper and pen, when there are a thousand items in a single rack, is a nightmare, I can tell you that.
gosub100
9 days ago
Don't call someone ignorant for asking a question. He said "I don't understand how". If you know the answer, answer. Don't call him ignorant.
tonyhart7
8 days ago
Yeah, I can do better on that part.
At first I thought he was underestimating this part of the industry; I assumed that because it's common on HN to mock tech companies.
chatmasta
9 days ago
well, apparently co-op couldn’t answer those questions with their computers because they got locked out of them…
glenjamin
9 days ago
I chatted with a staff member at the checkout of my local Co-op supermarket.
She said that every shelf item is ordered on a JIT basis as the store stock levels require them - there are no standing orders to a store
Based on that, I presume they didn’t really know what any store would need
Even when they were struggling my local store still had a decent stock of lots of stuff - just some shelves were empty
bobthepanda
9 days ago
You could (and people did) run this in the pre-internet days with basically just phone calls and a desk to receive them. The problem is that doing so now would represent an incredible overnight increase in the manpower required.
grues-dinner
9 days ago
And you need a process to follow. You can't just have nearly 4000 supermarkets ringing up HQ at random and reading out lists of 1000 items each. Then what? Back when a supermarket chain did operate like that, the processes were like "fill in form ABC in triplicate, forward two to department DEF for batching, then forward one to department GHI for supplier orders, and they produce forms XYZ to send to department JKL for turning into orders for dispatch from warehouses". And so on and so on. You can't just magic up that entire infrastructure and knowledge even if you could get the warm bodies to implement it. Everyone who remembers how to operate a system like that is retired or has forgotten the details, all the forms were destroyed years ago and even the buildings with the phones and vacuum tubes and mail rooms don't exist.
Of course you could stand up a whole new system like that eventually, but you could also use the time to fix the computers and get back to business probably sooner.
But I imagine during those 3 weeks, there were a lot of phone calls, ad-hoc processes being invented and general chaos to get some minimal level of service limping along.
7952
8 days ago
I agree, although it seems like a failure of imagination that this is so difficult. The staff will have a good understanding of what usually happens and what needs to happen. What they are lacking is some really basic things that are the natural monopoly of "the system".
Perhaps we need fallback systems that can rebuild some of that utility from scratch...
* A communication channel of last resort that can be bootstrapped. Like an emergency RCS messaging number that everyone is given or even a print/mailing service.
* A way to authenticate people getting in touch using photo ID, archived employee data or some kind of web of trust.
* A way to send messages to everyone using the RCS system.
* A way to commission printing, delivery and collection of printed forms.
* A bot that can guide people to enter data into a particular schema.
* An append only data store that records messages. A filtering and export layer on top of that.
* A way to give people access to an office suite outside of the normal MS/Google subscription.
* A reliable third party wifi/cell service that is detached from your infrastructure.
* A pool of admin people who can run OCR, do data entry.
Basically you onboard people onto an emergency system. And have some basic resources that let people communicate and start spreadsheets.
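As a very rough sketch of the append-only store plus filtering/export layer idea above: a local JSONL file stands in for whatever channel actually survives the outage, and all the names here are made up for illustration.

```python
import json, time

LOG_PATH = "emergency_log.jsonl"   # append-only: never rewrite, only add

def record(store_id, author, kind, payload):
    """Append one message (an order, a stock count, a question) as a JSON line."""
    entry = {"ts": time.time(), "store": store_id, "author": author,
             "kind": kind, "payload": payload}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

def export(kind=None, store_id=None):
    """Filtering/export layer: pull out, say, all stock counts for one store."""
    out = []
    with open(LOG_PATH) as f:
        for line in f:
            e = json.loads(line)
            if kind and e["kind"] != kind:
                continue
            if store_id and e["store"] != store_id:
                continue
            out.append(e)
    return out

record("store-204", "j.smith", "stock_count", {"milk_2l": 14, "bread": 0})
print(export(kind="stock_count", store_id="store-204"))
```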
bobthepanda
5 days ago
part of the problem with emergency systems is that whatever the emergency is, it will take you from zero to over capacity on whatever system you fall back to, particularly if you are requiring communication from suddenly over-burdened human staff working frantically, and those processes may break down because of that.
chatmasta
9 days ago
> Everyone who remembers how to operate a system like that is retired or has forgotten the details
Anyone who’s experienced the sudden emergence of middle management might feel otherwise :) please don’t teach those people the meaning of “triplicate,” they might try to apply it to next quarter’s Jira workflows…
grues-dinner
9 days ago
One day you'll find a sheet of carbon paper in the office laserjet and you'll know it's starting.
I wonder if we could negotiate a return to typewriters and paper if it means individual offices and a tea trolley?
chatmasta
9 days ago
I remember when I was a teenager working the register at a local store. The power went out one day, and we processed credit cards with a device that imprinted the embossed card number onto a paper slip for later reconciliation.
That wouldn’t work today for a number of reasons but it was cool to see that kind of backup plan in place.
phinnaeus
9 days ago
I’ve seen CC impression machines within the past 5 years in small-town America.
fredoralive
8 days ago
In the UK the credit / debit cards I've had issued in the last few years have been flat, with details just printed, so that level of manual processing is presumably defunct here.
OJFord
8 days ago
Don't forget chip & PIN is state of the art novel tech in the US. (From memory I think it was required here in the UK from Valentine's day^ in something like 2005.)
(^I remember the day better than the year because the ad campaign was something like 'I <3 PIN'.)
bobthepanda
6 days ago
that is mostly because major US retailers sued Visa/Mastercard so it couldn't be enforced via lower interchange fees, since they would each have had to change tens of thousands of point-of-sale systems
chatmasta
9 days ago
In my case all the perishable shelves were empty - no fruit, no vegetables, no meat, no dairy. I checked every few days for multiple weeks, and it wasn’t until three weeks after the incident that I was able to buy chicken again.
It’s possible they were ordering some default level of stock and I just didn’t go at the right time to see it, but it sure looked like they were missing the inventory… when I first asked the lady “is the food missing because of the bank holiday?” and she said “no, because of the cyber attack”, I thought she was joking! It reminded me of the March 2020 shelves.
Henchman21
9 days ago
You forget we have entered the “Who the fuck cares?” era. When no one in the chain is incentivized to care, things just fall apart.
chatmasta
9 days ago
Interestingly Co-Op is so-called because it’s a cooperative business, which vaguely means it’s owned by its employees, and technically means it’s a “Registered Society” [0].
If you check Companies House [1], which normally has all financial documents for UK corporations, it points you to a separate “Public Register” for the Co-Op [2].
So, your comment has more basis in reality than simply being snark… the fact that “nobody is incentivized to care” is actually by design. That has some positive benefits but in this case we’re seeing how it breaks down for the same reasons nobody in a crowd calls an ambulance for someone hurt… it’s the bystander effect applied to corporate governance with diluted accountability.
[0] https://www.gov.uk/hmrc-internal-manuals/company-taxation-ma...
[1] https://find-and-update.company-information.service.gov.uk/c...
bonaldi
9 days ago
I’m not following your logic. The co-op is designed for everyone to care _more_ because they are part-owners and because the organisation is set up for a larger good than simple profit-making.
In practice the distinction has long been lost both for employees and members (customers), but the intent of the organisational structure was not for nobody to care; quite the opposite
chatmasta
9 days ago
But there are millions of part-owners. Every “member” of co-op (i.e. a customer in the same membership program that just lost all their data to this hack) is an owner of it. Maybe the employees get more “shares” but it’s not at all significant.
And at the executive governance level, there are a few dozen directors.
There is a CEO who makes £750k a year, so it has elements of traditional governance. I’m not saying the structure is entirely to blame for the slow reaction to the hack, or that there is zero accountability, but it’s certainly interesting to see the lack of urgency to restore business continuity.
My family used to own a local market, and as my dad said when I told him this story, “my father would have been on the farm killing the chickens himself if that’s what he had to do to ensure he had inventory to sell his customers.”
You simply won’t get that level of accountability in an organization with thousands of stakeholders. And a traditional for-profit corporation will have the same problems, but it will also have a stock price that starts tanking after half a quarter of empty shelves. The co-op is missing that sort of accountability mechanism.
Henchman21
9 days ago
Responsibility diluted to the point of no actual responsibility?
chatmasta
9 days ago
Exactly, the bystander effect. But it’s not strictly due to the large size. Other big companies get hacked too. But if they have a stock price then there’s an obvious metric to indicate when the CEO needs to be fired. It’s the dilution of responsibility combined with a lack of measurable accountability that causes the dysfunction.
grues-dinner
9 days ago
The problem is that cutting IT and similar functions to the bone is really good for CEOs. It juices the profits in the short/mid term, the stock price goes up because investors just see line go up, money goes in, and the CEO gets plaudits. There's only one figure of merit: stock price. What you measure is what you get.
It's only much later that the wheels fall off and it all goes to hell. The hack isn't a result of the CEOs actions this quarter, it's years and years of cumulative stock price optimisation for which the CEO was rewarded.
And you can't even blame all the investors, because many will be diluted and mixed through funds and pensions. Is Muriel to blame because her private pension, which everyone told her is good and responsible financial planning, invested in Co-Operative Group on the back of strong growth and "business optimisation initiatives"? Is she supposed to call up Legal and General and say "look, I know 2% of my pension is invested in Co-Op Group Ltd and it's doing well, and yes I'm with you guys because you have good returns, but I'm concerned their supermarket division is outsourcing their IT too much, could you please reduce my returns for the next few years and invest in companies that make less money by doing the IT more correctly?"
The incentives are fucked from end to end.
Henchman21
9 days ago
I guess this is more snark, but honestly I am genuinely shocked when people care about anything anymore. Sad times.
chatmasta
9 days ago
There is a serious crisis of competence and caring all throughout society and it is indeed frightening. It’s this nagging worry that never goes away, while little cracks keep appearing in the mechanisms we usually take for granted…
coliveira
9 days ago
When everything is done by computers, no human really knows what needs to be done, even for something as simple as buying vegetables.
TheOtherHobbes
9 days ago
Buying and distributing vegetables for stores is not remotely a simple thing. It includes statistical analysis with estimates of demand for every store, seasonal scheduling, weather awareness, complicated national and/or international logistics, plus accounting and payments.
Some or all of those may be broken during a cyberattack.
chatmasta
9 days ago
That’s a good point, but perhaps you underestimate the ingenuity born of constraints.
If you’ve got trucks arriving with meat that’s going to expire in a week, and all your stores have empty shelves, surely there is a system to get that meat into customers’ mouths before it expires. It could be as simple as asking each store, when they call (which they surely will), how much meat they ordered last week, and sending them the same this week. You could build out more complicated distribution mechanisms, but it should be enough to keep your goods from perishing until you manage to repair your digital crutch.
7952
8 days ago
The suppliers will know and be able to predict what a large customer like M&S is likely to order. They will probably be preparing items before they are even ordered. And surely there must be some kind of understanding of what a typical store will receive.
halpow
9 days ago
[dead]
cameronh90
9 days ago
The British Library still aren't fully back up and running after their cyberattack in Oct 2023: https://www.bl.uk/cyber-incident/
tw04
9 days ago
So you haven’t dealt with ransomware gangs yet? Because they have gotten sophisticated enough to nuke source code repos and backups and replicated copies.
It’s part of the reason tape is literally never going to die for organizations with data that simply cannot be lost, regardless of RTO.
dylan604
9 days ago
For this particular audience, it's one of those things that could be rewritten in Rust over a weekend and then deployed on the cheap via Hetzner. At least then it'll be memory safe!
briffle
9 days ago
of course, if you redeployed everything from the source code, you could very well still have the same vulnerabilities that caused the problem in the first place..
internetter
9 days ago
There are no backups. There are no failovers. There is no git. There are no orchestration and deployment strategies. Programmers SSH into the server and edit code there. Years and years of patchwork on top of patchwork, with tightly coupled code.
Such is a taste of what it takes to end up with a service that needs months to set back up after any disruption.
squiffsquiff
9 days ago
This is an ignorant position. Look at e.g. https://engineering.marksandspencer.com/mobile/2024/09/05/re...
throwawaymgb123
9 days ago
This is a perfect description of how things work at one of the largest health care networks in the northeast US (speaking as someone who works there and keeps saying "where's the automation? where are the procedures?" and keeps being told to shut up, we don't have TIME for that sort of thing).
internetter
9 days ago
lol the healthcare industry was definitely in my mind as I wrote this. Never worked there but I read a lot of postmortems and it shows whenever I use their digital products. Recent example is CVS.
Somehow, at some point, they decided that my CVS pharmacy account should be linked to my Mom's extracare. Couldn't find any menu to fix it online. So the next time I went to the register I asked to update it. They read the linked phone number. It was mine. Ok, it is fixed, I think. But then the receipt prints out and it is my mom's Extracare card number. So the next time I press harder. I ask them to read me the card number they have linked from their screen. They read my card number. Ok, it is fixed, I think. But then the receipt prints out and the card number is different—it is my mom's. Then I know the system is incredibly fucked. Being an engineer, I think about how this could happen. I'm guessing there are a hundred database fields where the extracare number is stored, and only one is set to my mom's or something. I poke around the CVS website and find countless different portals made with clearly different frameworks and design practices. Then I know all of CVS's tech looks like this and a disaster is waiting to happen.
Goes like this for a lot of finance as well.
E.g. I can say with confidence that Equifax is still as scuffed as it was back in 2017 when it was hacked. That is a story for another time.
Nobody bothers to keep things clean until it is too late. The features you deliver give promotions, not the potential catastrophes you prevent. Humans have a tendency to be so short sighted, chasing endless earnings beats without anticipating future problems.
aspenmayer
9 days ago
If you don't have time to prepare for failure, then you'll have little time to invest in success, either, if/when failure strikes.
98codes
9 days ago
[citation needed]
internetter
9 days ago
Sorry if I phrased it poorly. I wasn’t definitively saying that all of these things are the case. But what always is the case is that when an attack takes down an organization for months, it was employing a tremendous number of horrendous practices. My list was meant to be a sample.
M&S isn’t down for months because of something innocuous like a full security audit. As a public company losing tens of millions of dollars a week, their only priority is to stop the bleeding, even if that means a hasty partial restoration. The fact they can’t even do that suggests they did things terribly wrong. There’s an infinite number of other things I didn’t list that could also be the case, like Amazon having given them proprietary blobs they lost after the attack and that Amazon won’t provide again. But no matter what the causes are, things were wrong beyond belief. That is a given.
pavel_lishin
9 days ago
To be fair, I would bet that nearly every organization employs a tremendous number of horrendous practices. We only gasp at the ones who get taken down for some reason.
internetter
9 days ago
Horrendous practices exist on a spectrum. Every org has bad code that somebody will fix someday™. It is reasonable to expect that after a catastrophic event like this, a full recovery takes some time. But at a "good" org, these practices are isolated. Not every org is entirely held together with masking tape. For the entire thing to be down for so long, the bad practices need to be widespread, seeping into every corner of the product. Ubiquitous.
For instance, when Cloudflare all went down a while ago due to a bad regex, it took less than an hour to roll back the changes. Undoubtedly there were bad practices that led to a regex having the ability to take everything out, but the problem was isolatable, and once addressed, partial service was quickly restored, with preventative measures employed shortly after. This bug didn't destroy Cloudflare for months.
P.S. In anticipation of the "but Cloudflare has SLAs!!": that isn't really a distinction worth making, because M&S has an implicit SLA with their customers — they are losing 40 million each week they can't offer service. Plenty of non-B2B companies invest in quick recovery as well, like Netflix with its Chaos Monkey testing.
PaulHoule
9 days ago
No, best practice is that you have a checklist to bring up a copy of your system; better yet, that checklist is just "run a script". In the cloud age you ought to be able to bring a copy up in a new zone with a repeatable procedure.
Makes a big difference in developer quality of life and improves productivity right away. If you onboard a new dev you give them a checklist and they are up and running that day.
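A minimal sketch of the "checklist is a script" shape, with made-up commands and service names rather than any real company's tooling; the point is that the same entry point serves new-dev onboarding and disaster recovery.

```python
import subprocess
import sys

# The checklist *is* the script: each step is named, ordered, and re-runnable.
# Every command and service name below is a hypothetical placeholder.
STEPS = [
    ("provision infra in target zone", "./provision_infra.sh --zone {zone}"),
    ("restore latest database backup", "./restore_db.sh --zone {zone} --from latest"),
    ("deploy services",                "./deploy.sh --zone {zone} api worker frontend"),
    ("smoke test",                     "./smoke_test.sh --zone {zone}"),
]

def bring_up(zone, dry_run=True):
    for i, (name, template) in enumerate(STEPS, 1):
        cmd = template.format(zone=zone)
        print(f"[{i}/{len(STEPS)}] {name}: {cmd}")
        if not dry_run:
            subprocess.run(cmd, shell=True, check=True)  # stop at the first failure

if __name__ == "__main__":
    # Dry run by default: print the checklist; pass a zone to target a new region.
    bring_up(zone=sys.argv[1] if len(sys.argv) > 1 else "zone-a", dry_run=True)
```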
I had a coworker who taught me a lot about sysadmining, (social) networking, and vendor management. She told me that you'd better have your backup procedures tested. One time we were doing a software upgrade and I screwed up and dropped the Oracle database for a production system. She had a mirror in place so we had less than a minute of downtime.