ksajadi
2 days ago
I’ve been building and deploying thousands of stacks, first on Docker, then Mesos, then Swarm, and now k8s. If I have learned one thing from it, it’s this: it’s all about the second day.
There are so many tools that make it easy to build and deploy apps to your servers (with or without containers), and all of them showcase how easy it is to go from a cloud account to a fully deployed app.
While their claims are true, what they don’t talk about is how to maintain the stack after “reclaiming” it. Version changes, breaking changes, dependency changes and missing dependencies, disaster recovery plans, backups and restores, and major shifts in requirements all add up to a large portion of your time.
If you have the kind of team, budget, or problem that justifies those, then more power to you.
AnAnonyCowherd
2 days ago
> If you have the kind of team, budget, or problem that justifies those, then more power to you.
This is the operative issue, and it drives me crazy. Companies that can afford to deploy thousands of services in the cloud definitely have the resources to develop in-house talent for hosting all of that on-prem, saving millions per year. However, middle management in the Fortune 500 has been indoctrinated by the religion that you take your advice from consultants and push everything to third parties, so that 1) you build your "kingdom" with a terribly wasteful budget, and 2) you can never be blamed if something goes wrong.
As a perfect example, in my Fortune 250, we have created a whole new department to figure out what we can do with AI. Rather than spend any effort developing in-house expertise with a new technology that MANY of us recognize could revolutionize our engineering workflow... we're buying Palantir's GenAI product, and using it to... optimize plant safety. Whatever you know about AI, it's fundamentally based on statistics, and I simply can't imagine a worse application than trying to find patterns in data that BY DEFINITION is all outliers. I literally can't even.
You smack your forehead and wonder why the people at the top, making millions in TC, can't understand such basic things. But after years of seeing these kinds of short-sighted, wasteful, foolish decisions, you begin to understand that improving the company's abilities and making it competitive for the future is not the point. What the point actually is "is left as an exercise for the reader."
tempodox
2 hours ago
> we have created a whole new department to figure out what we can do with AI.
Wow, this is literally the solution in search of a problem.
wg0
2 days ago
This is absolutely true. I can easily count some 20+ components already.
So this is not a walk in the park for two willing developers learning k8s.
The underlying apps (Redis, ES) will have version upgrades.
Their respective operators themselves would have version upgrades.
Essential networking fabric (Calico, Flannel and such) would have upgrades.
The underlying kubernetes itself would have version upgrades.
The Talos Linux itself might need upgrades.
Of all the above, any single upgrade might lead to the infamous crash loop, where a pod starts and dies with little to no indication as to why. And not just any pod, but a crucial pod that is part of some operator supposed to do the housekeeping for you.
k8s was invented at Google and is more suitable for a ZIRP world where money is cheap and, to change the logo, you have seven designers on payroll discussing for eight months how nine different tones of brand coloring might convey ten different subliminal messages.
imiric
2 days ago
> The underlying apps (Redis, ES) will have version upgrades.
You would have to deal with those with or without k8s. I would argue that dealing with them without it is much more painful.
> Their respective operators themselves would have version upgrades.
>
> Essential networking fabric (Calico, Flannel and such) would have upgrades.
>
> The underlying kubernetes itself would have version upgrades.
>
> The Talos Linux itself might need upgrades.
How is this different from regular system upgrades you would have to do without k8s?
K8s does add layers on top that you also have to manage, but in return it solves a bunch of problems that you would otherwise have to solve yourself one way or another.
That essential networking fabric gives you a service mesh for free, which lets you easily deploy, scale, load-balance, and manage traffic across your entire infrastructure. Building that yourself would take many person-hours and a large team to maintain, whereas k8s lets you run it with a fraction of the effort and a much smaller team.
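To make that concrete, here is a minimal sketch of the built-in service discovery and load balancing (the names and image are placeholders, not anything from the project):

```yaml
# A Deployment plus a Service. The Service load-balances across all
# pods matching the selector; scaling is a one-field change.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      containers:
        - name: web
          image: nginx:1.27   # placeholder image
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector: { app: web }
  ports:
    - port: 80
      targetPort: 80
```

Bumping `replicas` or the image tag is a one-line change and the cluster reconciles the rest; hand-rolling the equivalent means writing and maintaining your own proxy configs and deploy scripts.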
Oh, you don't need any of that? Great. But I would wager that the hodgepodge solution you build, and have to maintain years from now, will take much more of your time and effort than if you had chosen an industry standard. By that point, just switching would be a monumental effort.
> Of all the above, any single upgrade might lead to the infamous crash loop, where a pod starts and dies with little to no indication as to why.
Failures and bugs are inevitable. Have you ever had to deal with a Linux kernel bug?
The modern stack is complex enough as it is, and while I'm not advocating for increasing that complexity, if those additional components solve major problems for me and have become an industry standard, then it would be foolish to go against the grain and reinvent each component as I need it.
mplewis
2 days ago
You seem to be misunderstanding. The components that add complexity in this case do not come from running a k8s cluster. They come from the Reclaim the Stack software.
imiric
2 days ago
Alright. So let's discuss how much time and effort it would take to build and maintain a Heroku replacement without k8s, then.
Besides, GP's criticisms were squarely directed at k8s. For any non-trivial workload, you will likely use operators and networking plugins. Any of these can have bugs, and they will add complexity to the system. My point is that if you find any of those features valuable, then the overall cost will be much less than that of the alternatives.
Maxion
2 days ago
The alternative is not to build a different PaaS, but simply to pay Heroku/AWS/Google/indie PaaS providers and get back to making your core product.
imiric
2 days ago
Did you read the reasons they moved away from Heroku to begin with? Clearly what you mention wasn't an option for them, and they consider this project a success.
sgarland
2 days ago
Talos is an immutable OS; upgrades are painless and roll themselves back upon failure. Same thing for K8s under Talos (the only thing Talos does is run K8s).
specialist
2 days ago
TIL "immutable OS", thanks.
Ages ago, I had the notion of booting from removable read-only media, at the time CD-ROM. Think gear for casting and tabulating votes, or controllers for critical infra.
(Of course, a device's bootloader would have to be in ROM too. And boot images would be signed, both digitally and manually.)
Maybe "immutable boot" and immutable OS can be complimentary. Surely someone's already explored this (obv idea). Worth pondering.
benjaminwootton
2 days ago
The flip side of this is cost. Managed cloud services make it faster to go live, but then you are left paying managed service providers for years.
I’ve always been a big cloud/managed-services guy, but the costs are getting astronomical, and I agree the buy-vs-build calculus for the stack needs a re-evaluation.
Maxion
2 days ago
This is the balance, right? For the vast majority of web apps et al., the cloud costs are going to be cheaper than having full-time ops people managing an OSS stack on VPSes or bare metal.
szundi
2 days ago
And what is your take on all those things that you tried? Some experience/examples would probably benefit us.
bsenftner
2 days ago
The thing that strikes me is: okay, two "willing developers" - but they need to be actually capable, not just "willing" but "experienced and able", and that lands you at a minimum of $100k per year per engineer. That means this system has a maintenance cost of over $16k per month (2 × $100k / 12 ≈ $16.7k) if you have to dedicate two engineers full-time to maintenance, and of course to following the dynamic nature of K8s and all its tooling just to stay in front of all of that.
0perator
2 days ago
Also, with only two k8s devops engineers in a 24h-available world, you’re going to run them ragged with 12h solo shifts or take the risk of not staffing overnight. Considering most update and backup jobs kick off at midnight, that’s a huge risk.
If I were putting together minimum-viable staffing for a 24x7 cluster with SLAs on RPO and RTO, I’d recommend much more than two engineers. I’d probably recommend closer to five: one senior engineer and one junior for the 8-4 shift, an engineer for the 4-12 shift, another engineer for the 12-8 shift, and another junior who straddles the evening and night shifts. For major outages, this still requires on-call time from all of the engineers, and additional staffing may be necessary to offset overtime hours. Given your metric of roughly $8k per engineer per month, we’d be looking at a cool $40k/month in labour just to approach four or five 9s of availability.
oldprogrammer2
2 days ago
Even worse, this feels like the goal was actually about reclaiming their resumes, not the stack. I expect these two guys to jump ship within a year, leaving the rest of the team trying to take care of an entire ecosystem they didn't build.
Maxion
2 days ago
And you may still end up with longer downtime if SHTF than you would with a managed provider.
tomwojcik
2 days ago
Agreed. Forgive a minor digression, but what OP wrote is exactly my problem right now. I'm looking for something like Heroku's or Fly's release command. I have an idea of how to implement it in Docker using Swarm, but I can't figure out how to do it on k8s. I googled it some time ago, but all the answers were hacks.
Would someone be able to recommend a non-hack approach for implementing a custom release command on k8s? Downtime is fine, but this one-off job needs to run before the user-facing pods become available.
psini
2 days ago
Look at Helm charts; they have become the de facto standard for packaging/distributing/deploying/updating whole apps on Kubernetes.
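For the release-command case specifically, the usual non-hack pattern is a Job annotated as a Helm pre-install/pre-upgrade hook: Helm runs it to completion before rolling out the new pods, and fails the release if the Job fails. A minimal sketch, assuming you deploy with Helm (the app name, image, and migrate command are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: myapp-release
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    # Remove the previous hook Job before creating a new one,
    # and clean up once it succeeds.
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  backoffLimit: 0                # fail the release rather than retry forever
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: release
          image: registry.example.com/myapp:1.2.3   # placeholder image
          command: ["bin/rails", "db:migrate"]      # placeholder release command
```

This matches Heroku's release-phase semantics. Without Helm, the same idea works as a plain Job that you `kubectl apply` and then block on with `kubectl wait --for=condition=complete job/myapp-release` before updating the Deployment.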
trashburger
2 days ago
https://leebriggs.co.uk/blog/2019/02/07/why-are-we-templatin...
Something like Jsonnet would serve one better, I think. The only part that kinda sucks is the "package management", but that's a small price to pay to avoid the YAML insanity. Helm is fine for consuming third-party packages.
imiric
2 days ago
Agreed, but to be fair, those are general problems you would face with any architecture. At least with mainstream stacks you get the benefit of community support and of relying on approaches that someone else has already figured out. Container-based stacks also have the benefit of homogenizing your infrastructure and giving you a common set of APIs and workflows to interact with.
K8s et al. are not a silver bullet, but at this point they're highly stable and well-understood pieces of infrastructure. It's much more painful to deviate from them and build things from scratch while deluding yourself that your approach can be simpler. For trivial and experimental workloads that may be the case, but for anything that requires a bit more sophistication, these tools end up saving you resources in the long run.
sedatk
2 days ago
> it’s all about the second day
Tangentially, I think this applies to LLMs too.