Kamal Proxy – A minimal HTTP proxy for zero-downtime deployments

111 points, posted 8 hours ago
by norbert1990

57 Comments

ksajadi

8 minutes ago

This primarily exists to take care of a fundamental issue in Docker Swarm (Kamal's orchestrator of choice), where replacing the containers of a service disrupts traffic. We had the same problem (when building JAMStack servers at Cloud 66) and used Caddy instead of writing our own proxy; we also looked at Traefik, which would have been just as suitable.

I don't know why Kamal chose Swarm over k8s or k3s (simplicity, perhaps?), but then, complexity needs a home: you can push it around but you cannot hide it, hence a home-grown proxy.

I haven't tried Kamal Proxy, so I can't say for sure, but I am highly skeptical of something like this, because I am pretty sure I will end up chasing it for support for everything from WebSockets to SSE to HTTP/3 to various types of compression and encryption.

viraptor

38 minutes ago

It's an interesting choice to make this a whole app, when zero-downtime deployments can be achieved trivially with other servers these days. For example, any app-plus-web-proxy setup that supports Unix sockets can do a zero-downtime swap by moving the socket file: it's atomic, and you can send the warm-up requests with curl. Building a whole system with registration feels like overkill.
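A minimal sketch of that approach, assuming a hypothetical `./app` binary that can listen on a Unix socket and a front proxy (nginx, Caddy, etc.) pointed at `unix:/run/app/current.sock`; the paths and health endpoint are made up:

```sh
# start the new version on a fresh socket, alongside the old one
./app --listen unix:/run/app/next.sock &

# warm it up before it takes real traffic
curl --unix-socket /run/app/next.sock -fsS http://localhost/health

# rename(2) is atomic: new connections now reach the new process,
# while the old process keeps serving connections it already accepted
mv -f /run/app/next.sock /run/app/current.sock

# stop the old process once it has drained
```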

000ooo000

5 hours ago

Strange choice of language for the actions:

>To route traffic through the proxy to a web application, you *deploy* instances of the application to the proxy. *Deploying* an instance makes it available to the proxy, and replaces the instance it was using before (if any).

>e.g. `kamal-proxy deploy service1 --target web-1:3000`

'Deploy' is a fairly overloaded term already. Fun conversations ahead. Is the app deployed? Yes? No, I mean is it deployed to the proxy? Hmm, our Kamal proxy script is gonna need some changes and a redeployment so that it deploys the deployed apps to the proxy correctly.

Unsure why they couldn't have picked something like 'bind', or 'intercept', or even just 'proxy'... why 'deploy'?

nahimn

2 hours ago

“Yo dawg, i heard you like deployments, so we deployed a deployment in your deployment so your deployment can deploy” -Xzibit

8organicbits

2 hours ago

If your ingress traffic comes from a proxy, what would deploy mean other than that traffic from the proxy is now flowing to the new app instance?

irundebian

4 hours ago

Deployed = running and registered to the proxy.

vorticalbox

an hour ago

Then wouldn’t “register” be a better term?

blue_pants

4 hours ago

Can someone briefly explain how ZDD (zero-downtime deployment) works in general?

I guess both versions of the app must be running simultaneously, with new traffic being routed to the new version of the app.

But what about DB migrations? Assuming the app uses a single database, and the new version of the app introduces changes to the DB schema, the new app version would modify the schema during startup via a migration script. However, the previous version of the app still expects the old schema. How is that handled?

diggan

3 hours ago

The first step is to decouple migrations from deploys: you want manual control over when migrations run, contrary to many frameworks' default of running them when you deploy the code.

Secondly, each code version has to work with the current schema and the schema after a future migration, making all code effectively backwards compatible.

Your deploys end up being something like this (sketched below):

- Deploy new code that works with current and future schema

- Verify everything still works

- Run migrations

- Verify everything still works

- Clean up the acquired technical debt (the code that worked with the schema that no longer exists) at some point, or run out of runway and it won't be an issue
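A rough sketch of that sequence, assuming a Rails-style app deployed with Kamal; the health-check URL and the exact migration command are illustrative, not something Kamal prescribes:

```sh
kamal deploy                              # 1. ship code that works with old and new schema
curl -fsS https://app.example.com/up      # 2. verify
kamal app exec 'bin/rails db:migrate'     # 3. run migrations as a separate, deliberate step
curl -fsS https://app.example.com/up      # 4. verify again
# 5. later: drop the compatibility code in a normal follow-up deploy
```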

svvvy

23 minutes ago

I thought it was correct to run the DB migrations for the new code first and then deploy the new code, while making sure that the DB schema is backwards compatible with both versions of the code that will be running during the deployment.

So maybe there's something I'm missing about running DB migrations after the new code has been deployed - could you explain?

ffsm8

a minute ago

I'm not the person you've asked, but I've worked in devops before.

It kinda doesn't matter which you do first. And if you squint a little, it's effectively the same thing, because the migration will likely only become available via a deployment too.

So yeah, the only thing that's important is that the DB migration can't cause an incompatibility with either version of the code - and if it would, you'll have to split the change across live deployments so it doesn't.
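As an illustration of that kind of split, a hypothetical expand/contract sequence for renaming `users.name` to `users.full_name`, with each step shipped around its own deploy so every intermediate schema works with both code versions (table and column names are made up):

```sh
# Deploy 1 (expand): add the new column; new code writes both, still reads the old one
psql "$DATABASE_URL" -c 'ALTER TABLE users ADD COLUMN full_name text'

# Deploy 2: backfill, then ship code that reads the new column
psql "$DATABASE_URL" -c 'UPDATE users SET full_name = name WHERE full_name IS NULL'

# Deploy 3 (contract): once nothing references the old column any more, drop it
psql "$DATABASE_URL" -c 'ALTER TABLE users DROP COLUMN name'
```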

wejick

2 hours ago

This is a very good explanation, no judgment and simply educational. Appreciated.

Though I'm still surprised that some people run DB alterations on application startup. I've never seen that in real life.

diggan

2 hours ago

> Though I'm still surprised that some people run DB alteration on application start up

I think I've seen it more commonly in the Golang ecosystem, for some reason. I'm also not sure how common it is nowadays, but I've seen lots of deployments (contained in Ansible scripts, Makefiles, Bash scripts or whatever) where migration and deploy are run in sequence automatically on each deploy, rather than as discrete steps.

Edit: Maybe it's more of an educational problem than anything else, where learning resources either don't specify when to actually run migrations or straight up recommend running migrations on application startup (one example: https://articles.wesionary.team/integrating-migration-tool-i...)

miki123211

2 hours ago

It makes things somewhat easier if your app is smallish and your workflow is something like GitHub Actions automatically deploying all commits on main to Fly or Render.

andrejguran

3 hours ago

Migrations have to be backwards compatible so the DB schema can serve both versions of the app. It's an extra price to pay for having ZDD or rolling deployments and something to keep in mind. But it's generally done by all the larger companies

efortis

3 hours ago

Yes, both versions must be running at some point.

The load balancer starts accepting connections on Server2 and stops accepting new connections on Server1. Then, Server1 disconnects when all of its connections are closed.

It could be different servers, or multiple workers on one server.

During that window, as the other comments said, migrations have to be backwards compatible.
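A simplified sketch of that window with plain nginx in front, assuming an upstream defined in `/etc/nginx/conf.d/app.conf` and made-up backend names; a reload starts new workers on the new config while the old workers finish their open connections before exiting:

```sh
# point the upstream at the new server
sed -i 's/server old-app:3000;/server new-app:3000;/' /etc/nginx/conf.d/app.conf

# graceful reload: new connections go to new-app, existing ones drain on old-app
nginx -t && nginx -s reload
```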

stephenr

3 hours ago

Others have described the how, if you do need truly zero-downtime deployments. But it's worth pointing out that for most organisations, and most migrations, the downtime due to a DB migration is virtually indistinguishable from zero, particularly if you have a regional audience and can aim for "quiet" hours to perform deployments.

diggan

2 hours ago

> the amount of downtime due to a db migration is virtually indistinguishable from zero

Besides, once you've run a service long enough to acquire data that makes migrations slow, you realize there are in fact two different types of migrations: "schema migrations", which are generally fast, and "data migrations", which depending on the amount of data can take seconds or days. Or you can do the data migrations on the fly, when the data is needed, instead of processing everything up front. That can get gnarly quickly, though.

Splitting those also allows you to reduce maintenance downtime if you don't have zero-downtime deployments already.

stephenr

2 hours ago

Very much so, we handle these very differently for $client.

Schema migrations are versioned in git with the app, with up/down (or forward/reverse) migration scripts and are applied automatically during deployment of the associated code change to a given environment.

SQL Data migrations are stored in git so we have a record but are never applied automatically, always manually.

The other thing we've used along these lines is having one or more low-priority jobs added to a queue to apply some kind of change to records. These are essentially still data migrations, but they're written as part of the application code base (as a Job) rather than in SQL.

kh_hk

4 hours ago

I don't understand how to use this, maybe I am missing something.

Following the example, it starts 4 replicas of a 'web' service. You can create a service by running a deploy against one of the replicas, let's say example-web-1. What do the other 3 replicas do?

Now, let's say I update 'web'. Let's assume I want to do a zero-downtime deployment. That means I should be able to run a build command on the 'web' service, start this service somehow (maybe by adding an extra replica), and then run a deploy against the new target?

If I run `docker compose up --build --force-recreate web`, this will bring down the old replica, rendering the whole exercise moot.

Instructions unclear, can anyone chime in and help me understand?

sisk

16 minutes ago

For the first part of your question about the other replicas, Docker will load balance between all of the replicas, either with a VIP or by returning multiple IPs in the DNS request[0]. I didn't check whether this proxy balances across multiple records returned in a DNS request but, at least in the case of VIP-based load balancing, it should work like you would expect.

For the second part about updating the service, I'm a little less clear. I guess the expectation would be to bring up a differently-named service within the same network, and then `kamal-proxy deploy` it? So maybe the expectation is for service names to include a version number? Keeping the old version hot makes sense if you want to quickly be able to route back to it.

[0]: https://docs.docker.com/reference/compose-file/deploy/#endpo...
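If that reading is right, the update flow would look something like this; the container names, network name and port are made up, and the deploy flags are just the ones from the README example:

```sh
# bring up the new version alongside the old one, on the same Docker network
docker run -d --name web-2 --network kamal example/web:v2

# repoint the service; per the README, this makes web-2 the instance the proxy
# routes to and replaces the one it was using before
kamal-proxy deploy service1 --target web-2:3000

# once the old target has drained, remove it
docker rm -f web-1
```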

thelastparadise

3 hours ago

Why would I not just do k8s rollout restart deployment?

Or just switch my DNS or router between two backends?

ozgune

3 minutes ago

I think the parent project, Kamal, positions itself as a simpler alternative to K8s when deploying web apps. They have a question on this on their website: https://kamal-deploy.org

"Why not just run Capistrano, Kubernetes or Docker Swarm?

...

Docker Swarm is much simpler than Kubernetes, but it’s still built on the same declarative model that uses state reconciliation. Kamal is intentionally designed around imperative commands, like Capistrano.

Ultimately, there are a myriad of ways to deploy web apps, but this is the toolkit we’ve used at 37signals to bring HEY and all our other formerly cloud-hosted applications home to our own hardware."

joeatwork

3 hours ago

I think this is part of a lighter weight Kubernetes alternative.

ianpurton

3 hours ago

Lighter than the existing lightweight Kubernetes alternatives, i.e. k3s :)

diggan

3 hours ago

Or, hear me out: Kubernetes alternatives that don't involve any parts of Kubernetes at all :)

jgalt212

3 hours ago

You still need some warm-up routine to run against the newly online server before the hand-off occurs. I'm not a k8s expert, but the events described above can easily be handled by a bash or fab script.
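Something like this is usually all the script has to do (hostnames and paths are placeholders):

```sh
# hypothetical warm-up: prime caches / spawn workers on the new instance
# before it goes into rotation
for path in / /search /dashboard; do
  curl -fsS -o /dev/null "http://new-server:3000${path}" || exit 1
done
# ...then flip the proxy / DNS / load balancer over to new-server
```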

thelastparadise

6 minutes ago

This is a health/readiness probe in k8s. It's already solved quite solidly.

ahoka

2 hours ago

What events do you mean? If the app needs a warm-up, it can use its readiness probe to delay traffic until it's ready to have requests routed to it.
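For reference, a readiness probe is just a few fields on the container spec; one way to add one from the command line (the deployment name, port and path are made up):

```sh
# the pod only receives traffic once /healthz answers successfully
kubectl patch deployment web --type=json -p '[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/readinessProbe",
   "value": {"httpGet": {"path": "/healthz", "port": 3000},
             "initialDelaySeconds": 5,
             "periodSeconds": 5}}
]'
```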

jgalt212

2 hours ago

GET requests to pages that fill caches, or ones that make Apache start up more than n processes.

0xblinq

an hour ago

3 years from now they'll have invented their own AWS. NIH syndrome in full swing.

bdcravens

30 minutes ago

It's a matter of cost, not NIH syndrome. In Basecamp's case, saving $2.4M a year isn't something to ignore.

https://basecamp.com/cloud-exit

Of course, it's fair to say that rebuilding the components that the industry uses for hosting on bare metal is NIH syndrome.

moondev

3 hours ago

Does this handle a host reboot?

risyachka

3 hours ago

Did they mention anywhere why they decided to write their own proxy instead of using Traefik or something else battle tested?

ianpurton

3 hours ago

DHH in the past has said "This setup helped us dodge the complexity of Kubernetes"

But this looks somewhat like a reinvention of what Kubernetes provides.

Kubernetes has come a long way in terms of ease of deployment on bare metal.

wejick

3 hours ago

Zero-downtime deployment has been around since long before Kubernetes. And this does look about as simple as it has ever been, certainly not like Kubernetes.

oDot

5 hours ago

DHH mentioned they built it to move from the cloud to bare metal. He glorifies the simplicity, but I can't help thinking they are a special case: predictable, non-huge load.

Uber, for example, moved to the cloud. I feel like in the span between them there are far more companies for which Kamal is not enough.

I hope I'm wrong, though. It would be nice for many companies to have the choice of exiting the cloud.

martinald

4 hours ago

I don't think that's the real point. The real point is that 'big 3' cloud providers are so overpriced that you could run hugely over provisioned infra 24/7 for your load (to cope with any spikes) and still save a fortune.

The other thing is that cloud hardware is generally very, very slow, and many engineers don't seem to appreciate how bad it is. Slow single-thread performance because of using the most parallel CPUs possible (which are the cheapest per watt for the hyperscalers), very poor I/O speeds, etc.

So a lot of this devops/infra work is often solved by just using much faster hardware. If you have a fairly I/O-heavy workload, switching from slow storage to PCIe 4.0 NVMe drives doing 7 GB/s is going to solve a lot of problems. If your app can't do much work in parallel, CPUs with much faster single-threaded performance can bring huge gains.

miki123211

an hour ago

You can always buy some servers to handle your base load, and then get extra cloud instances when needed.

If you're running an ecommerce store for example, you could buy some extra capacity from AWS for Christmas and Black Friday, and rely on your own servers exclusively for the rest of the year.

jsheard

2 hours ago

It's sad that what should have been a huge efficiency win, amortizing hardware costs across many customers, ended up often being more expensive than just buying big servers and letting them idle most of the time. Not to say the efficiency isn't there, but the cloud providers are pocketing the savings.

toomuchtodo

2 hours ago

If you want a compute co-op, build a co-op (think VCs building their own GPU compute clusters for portfolio companies). Public cloud was always about using marketing and the illusion of need for dev velocity (which is real for hypergrowth startups and such, just not nearly as prevalent as the zeitgeist would have you believe) to justify the eye-watering profit margin.

Most businesses have fairly predictable interactive workload patterns, and their batch jobs are not high priority and can be managed as such (with the usual scheduling and bin packing orchestration). Wikipedia is one of the top 10 visited sites on the internet, and they run in their own datacenter, for example. The FedNow instant payment system the Federal Reserve recently went live with still runs on a mainframe. Bank of America was saving $2B a year running their own internal cloud (although I have heard they are making an attempt to try to move to a public cloud).

My hot take is that public cloud was an artifact of ZIRP and cheap money, where speed and scale were paramount and cost an afterthought (Russ Hanneman's pre-revenue bit here, "get big fast and sell"; a great fit for cloud). With that macro over, and profitability over growth being the go-forward MO, the equation might change. Too early to tell, imho. Public cloud margins are their compute customers' opportunity.

miki123211

an hour ago

Wikipedia is often brought up in these discussions, but it's a really bad example.

To the vast majority of Wikipedia users, who are not logged in, all it needs to do is show (potentially pre-rendered) article pages with no dynamic, per-user content. Those pages are easy to cache or even offload to a CDN. For all the users care, it could be a giant key-value store mapping article slugs to HTML pages.

This simplicity allows them to keep costs down, and the low costs mean that they don't have to be a business and care about time-on-page, personalized article recommendations or advertising.

Other kinds of apps (like social media or messaging) have very different usage patterns and can't use this kind of structure.

toomuchtodo

an hour ago

> Other kinds of apps (like social media or messaging) have very different usage patterns and can't use this kind of structure.

Reddit can’t turn a profit, Signal is in financial peril. Meta runs their own data centers. WhatsApp could handle ~3M open TCP connections per server, running the operation with under 300 servers [1] and serving ~200M users. StackOverflow was running their Q&A platform off of 9 on prem servers as of 2022 [2]. Can you make a profitable business out of the expensive complex machine? That is rare, based on the evidence. If you’re not a business, you’re better off on Hetzner (or some other dedicated server provider) boxes with backups. If you’re down you’re down, you’ll be back up shortly. Downtime is cheaper than five 9s or whatever.

I’m not saying “cloud bad,” I’m saying cloud where it makes sense. And those use cases are the exception, not the rule. If you're not scaling to an event where you can dump these cloud costs on someone else (acquisition event), or pay for them yourself (either donations, profitability, or wealthy benefactor), then it's pointless. It's techno performance art or fancy make work, depending on your perspective.

[1] https://news.ycombinator.com/item?id=33710911

[2] https://www.datacenterdynamics.com/en/news/stack-overflow-st...

olieidel

5 hours ago

> I feel like in the span between them there are far more companies for which Kamal is not enough.

I feel like this is a bias in the HN bubble: In the real world, 99% of companies with any sort of web servers (cloud or otherwise) are running very boring, constant, non-Uber workloads.

ksec

2 hours ago

Not just HN but the whole internet in general, because all the news, articles, and tech achievements are pumped out by Uber and other big tech companies.

I am pretty sure Uber belongs to the top 1% of internet companies in terms of scale. 37signals isn't exactly small either: they spent $3M a year on infrastructure in 2019, and it's likely a lot higher now.

The whole tech cycle needs to stop taking a top-down approach where everyone does what big tech is using. Instead we should try to push the simplest tools from the low end all the way to the 95% mark.

nchmy

2 hours ago

They spend considerably less on infra now - this was the entire point of moving off cloud. DHH has written and spoken lots about it, providing real numbers. They bought their own servers and the savings paid for it all in like 6 months. Now it's just money in the bank until they replace the hardware in 5 years.

Cloud is a scam for the vast majority of companies.

toberoni

4 hours ago

I feel Uber is the outlier here. For every unicorn company there are 1000s of companies that don't need to scale to millions of users.

And due to the insane markup on many cloud services, it can make sense to just run beefier servers 24/7 to deal with the peaks. In my experience, crazy traffic outliers that need sophisticated auto-scaling rarely happen outside of VC-fueled growth trajectories.

appendix-rock

5 hours ago

You can’t talk about typical cases and then bring up Uber.

pinkgolem

5 hours ago

I mean, most B2B companies have a pretty predictable load when providing services to employees.

I can get weeks of advance notice before we have a load increase from new users.