I solved a distributed queue problem after 15 years

36 points | posted 5 days ago
by jedberg

19 Comments

na4ma4

2 days ago

The article is interesting, but what does the title have to do with it?

> What we really needed to make distributed task queueing robust are durable queues ...

> Durable queues were rare when I was at Reddit, but they’re more and more popular now.

It sounds like the answer was known at the time, but there weren't the resources to solve it?

jedberg

2 days ago

Back then, you couldn't build a performant queue in a database without a huge amount of resources, both hardware-wise and people-wise. That's why only huge enterprises were using them then.

Only recently (in the last decade or so) has the performance of an open-source database on modest hardware caught up to the alternatives.

belZaah

2 days ago

I was the architecture team lead at Skype from 2005 to 2011, and persistent queues were the only ones around. Basically, because they knew how to scale databases, our DBA team (Hannu Krosing comes to mind) built queues into and on top of Postgres. It wasn't just Skype, either: the eBay guys (Dan Prichett in particular) had built a very cool Oracle-based queue solution. Scaling persistent queues seems to be something that needs to be reinvented periodically. Maybe it's too nuanced of a problem?

jedberg

2 days ago

I actually worked on security for Skype in 2005/2006. And I knew Dan at eBay too.

Using databases for queues isn't a new idea, but a couple things make it different now. One is that I think this is the first time a solution has been open-sourced (not 100% sure on that).

And two, Postgres added SKIP LOCKED in 9.5 (around 2016), which was the performance unlock to make this work without needing a whole DBA team like Skype had.
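
For anyone who hasn't used it, the core of the SKIP LOCKED pattern is a single query. Here's a minimal sketch in Python with psycopg2, not how any particular system implements it, and the table and column names are invented for illustration:

    import psycopg2  # assumes a reachable Postgres with a `tasks` table

    DEQUEUE_SQL = """
        UPDATE tasks
           SET status = 'running'
         WHERE id = (SELECT id
                       FROM tasks
                      WHERE status = 'queued'
                      ORDER BY id
                      LIMIT 1
                        FOR UPDATE SKIP LOCKED)  -- skip rows other workers hold
        RETURNING id, payload;
    """

    def claim_one_task(conn):
        # Competing workers skip locked rows instead of blocking on them,
        # which is what lets many workers poll the same table cheaply.
        with conn, conn.cursor() as cur:
            cur.execute(DEQUEUE_SQL)
            return cur.fetchone()  # None when the queue is empty

With an index over the queued rows, each worker claims its own row without blocking its neighbors.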

fmajid

a day ago

Skytools/PgQ was amazing technology that made clever use of Postgres MVCC to provide a high performance transactional message queue. But yes, it was not low-touch to administer.

arthurcolle

2 days ago

Is this referring to Dan Tarman?

sshine

11 hours ago

GP wrote Dan Prichett.

BSVogler

2 days ago

This article resonated with my experience building what was essentially a distributed task queue using Redis + PostgreSQL with Python workers on Kubernetes. It seems like these systems naturally evolve different patterns based on their specific use cases. The logic of our queue was intertwined with a rule engine; I wrote about building that rule engine here [1]. Another difference from this article is that ours did not report back to the client, since events were delivered via webhooks.

There are some differences here and there that come from making it application-specific, e.g. we added a periodic reconciliation check. I also built a debouncer into the queue to give special treatment to bursts in the load (a rough sketch of that idea is below).

[1] https://blog.benediktsvogler.com/blog/building-a-distributed...
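
Not from the linked post, just to illustrate one simple (leading-edge) variant of the debouncer idea mentioned above: accept the first submission for a key and swallow repeats within a short window before they ever hit the queue. Names here are made up:

    import time

    class Debouncer:
        # Coalesces repeated submissions of the same key within a window,
        # so a burst of identical events enqueues only one task.
        def __init__(self, enqueue, window_seconds=5.0):
            self.enqueue = enqueue    # downstream enqueue callable
            self.window = window_seconds
            self._last_accepted = {}  # key -> monotonic timestamp

        def submit(self, key, payload):
            now = time.monotonic()
            last = self._last_accepted.get(key)
            if last is not None and now - last < self.window:
                return False          # swallow the burst duplicate
            self._last_accepted[key] = now
            self.enqueue(key, payload)
            return True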

asplake

2 days ago

I know from firsthand experience that there were investment banks doing this at least as early as the mid-1990s, on top of Sybase in the particular case I have in mind. Perhaps not the best performance, but it was trivially easy to inspect, and you could integrate it with other database features, stored procedures for example.

fmajid

a day ago

Back when I worked for France Telecom circa 1996–99, Sybase Replication Server was the fastest message queue we tested for transporting billing events, by a wide margin, in a way that predated Postgres logical replication by two decades.

noident

2 days ago

Hi, OP. Any thoughts on why you'd prefer using your OLTP database for queueing as opposed to Kafka? In my mind it would be a drop in ease of observability (though there is still KSQL and a number of UI wrappers) in exchange for much better performance. I'm interested in whether you had other reasons.

KraftyOne

a day ago

Observability is a big advantage. Another advantage (in the context of DBOS specifically) is integration with durable workflows, so you can write a large workflow that enqueues and manages many smaller tasks.

redditor98654

2 days ago

From what I recall, Reddit uses AWS extensively. Could they not have replaced RabbitMQ with SQS? You get near-unlimited horizontal scalability, extremely good uptime, guaranteed at-least-once message delivery, and, in the case of a worker crash, the messages become visible again after the visibility timeout (since the worker wouldn't have deleted them).
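
For reference, the visibility-timeout mechanics that make this safe look roughly like this with boto3 (the queue URL and handler below are placeholders):

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder

    def process(body):
        print("processing", body)  # stand-in for real work

    def work_once():
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=10,    # long poll
            VisibilityTimeout=60,  # hide the message from other workers for 60s
        )
        for msg in resp.get("Messages", []):
            process(msg["Body"])
            # Delete only after the work succeeds. If the worker crashes first,
            # the message becomes visible again after the timeout and is
            # redelivered to another worker (at-least-once delivery).
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])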

jedberg

2 days ago

SQS could not handle the volume or latency requirements we had, and it was too expensive compared to running it ourselves at the time.

4ndrewl

2 days ago

I think there is a hard limit on the number of in-flight requests (that is, items that have been dequeued by a worker but whose job has not been completed). I wouldn't be surprised if Reddit hit those sorts of volumes.

_1tan

10 hours ago

We're a MySQL/Hibernate shop. Anything similar possible with that stack?

yencabulator

a day ago

Utterly content-free marketing blogspam.