Jetstream: Shrinking the AT Protocol Firehose by >99%

101 points, posted 6 hours ago
by keybits

53 Comments

out_of_protocol

5 hours ago

Why provide a non-compressed version at all? This is a new protocol, so there's no need for backwards compatibility. The dictionary could be baked into the protocol itself, fixed for a specific version, e.g. protocol v1 uses the fixed v1 dictionary. That would also be useful for replaying stored events on both sides.

ericvolp12

an hour ago

Jetstream isn't an official change to the Protocol; it's an optimization I made for my own services that I realized a lot of other devs would appreciate. The major driving forces behind it were the bandwidth savings and making the Firehose a lot easier to use for devs who aren't familiar with AT Proto and MSTs. Jetstream is a much more approachable way for people to dip their toe into my favorite part of AT Proto: the public event stream.
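
For a sense of how approachable that is, here's a minimal sketch of tailing a Jetstream-style endpoint with the Python websockets package. The URL and the printed event shape are placeholders, not the real public endpoint; check the Jetstream README for those.

```python
# Minimal sketch: tail a Jetstream-style websocket and treat each frame as JSON.
# The URL below is a placeholder; the real public endpoints are listed in the
# Jetstream README.
import asyncio
import json

import websockets  # pip install websockets


async def tail(url: str = "wss://example-jetstream-host/subscribe") -> None:
    async with websockets.connect(url) as ws:
        async for frame in ws:
            event = json.loads(frame)  # each websocket message is one JSON event
            print(event)               # event shape is documented in the Jetstream repo


asyncio.run(tail())
```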

out_of_protocol

an hour ago

As I understand the article, there are two new APIs (unofficial, for now):

- Jetstream(1) (no compression) and

- Jetstream(2) (zstd compression).

And my comment means (1) isn't really needed, except in some specific scenarios.

ericvolp12

27 minutes ago

It's impossible to use the compressed version of the stream without using a client that has the baked-in ZSTD dictionary. This is a usability issue for folks using languages without a Jetstream client who just want to consume the websocket as JSON. It also makes things like using websocat and unix pipes to build some kind of automation a lot harder (though probably not impossible).

FWIW the default mode is uncompressed unless the client explicitly requests compression with a custom header. I tried using permessage-deflate, but the support for it in the websocket libraries I was using was very poor, and it has the same problem as streaming compression in terms of CPU usage on the Jetstream server.
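
For what the dictionary requirement looks like on the consumer side, here's a minimal sketch with the Python zstandard bindings; the dictionary file name is a stand-in for however the baked-in dictionary is actually shipped.

```python
# Sketch of the client side of the compressed mode, using the Python
# `zstandard` bindings. "jetstream_dict.bin" is a hypothetical local copy of
# the dictionary the server compresses with.
import zstandard

with open("jetstream_dict.bin", "rb") as f:
    zdict = zstandard.ZstdCompressionDict(f.read())

dctx = zstandard.ZstdDecompressor(dict_data=zdict)


def decode_frame(payload: bytes) -> bytes:
    # A fresh decompressobj per frame copes with frames that omit the content
    # size in their header. Without the exact dictionary the server used,
    # this step fails -- which is the usability issue described above.
    return dctx.decompressobj().decompress(payload)
```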

slantedview

2 minutes ago

> It also makes things like using websocat and unix pipes to build some kind of automation a lot harder

Would anybody realistically be using those tools with this volume of data, for anything but testing?

jacoblambda

4 hours ago

A non-compressed version is almost certainly cheaper for anything local (e.g. self-hosting your own services that consume the firehose on the same machine, or for testing).

There's not really a good reason to do compression if the stream is just going to be consumed locally. Instead you can skip that step and broadcast over memory to the other local services.

out_of_protocol

3 hours ago

It could be a flag, normally disabled. Also, I'm not sure about the "cheaper" side: disk ops are not free, so maybe decompressing zstd IS cheaper than writing and reading huge blobs from disk and exchanging info between apps.

cowsandmilk

2 hours ago

This isn't compression; they are throwing features of the original stream out.

out_of_protocol

an hour ago

We're discussing the compiled output here: plain JSON, and the same JSON but zstd-compressed.

Ericson2314

4 hours ago

I gotta say, I am not very excited about "let's throw away all the security properties for performance!" (and also "CBOR is too hard!")

If everyone is on one server (remains to be seen), and all the bots blindly trust it because they are cheap and lazy, what the hell is the point?

skybrian

an hour ago

Centralization on trusted servers is going to happen, but if they speak a common protocol, at least they can be swapped out. For Jetstream, anyone can run an instance, though it will cost them more.

It’s sort of like the right to fork in Open Source; it doesn’t mean people fork all the time or verify every line of code themselves. There’s still trust involved.

I wonder if some security features could be added back, though?

hinkley

an hour ago

If you’re going to try data reduction and compression, always try compression first. It may reveal that the 10x reduction you were looking at is only 2x and not worth the trouble.

Reduction first may make the compression look less useful. Verbose, human-friendly protocols, compressed, win out for maintenance tasks, and it's a marathon, not a sprint.
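
To make that concrete, here's a quick sketch with the Python zstandard bindings for checking how much plain compression already buys you; "sample_events.json" is a hypothetical capture of the stream you're thinking of slimming down.

```python
# Measure how much plain compression already buys you before investing in a
# slimmed-down format. "sample_events.json" is a hypothetical capture of the
# stream under consideration.
import zstandard

with open("sample_events.json", "rb") as f:
    raw = f.read()

compressed = zstandard.ZstdCompressor(level=9).compress(raw)
print(f"plain zstd: {len(raw) / len(compressed):.1f}x smaller")
# If this ratio is already close to the reduction you hoped for, the custom
# reduced format may not be worth the maintenance cost.
```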

wmf

27 minutes ago

We've seen this over and over. If you do things the "right" way, devs just don't show up because it's too much work.

evbogue

2 hours ago

Or why can't one verify a message on its own, isolated from all of the other events on the PDS?

ericvolp12

an hour ago

The full Firehose provides two major verification features. First, it includes a signature that can be validated, letting you know the updates are signed by the repo owner. Second, by providing the MST proof, it makes it hard or impossible for the repo owner to omit any changes to the repo contents from the Firehose events. If some records are created or deleted without emitting events, the next event emitted will show that something's not right and that you should re-sync your copy of the repo to understand what changed.
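
As a rough illustration of the first check only: verifying a commit boils down to a signature check against a key the repo owner has published. The sketch below uses the Python cryptography package and shows only the general shape of that step, not the actual atproto data model or key handling.

```python
# Generic shape of a commit signature check (illustrative only; real atproto
# commits are DAG-CBOR objects signed with a key from the owner's DID document).
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec


def commit_is_authentic(pubkey: ec.EllipticCurvePublicKey,
                        signature: bytes,
                        signed_bytes: bytes) -> bool:
    """Return True if `signature` over `signed_bytes` verifies against `pubkey`."""
    try:
        pubkey.verify(signature, signed_bytes, ec.ECDSA(hashes.SHA256()))
        return True
    except InvalidSignature:
        return False
```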

pohl

4 hours ago

The "bring it all home" screenshot shows a CPU Utilization graph, and the unit of measurement on the vertical axis appears to be milliseconds. Could someone help me understand what that measurement might be?

anamexis

4 hours ago

The graph is labeled: CPU seconds per second.

pohl

4 hours ago

Missed that, thank you.

madduci

3 hours ago

Nice feat!

I wonder if a rewrite of this in C++ would bump the performance even further and optimise the overall system.

szundi

17 minutes ago

Or in rust haha

hinkley

an hour ago

Given that “AT Protocol” already has a definition in IT that’s as old as OP’s grandma, what is this AT Protocol they are talking about here?

Introduce your jargon before expositing, please.

marssaxman

an hour ago

I wondered something similar when I clicked the link: "who is still using enough AT commands that a compressed representation would matter, and how would you even DO that?" But this is clearly something else.

vardump

an hour ago

Anyone who writes software that uses GSM modems for example. Like in embedded systems.

marssaxman

7 minutes ago

Oh, for sure - I've done some of that myself - but I would never associate the word "firehose" with such low-powered systems!

jpm_sd

39 minutes ago

Iridium satellite modems too.

hinkley

37 minutes ago

I’m shocked those things are still up there. That project tried to fail so many times.

hinkley

39 minutes ago

And then I had to look up CBOR too, which at least is a thing Wikipedia has heard of. I mostly use compressed wire protocols and ignore the flavor of the month binary representations.

gs17

an hour ago

The article kind of assumes you know what it is in order to be interested in it, but it's the protocol used by Bluesky instead of ActivityPub.

hinkley

an hour ago

Which makes it a bad submission for HN. If you want exposure, prepare for it.

skrtskrt

2 minutes ago

you sound like you're a blast to be around

wmf

25 minutes ago

You mean the Hayes AT command set? We didn't call it a protocol back in the day.

bschmidt1

an hour ago

> Bluesky recently saw a massive spike in activity in response to Brazil’s ban of Twitter.

If Jack Dorsey being involved from the beginning isn't enough proof it's the same company as Twitter, the first line of this press release covers one of Twitter's latest PR stunts: "The Brazil Ban". Two things:

1. Twitter is intentionally being wound down because a platform that free (see: Arab Spring) cannot exist - it's seen only as a threat to the military-media complex. Know your politics - only those who were active in startups in ~2010-14 know this (Gaddafi, Aaron Swartz, Ross Ulbricht). Things have changed since then guys - tech lost. Elon Musk is an actor/figurehead, and every decision made at Twitter is intentional to kill the platform while scraping as many dollars out of it as they can (even at its peak as a free speech platform it was rarely profitable as a company). I don't believe Bluesky is getting a ton of traffic from Twitter as they hoped it would because Bluesky looks like a 3rd party Twitter app (junky af) and has the same employees and none of the free speech or interesting celebrities/influencers Twitter had at its peak. With press releases like this one mentioning the "Brazil Ban" it just seems like Bluesky is merely a vassal - another aspect of the winding down of Twitter. I consider Bluesky founders to be scam artists and liars.

2. Bluesky is a hoax of a decentralized service in the way that Ethereum is a hoax chain. It's not really decentralized - they have total control over the information flow and always have.

To me this news is just like Ethereum's "The Merge", which also promised lower bandwidth, energy usage, and computing. But it did this by cutting out most of the public contributors to the ledger and allowing only a few to contribute, thereby centralizing it and defeating the entire purpose to begin with.

Bluesky is a hoax, can't believe anyone falls for this junk. Their app looks terrible too. Innovate, make something people want.

S0y

2 hours ago

>Before this new surge in activity, the firehose would produce around 24 GB/day of traffic. After the surge, this volume jumped to over 232 GB/day!

>Jetstream is a streaming service that consumes an AT Proto com.atproto.sync.subscribeRepos stream and converts it into lightweight, friendly JSON.

So let me get this straight: if you did want to run Jetstream yourself, you'd still need to be able to handle the 232 GB/day of bandwidth?

This has always been my issue with Bluesky/AT Protocol. For all the talk about their protocol being federated, it really doesn't seem realistic for anyone to run any of the infrastructure themselves. You're always going to be reliant on a big player that has the capital to keep everything running smoothly. At this point I don't really see how it's any different than being on any of the old centralized social media.

pfraze

2 hours ago

Old social media never gave full access to the firehose so there’s a pretty big difference.

If you want large scale social networks, you need to work with a large scale of data. Since federated open queries aren’t feasible, you need big machines.

If you want a smaller scale view of the network, do a crawl of a subset of the users. That’s a perfectly valid usage of atproto, and is how ActivityPub works by nature.

S0y

41 minutes ago

>Old social media never gave full access to the firehose so there’s a pretty big difference.

That is good, but it's still a centralized source of truth.

>If you want large scale social networks, you need to work with a large scale of data. Since federated open queries aren’t feasible, you need big machines.

That's simply not true. ActivityPub does perfectly well without any bulky machine or node acting as a relay for the rest of the network. Every single ActivityPub service only ever interacts with other discovered services. Messages aren't broadcast through a central firehose; they're sent directly to whoever needs to receive them. This is a fundamental difference in how the two protocols work. With ATProto you NEED to connect to some centralized relay that will broker your messages for you. With ActivityPub there is no middle man; instances just talk directly to each other. This is why ActivityPub has a discovery problem, by the way, but that's just a symptom of real federation.

>and is how ActivityPub works by nature.

It's not. See above.

PhilippGille

2 hours ago

Based on the article, OP runs his Jetstream instance with 12 consumers (subsets of the full stream, if I understand correctly) on a $5 VPS on OVH.

CyberDildonics

2 hours ago

That's only about 2.7 MB/s on average.

If someone wants to run a server, they probably would pay for a VPS with a gigabit connection, which would be able to do 120 MB/s.

You might need to pay for extra bandwidth, but it is probably less than a night out every month.
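
Back-of-the-envelope version of that math, assuming decimal GB/MB:

```python
# Rough sanity check of the numbers in this thread (decimal GB/MB assumed).
firehose_gb_per_day = 232
seconds_per_day = 24 * 60 * 60                       # 86,400

avg_mb_per_s = firehose_gb_per_day * 1000 / seconds_per_day
print(f"{avg_mb_per_s:.1f} MB/s average")            # ~2.7 MB/s

gigabit_mb_per_s = 1000 / 8                          # ~125 MB/s line rate
print(f"{avg_mb_per_s / gigabit_mb_per_s:.0%} of a gigabit link")  # ~2%
```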

gooseus

5 hours ago

I thought this was going to be about NATS Jetstream, but it is not.

https://docs.nats.io/nats-concepts/jetstream

Kinrany

2 hours ago

I thought this was about BlueSky using NATS!

fakwandi_priv

4 hours ago

Why is this being downvoted? It seems like a valid concern to raise if you find two pieces of software with somewhat the same functionality.

kylecazar

4 hours ago

It has the same name, not the same functionality. I am not a downvoter... but it's probably because reading a few sentences of this blog post would reveal what it is.

dboreham

4 hours ago

Perhaps the complaint is more about project namers spending zero time checking for uniqueness.

gs17

an hour ago

A problem shared with AT as well.

xbar

2 hours ago

I'm never not going to look for Hayes command set topics when people talk about BlueSky.

gs17

2 hours ago

You're not the only one. I don't get why they couldn't have named it something that wasn't very similar to something already around for several decades, or at least insist on the shortened ATproto name (one word, lower case p). Sure, in practice, no one will actually confuse them, but that could be said for Java and JavaScript.

scirob

4 hours ago

Was expecting Nats Jetstream but this is also cool

vundercind

4 hours ago

Was expecting the Hayes modem command language.

JoshMandel

4 hours ago

Server-Sent Events (SSE) with standard gzip compression could be a simpler solution -- or maybe I'm missing something about the websocket + zstd approach.

SSE benefits:

- Standard HTTP protocol

- Built-in gzip compression

- Simpler client implementation

jeroenhd

4 hours ago

Well-configured zstd can save a lot of bandwidth over gzip at this scale without major performance impact, especially with the custom dictionary. Initialising zstd with a custom dictionary also isn't very difficult for the client side.

As for application development, I think websocket APIs are generally exposed much better and are much easier to use than SSEs. I agree that SSEs are a more appropriate technology to use here, but they're used so little that I don't think the tooling is good. Just about every language has a dedicated websocket client library, but SSEs are usually implemented as a weird side effect of an HTTP connection you need to keep alive manually.

The stored ZSTD objects make sense, as you only need to compress once rather than compress for every stream (as the author details). It also helps store the data collected more efficiently on the server side if that's what you want to do.

qixxiq

4 hours ago

I don't have an in-depth understanding of SSE, but one of the points the post is arguing for is compressing once (using the zstd dictionary) and sending that to every client.

The dictionary allows for better compression without needing a large amount of data, and sending every client the same compressed binary data saves a lot of CPU time on compression. Streams usually require running the compression for each client.
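
A minimal sketch of that compress-once fan-out with the Python zstandard bindings; the dictionary file and the subscriber API are illustrative, not Jetstream's actual internals.

```python
# Compress each event once with a shared dictionary, then reuse the same
# bytes for every subscriber (illustrative names; not Jetstream's real code).
import zstandard

with open("jetstream_dict.bin", "rb") as f:
    zdict = zstandard.ZstdCompressionDict(f.read())

cctx = zstandard.ZstdCompressor(dict_data=zdict)


def broadcast(event_json: bytes, subscribers) -> None:
    frame = cctx.compress(event_json)  # one compression pass per event...
    for ws in subscribers:
        ws.send(frame)                 # ...and the same frame for every client
```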