Why provide a non-compressed version at all? This is a new protocol, so there's no need for backwards compatibility. The dictionary could be baked into the protocol itself, fixed per version: protocol v1 uses the fixed v1 dictionary. That would also be useful for replaying stored events on both sides.
Jetstream isn't an official change to the Protocol, it's an optimization I made for my own services that I realized a lot of other devs would appreciate. The major driving force behind it was both the bandwidth savings but also making the Firehose a lot easier to use for devs that aren't familiar with AT Proto and MSTs. Jetstream is a much more approachable way for people to dip their toe into my favorite part of AT Proto: the public event stream.
As I understand the article, there are two new APIs (unofficial, for now):
- Jetstream (1) (no compression) and
- Jetstream (2) (zstd compression).
My point is that (1) isn't really needed, except in some specific scenarios.
It's impossible to use the compressed version of the stream without using a client that has the baked-in ZSTD dictionary. This is a usability issue for folks using languages without a Jetstream client who just want to consume the websocket as JSON. It also makes things like using websocat and unix pipes to build some kind of automation a lot harder (though probably not impossible).
FWIW the default mode is uncompressed unless the client explicitly requests compression with a custom header. I tried using per-message-deflate but the support for it in the websocket libraries I was using was very poor and it has the same problem as streaming compression in terms of CPU usage on the Jetstream server.
> It also makes things like using websocat and unix pipes to build some kind of automation a lot harder
Would anybody realistically be using those tools with this volume of data, for anything but testing?
A non-compressed version is almost certainly cheaper for anything local (ex self-hosting your own services that consume the firehose on the same machine or for testing).
There's not really a good reason to do compression if the stream is just going to be consumed locally. Instead you can skip that step and broadcast over memory to the other local services.
It could be a flag, disabled by default. Also, I'm not sure about the "cheaper" part: disk ops aren't free, so decompressing zstd may well be cheaper than writing and reading huge blobs from disk and exchanging them between apps.
This isn’t compression; they are throwing features of the original stream out.
What we're comparing here is the output: plain JSON versus the same JSON zstd-compressed.
I gotta say, I am not very excited about "let's throw away all the security properties for performance!" (and also "CBOR is too hard!")
If everyone is on one server (remains to be seen), and all the bots blindly trust it because they are cheap and lazy, what the hell is the point?
Centralization on trusted servers is going to happen, but if they speak a common protocol, at least they can be swapped out. For Jetstream, anyone can run an instance, though it will cost them more.
It’s sort of like the right to fork in Open Source; it doesn’t mean people fork all the time or verify every line of code themselves. There’s still trust involved.
I wonder if some security features could be added back, though?
If you’re going to try data reduction and compression, always try compression first. It may reveal that the 10x reduction you were looking at is only 2x and not worth the trouble.
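A concrete version of that advice: before designing any custom reduction scheme, measure what off-the-shelf compression already buys you on a representative sample. A quick sketch (the events below are toy data, not real firehose frames):

```python
import json
import zlib

# Build a sample of repetitive JSON events, structurally similar
# to what a social-media event stream emits (hypothetical shape).
sample = json.dumps([
    {"did": f"did:plc:user{i}", "kind": "commit",
     "collection": "app.bsky.feed.post", "text": "hello world"}
    for i in range(100)
]).encode()

# One zlib call tells you the baseline before any custom work.
compressed = zlib.compress(sample, level=9)
ratio = len(sample) / len(compressed)
print(f"{ratio:.1f}x")  # repetitive JSON compresses very well
```

If the generic ratio is already close to what your hand-rolled reduction would achieve, the reduction may not be worth the complexity.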
Trying reduction first may show that compression is less useful. A verbose, human-friendly protocol plus compression wins out in maintenance tasks, and it's a marathon, not a sprint.
We've seen this over and over. If you do things the "right" way devs just don't show up because it's too much work.
Or: why can't one verify a message on its own, isolated from all of the other events on the PDS?
The full Firehose provides two major verification features. First it includes a signature that can be validated letting you know the updates are signed by the repo owner. Second, by providing the MST proof, it makes it hard or impossible for the repo owner to omit any changes to the repo contents in the Firehose events. If some records are created or deleted without emitting events, the next event emitted will show that something's not right and you should re-sync your copy of the repo to understand what changed.
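The tamper-evidence half of that can be illustrated with a toy Merkle tree (plain SHA-256 over byte strings; the real AT Proto MST is a keyed search tree over CBOR-encoded records, so this is only an analogy): if the server silently drops a record, the recomputed root no longer matches.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold a list of records into a single root hash."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

records = [b"post:hello", b"like:abc", b"follow:bob"]
root = merkle_root(records)

# A consumer that re-derives the root from its own copy of the repo
# notices if a record was silently omitted upstream:
tampered = merkle_root([b"post:hello", b"follow:bob"])
print(root != tampered)  # True: the omission is detectable
```

Jetstream drops exactly this kind of check, which is the trade-off being discussed.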
The "bring it all home" screenshot shows a CPU Utilization graph, and the units of measurements on the vertical axis appears to be milliseconds. Could someone help me understand what that measurement might be?
The graph is labeled "CPU seconds per second", i.e. CPU time consumed per wall-clock second — so milliseconds on the axis mean the process is using a few thousandths of one core.
Nice feat!
I wonder if a rewrite of this in C++ would bump performance even further and optimise the overall system.
Given that “AT Protocol” already has a definition in IT that’s as old as OP’s grandma, what is this AT Protocol they are talking about here?
Introduce your jargon before expositing, please.
I wondered something similar when I clicked the link: "who is still using enough AT commands that a compressed representation would matter, and how would you even DO that?" But this is clearly something else.
Anyone who writes software that uses GSM modems for example. Like in embedded systems.
Oh, for sure - I've done some of that myself - but I would never associate the word "firehose" with such low-powered systems!
Iridium satellite modems too.
I’m shocked those things are still up there. That project tried to fail so many times.
And then I had to look up CBOR too, which at least is a thing Wikipedia has heard of. I mostly use compressed wire protocols and ignore the flavor of the month binary representations.
The article kind of assumes you know what it is in order to be interested in it, but it's the protocol used by Bluesky instead of ActivityPub.
Which makes it a bad submission for HN. If you want exposure, prepare for it.
you sound like you're a blast to be around
You mean the Hayes AT command set? We didn't call it a protocol back in the day.
> Bluesky recently saw a massive spike in activity in response to Brazil’s ban of Twitter.
If Jack Dorsey being involved from the beginning isn't enough proof it's the same company as Twitter, the first line of this press release covers one of Twitter's latest PR stunts: "The Brazil Ban". Two things:
1. Twitter is intentionally being wound down because a platform that free (see: Arab Spring) cannot exist - it's seen only as a threat to the military-media complex. Know your politics - only those who were active in startups in ~2010-14 know this (Gaddafi, Aaron Swartz, Ross Ulbricht). Things have changed since then guys - tech lost. Elon Musk is an actor/figurehead, and every decision made at Twitter is intentional to kill the platform while scraping as many dollars out of it as they can (even at its peak as a free speech platform it was rarely profitable as a company). I don't believe Bluesky is getting a ton of traffic from Twitter as they hoped it would because Bluesky looks like a 3rd party Twitter app (junky af) and has the same employees and none of the free speech or interesting celebrities/influencers Twitter had at its peak. With press releases like this one mentioning the "Brazil Ban" it just seems like Bluesky is merely a vassal - another aspect of the winding down of Twitter. I consider Bluesky founders to be scam artists and liars.
2. Bluesky is a hoax of a decentralized service in the way that Ethereum is a hoax chain. It's not really decentralized - they have total control over the information flow and always have.
To me this news is just like Ethereum's "The Merge" which also promised lower bandwidth, energy usage - computing. But it did this by cutting out most of the public contributors to the ledger and allowing only a few to contribute - thereby centralizing it and defeating the entire purpose to begin with.
Bluesky is a hoax, can't believe anyone falls for this junk. Their app looks terrible too. Innovate, make something people want.
>Before this new surge in activity, the firehose would produce around 24 GB/day of traffic. After the surge, this volume jumped to over 232 GB/day!
>Jetstream is a streaming service that consumes an AT Proto com.atproto.sync.subscribeRepos stream and converts it into lightweight, friendly JSON.
So let me get this straight: if you did want to run Jetstream yourself, you'd still need to be able to handle the 232 GB/day of bandwidth?
This has always been my issue with Bluesky/AT Protocol. For all the talk about the protocol being federated, it really doesn't seem realistic for anyone to run any of the infrastructure themselves. You're always going to be reliant on a big player that has the capital to keep everything running smoothly. At this point I don't really see how it's any different than being on any of the old centralized social media.
Old social media never gave full access to the firehose so there’s a pretty big difference.
If you want large scale social networks, you need to work with a large scale of data. Since federated open queries aren’t feasible, you need big machines.
If you want a smaller scale view of the network, do a crawl of a subset of the users. That’s a perfectly valid usage of atproto, and is how ActivityPub works by nature.
>Old social media never gave full access to the firehose so there’s a pretty big difference.
That is good, but it's still a centralized source of truth.
>If you want large scale social networks, you need to work with a large scale of data. Since federated open queries aren’t feasible, you need big machines.
That's simply not true. ActivityPub does fine without any bulky machine or node acting as a relay for the rest of the network. Every ActivityPub service only ever interacts with other discovered services. Messages aren't broadcast through a central firehose; they're sent directly to whoever needs to receive them. This is a fundamental difference in how the two protocols work. With AT Proto you NEED to connect to some centralized relay that will broker your messages for you. With ActivityPub there is no middleman; instances just talk directly to each other. This is why ActivityPub has a discovery problem, by the way, but that's just a symptom of real federation.
>and is how ActivityPub works by nature.
It's not. See above.
Based on the article, OP runs his Jetstream instance, with 12 consumers (subsets of the full stream, if I understand correctly), on a $5 VPS on OVH.
That's only about 2.7 MB/s on average.
If someone wants to run a server, they probably would pay for a VPS with a gigabit connection, which would be able to do 120 MB/s.
You might need to pay for extra bandwidth, but it is probably less than a night out every month.
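The arithmetic behind those numbers, as a quick sanity check (using the article's 232 GB/day figure):

```python
# 232 GB/day of firehose traffic, averaged over a day
gb_per_day = 232
bytes_per_day = gb_per_day * 1e9
mb_per_sec = bytes_per_day / 86_400 / 1e6

print(f"{mb_per_sec:.2f} MB/s")  # ~2.69 MB/s average

# Headroom on a gigabit link (~125 MB/s theoretical ceiling)
print(f"{125 / mb_per_sec:.0f}x headroom")
```

Peaks will be well above the average, but the baseline really is modest for a dedicated server.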
I thought this was about BlueSky using NATS!
Why is this being downvoted? Seems like a valid concern to raise if you find two pieces of software somewhat having the same functionality.
It has the same name, not the same functionality. I am not a downvoter... but it's probably because reading a few sentences of this blog post would reveal what it is.
Perhaps the complaint is more about project namers spending zero time checking for uniqueness.
A problem shared with AT as well.
I thought exactly the same
I'm never not going to look for Hayes command set topics when people talk about BlueSky.
You're not the only one. I don't get why they couldn't have named it something that wasn't very similar to something already around for several decades, or at least insist on the shortened ATproto name (one word, lower case p). Sure, in practice, no one will actually confuse them, but that could be said for Java and JavaScript.
Was expecting Nats Jetstream but this is also cool
Was expecting the Hayes modem command language.
Server-Sent Events (SSE) with standard gzip compression could be a simpler solution -- or maybe I'm missing something about the websocket + zstd approach.
SSE benefits:
- standard HTTP protocol
- built-in gzip compression
- simpler client implementation
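On the "simpler client implementation" point: the SSE wire format is genuinely easy to parse by hand. A minimal sketch (ignores `event:`/`id:` fields and retry logic, so it's incomplete by design):

```python
def parse_sse(stream: str) -> list[str]:
    """Extract the data payload of each event from raw SSE text.

    Events are separated by blank lines; an event may span multiple
    `data:` lines, which are joined with newlines per the spec.
    """
    events = []
    for chunk in stream.split("\n\n"):
        data_lines = [line[5:].lstrip() for line in chunk.split("\n")
                      if line.startswith("data:")]
        if data_lines:
            events.append("\n".join(data_lines))
    return events

raw = "data: {\"kind\":\"commit\"}\n\ndata: first\ndata: second\n\n"
print(parse_sse(raw))  # ['{"kind":"commit"}', 'first\nsecond']
```

A websocket client, by contrast, needs a framing/masking implementation, which is why it's always a dedicated library.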
Well-configured zstd can save a lot of bandwidth over gzip at this scale without major performance impact, especially with the custom dictionary. Initialising zstd with a custom dictionary also isn't very difficult for the client side.
As for application development, I think websocket APIs are generally exposed much better and are easier to use than SSE. I agree that SSE is a more appropriate technology here, but it's used so little that the tooling isn't good. Just about every language has a dedicated websocket client library, but SSE is usually implemented as a weird side effect of an HTTP connection you need to keep alive manually.
The stored ZSTD objects make sense, as you only need to compress once rather than compress for every stream (as the author details). It also helps store the data collected more efficiently on the server side if that's what you want to do.
I don't understand SSE in depth, but one of the points the post argues for is compressing once (using the zstd dictionary) and sending the same compressed bytes to every client.
The dictionary allows for better compression of small messages without needing a large window of prior data, and sending every client the same compressed binary data saves a lot of CPU time. Per-connection streaming compression usually means running the compressor separately for each client.
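The shared-dictionary idea can be demonstrated with stdlib zlib's `zdict` parameter (the real Jetstream uses a trained zstd dictionary, which does this far better; the dictionary bytes below are a hand-rolled stand-in, not the actual one):

```python
import zlib

# A "dictionary" of byte strings we expect to recur in every event.
# (Jetstream ships a trained zstd dictionary; this is a toy substitute.)
zdict = b'{"did":"did:plc:","kind":"commit","collection":"app.bsky.feed.post"'

msg = b'{"did":"did:plc:abc123","kind":"commit","collection":"app.bsky.feed.post"}'

def compress(data: bytes, dictionary: bytes = b"") -> bytes:
    c = zlib.compressobj(zdict=dictionary) if dictionary else zlib.compressobj()
    return c.compress(data) + c.flush()

plain = compress(msg)
with_dict = compress(msg, zdict)
print(len(msg), len(plain), len(with_dict))  # dictionary version is smallest

# The receiver must have the same dictionary baked in to decompress,
# which is exactly the usability complaint raised upthread:
d = zlib.decompressobj(zdict=zdict)
assert d.decompress(with_dict) == msg
```

Short messages barely compress on their own; with the shared dictionary the compressor can back-reference into it, which is why it helps so much at this message size.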