Kafka Is Costing You Years of Engineering Time

3 points, posted 9 months ago

10 Comments

deniscoady

9 months ago

Disclaimer: I work for Redpanda and formerly Cloudera.

I've worked with Apache Kafka at massive (50+ Gbps) scales. It's a proper nightmare. When it breaks – it breaks fast and violently.

But the problem is that Apache Kafka (and more modern Kafka-compatible alternatives like Redpanda < obligatory mention) solve a need for a durable streaming log that other systems cannot offer. The access patterns, requirements, use cases, ecosystem, etc, are different from those of traditional databases and require a proper streaming solution.

Streaming from a traditional database is kinda a solved problem. Why not just use a managed Kafka provider with a change data capture (CDC) capability if you don't want to deal with Kafka yourself? At least then you get to use all of the tools in the vibrant Kafka ecosystem.

galeaspablo

9 months ago

Hey Denis, I haven’t run into you before. But hi, this is Luis, Ambar’s founder. Nice to meet a fellow data streamer.

When I started writing Ambar I thought streaming from a database was a solved problem. But in operational use cases where ordering and delivery guarantees are assumptions developers need, it isn’t a solved problem. The first version of Ambar was just Debezium under the hood, but guess what, it failed and failed hard. Like you described Kafka. Hence we built Ambar :)

FYI we’ve considered using Redpanda under the hood instead of Kafka, but didn’t dare make the jump yet.

deniscoady

9 months ago

Ah okay, so is Ambar more of a way to finally replace Debezium then?

galeaspablo

9 months ago

Yes, for operational use cases, e.g. event-driven microservice communication. Keep in mind we replace the sink as well, which allows us to do cool things such as https://ambar.cloud/blog/optimal-consumption-with-adaptive-l...

For analytics (e.g. copying your PG database to Snowflake), Debezium is still relevant.

Sphax

9 months ago

I've been using Kafka professionally for more than 10 years, since 0.8, when consumer groups didn't even exist yet. In my opinion this post exaggerates a lot of things to promote their product. We don't have giant clusters, but we routinely produce more than a million messages per second, so it's not a completely trivial load.

Configuration complexity: there are a couple of things we had to tune over the years, mainly regarding the log cleaner once we started leveraging compacted topics, but other than that it's pretty much the default config. Is it optimal? No, but it's fast enough. Hardware choice in my opinion is not really an issue: we started on HDDs and switched to SSDs later on, and the cluster continued working just fine with the same configuration.

Scaling, I'll grant, can be a pain. We had to scale our clusters mainly for two reasons: 1) more services want to use Kafka, so there are more topics and more data to serve. This is not that hard to scale: just add brokers to get more capacity. 2) a topic needs more partitions. We had to do this a couple of times over the years and it's annoying because the default tooling for data redistribution is bad. We ended up using a third party tool (today Cruise Control does this nicely).

Maintenance: yes, you need to monitor your stuff, just like any other system you deploy on your own hardware. Thankfully monitoring Kafka is not _that_ hard; there are ready made solutions to export the JMX monitoring data. We've used Prometheus (prometheus-jmx-exporter and node_exporter) almost since the beginning and it works fine. We're still using ZooKeeper, although thankfully that's no longer necessary. I just have to say our ZooKeeper clusters have been rock solid over the years.

Development overheads: I really can't agree with that. Yes, the "main" ecosystem is Java based, but it's not like librdkafka doesn't exist, and third party libraries are not all "sub par"; that's just a mischaracterization. We've used Go with sarama since 2014 and recently switched to franz-go: both work great. You do need to properly evaluate your options though (but that's part of your job). With that said, if I were to start from scratch I would absolutely suggest starting with Kafka Streams, even if your team doesn't have Java experience (I mean, learning Java isn't that hard), just because it makes building a data pipeline super straightforward and handles a lot of the complexities mentioned.

taylodl

9 months ago

Conspicuously missing from this article is any mention of an alternative. Kafka, bad. Alternative, what alternative?

erik_seaberg

9 months ago

https://docs.ambar.cloud/ says the author's org is polling PostgreSQL (or MySQL) tables and producing records for JSON/HTTP consumers. As for me, I think a Kafka broker quorum is hard to beat for a fast and durable distributed ringbuffer.

galeaspablo

9 months ago

We use Kafka under the hood. We stream instead of poll. We used to work on the Kafka team at AWS :)

Our thesis is that a big blocker is the PhD in Kafka that the whole team needs. For example, if you want to set up an API similar to Ambar with tools like Kafka Connect, the connectors have failure modes that will bite you once a year and bite you hard, e.g. losing your changelog in MySQL and having to start from scratch or risk losing ordering guarantees.

galeaspablo

9 months ago

Disclosure: this post is from my colleague.

A: There are managed vendors that simplify Kafka, such as MSK, Confluent, and Redpanda, as well as other software like Pulsar.

But we believe the solution to the time sink exposed in the article lies one level of abstraction higher. In the case of data analytics, there are tools/companies such as Decodable/Streamkap/Airbyte that simplify your life as an engineer.

In the case of operational streaming, we (Ambar) are making a bet on the tried and tested outbox/inbox pattern as a replacement for producing directly into Kafka et al. and thus having to manage all of its quirks and complexities. That’s the alternative we offer, but of course there are other folks in this space.
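
For readers unfamiliar with the outbox/inbox pattern, here is a minimal sketch in Python with SQLite. It illustrates the general pattern only and is not Ambar's implementation: the table names (`orders`, `outbox`, `inbox`), the function names, and the `deliver` callback are all hypothetical, and a real relay would usually run as a separate process against a production database.

```python
import json
import sqlite3


def setup_producer(conn):
    # Hypothetical schema: a business table plus an outbox table in the same database.
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,  -- defines the event order
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)


def setup_consumer(conn):
    # Hypothetical inbox table: remembers which events were already processed.
    conn.execute("CREATE TABLE inbox (seq INTEGER PRIMARY KEY)")


def place_order(conn, item):
    # Outbox: the business row and the event row commit in one transaction,
    # so an event exists if and only if the order exists.
    with conn:
        cur = conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        event = {"type": "order_placed", "order_id": cur.lastrowid, "item": item}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay(conn, deliver):
    # Forward unpublished events in seq order. Each row is marked published only
    # after deliver() returns, so a crash leads to redelivery, not loss.
    rows = conn.execute(
        "SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, payload in rows:
        deliver(seq, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


def handle_once(conn, seq, event, handler):
    # Inbox: record the event id and run the handler in one transaction.
    # A redelivered event hits the primary key, is ignored, and is skipped.
    with conn:
        cur = conn.execute("INSERT OR IGNORE INTO inbox (seq) VALUES (?)", (seq,))
        if cur.rowcount == 0:
            return False
        handler(event)
    return True


if __name__ == "__main__":
    producer_db = sqlite3.connect(":memory:")
    consumer_db = sqlite3.connect(":memory:")
    setup_producer(producer_db)
    setup_consumer(consumer_db)
    place_order(producer_db, "book")
    relay(producer_db, lambda seq, ev: handle_once(consumer_db, seq, ev, print))
```

The design choice worth noting is that the application never writes to a broker directly. It only writes to its own database, and ordering comes from the `seq` column, while duplicates caused by at-least-once delivery are absorbed on the consumer side by the inbox table.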

Admittedly, we didn’t dive deep into alternatives in the post. But we did explain at the end that we’ll cover it in another post. I’ll add a link at the bottom later pointing to some alternatives. :)

Thanks for reading!

jauntywundrkind

9 months ago

RisingWave, Redpanda, Apache Pulsar, Druid, and others might do in various cases.