OpenTelemetry collector: What it is, when you need it, and when you don't

119 points, posted 5 months ago

40 Comments

bfors

5 months ago

I happen to work with otel a lot so I'll offer a few of my thoughts:

- Consider decoupling your collector from whatever is consuming your traces with something like Kafka. Traces can be pretty heavy and it can be tricky to scale collectors. If something goes down, it's probably a good idea to continue writing the traces to a queue or topic.

- https://www.otelbin.io is a nice little tool to help with collector configuration
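To make the decoupling point above concrete, here is a minimal sketch, assuming an in-memory `queue.Queue` as a stand-in for a Kafka topic. The class and method names (`BufferedPipeline`, `collector_export`, `consumer_drain`) are hypothetical and not part of any collector or Kafka API; the only thing it tries to show is that the collector keeps writing while the backend is down and the backlog is replayed afterwards.

```python
import queue


class BufferedPipeline:
    """Toy stand-in for: collector -> topic -> trace backend."""

    def __init__(self):
        # Stand-in for a Kafka topic (assumption); a real topic is durable.
        self.topic = queue.Queue()
        self.backend_up = True
        self.stored = []  # spans the hypothetical backend has ingested

    def collector_export(self, span):
        # The collector only writes to the topic, never to the backend.
        self.topic.put(span)

    def consumer_drain(self):
        # Forward spans only while the backend is reachable.
        delivered = 0
        while self.backend_up and not self.topic.empty():
            self.stored.append(self.topic.get())
            delivered += 1
        return delivered


pipeline = BufferedPipeline()
pipeline.backend_up = False                   # simulate a backend outage
for i in range(3):
    pipeline.collector_export({"trace_id": i})
outage_delivered = pipeline.consumer_drain()  # nothing delivered, nothing lost
pipeline.backend_up = True                    # backend recovers
recovered = pipeline.consumer_drain()         # backlog is replayed in order
```

In a real deployment the topic would need to survive restarts, which is the part an in-memory queue cannot show.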

akshayKMR

5 months ago

I've been putting off a self-hosted observability setup for a long time. Any recommendations on the basis of ease of setup and operation? (For something low-medium scale.)

My ideal setup would be to just write SQL on telemetry data and plot dashboards / set alerts.

Also, thoughts on Vector vs otel agent?

srcreigh

5 months ago

HyperDX is really great. It is basically SQL on telemetry data in clickhouse.

Don’t use vector or otel-agent. Add a materialized view in clickhouse to transform data and swap HyperDX to load from your view (in the UI.)

Jedd

5 months ago

> For something low-medium scale.

This isn't a lot to go on.

The important thing is what you're trying to instrument - hosts, applications, network, microservices, all of the above? (And then whether you want a few weeks' retention, or to keep years' worth.)

Grafana in front of Prometheus with node-exporter or telegraf (it can expose in prometheus mode) on the clients -- will tick a lot of boxes and is fast to get going.

Grafana in front of InfluxDB + telegraf is similar, but personally I find PromQL easier than InfluxQL.

> ... write SQL on telemetry data and plot dashboards / set alerts.

Read up about the design of TSDBs and log / tracing datastores - their design & intent heavily influences their query languages.

diurnalist

5 months ago

> Also, thoughts on Vector vs otel agent?

IMO, with the current tech, it entirely depends on what data you're talking about.

For metrics and traces, I would use the OTel collector personally. You will have much more flexibility and it's pretty easy to write custom processors in Go. Support for traces is quite mature and metrics isn't far off. We've been running collectors for production scale of metric and trace ingest for the past couple of years, on the order of 1m events/sec (metric datapoints or spans). You mentioned low volume so that's less important, but I just wanted to mention in case others find this comment.

Logs are a bit different. We looked into this in the past year. Vector has emerging support for OTLP but it's pretty early. Still, I bet it's pretty straightforward if your backend can ingest via OTLP. Our main concern with running the otel-collector as the log ingest agent was around throughput/performance. Vector is battle-tested, otel is still a bit early in this space. I imagine over time the gap will be closed but I would probably still reach for Vector for this use-case at higher scale. That said, YMMV and as with any technical decision, empirical data and benchmarking on your workloads will be the best way to determine the tradeoffs.

For your scale you could probably get away with an OTel collector daemonset and maybe a deployment with the Target Allocator (to allocate Prometheus scrapes) and call it a day :)

GordonS

5 months ago

I'm using OpenObserve - it does logs, metrics and traces all under one roof. Handles alerts too.

It's been solid, but the UI is kind of clunky and a little buggy here and there. Dashboards are tricky to set up too. But it has no dependencies, and was easy to set up, and I couldn't find anything else that handled logs too.

pranay01

5 months ago

You might want to take a look at SigNoz (https://github.com/SigNoz/signoz): logs, metrics & traces in a single pane, and you can create advanced alerts and dashboards as well.

PS: I am one of the maintainers

godisdad

5 months ago

I was able to set up SigNoz in about five minutes to view traces in my Dagger builds locally just by exporting the right env vars. It was nice to not have to run and orchestrate three+ tools together.

GordonS

5 months ago

Having to deploy and manage ClickHouse put me off a little - OpenObserve uses its own binary format, so there are zero external dependencies, making it super easy to set up. Maybe ClickHouse is nothing to worry about, but it's something I've never used before.

Also, I wasn't sure if Zookeeper was mandatory even for a single-server SigNoz install?

SigNoz UI certainly looks more polished tho!

sdairs

5 months ago

ClickHouse doesn't use ZooKeeper anymore, and if you're just using a single server you don't need to worry about coordination :)

ClickStack/HyperDX is a polished OOTB stack that has an all in one image you can deploy to get started, so you don't need to worry about the ClickHouse side until you need to really scale (which is where ClickHouse really shines).

GordonS

5 months ago

Interesting! Does HyperDX handle logs as well as metrics and traces?

bonobocop

5 months ago

Yeah, handles all the OTel signals

sdairs

5 months ago

Yeah metrics, logs, traces and can do session replay

cyberax

5 months ago

I've been using Uptrace in our docker-compose local setup. It runs just fine on a MacBook Air, and has support for tracing, metrics, and logs.

The UI is predictably an annoying mess, but that's the case with EVERY tracing solution I've tried. Very much including SigNoz.

pranay01

5 months ago

SigNoz maintainer here. Curious, when did you try SigNoz (which version/which timeframe), and do you have any specific feedback on what you don't like about its tracing UI? It would be helpful for us to understand areas to improve on.

cyberax

5 months ago

Sigh. I have _plenty_ of gripes that would be easy to fix. My first "sniff test" for observability platforms is a tool to quickly jump to a given trace/span by ID. You don't have it. Uptrace has it.

Another issue is the complexity of switching between filtered views. A very useful primitive that you and Uptrace are missing: "show this event within the surrounding context". CloudWatch has it.

The other main overarching issue is ease of navigation and switching between contexts. You are actually somewhat better than Uptrace because I can actually cut&paste URLs on most of the pages and send them to my colleague over Slack.

But you make up for that by having bad search in traces (e.g. I can't just search all the traces with the word "UploadDoc" somewhere in them). Here's how Uptrace works: https://imgur.com/a/UWSdIEt

Your "Trace View" is ridiculous: I can't resize columns, I can't drag them to change the order, I can't even _show_ additional columns even though I can sort by them: https://www.loom.com/share/d5fa401384d94959978c0bb2be9010a5?...

Then you also are freaking annoying with the UI. I don't even care about everything getting extra-bloated. It's just par for the course for the modern UI vibe-based design.

But I get almost physically sick from these ridiculous popups: https://www.loom.com/share/21f5efdae8b84b12ba09c45cd2fa0855?...

Honestly, I think that most observability stacks (very much including SigNoz) are focusing on looking hip with cool dashboards. They totally suck when I need to dig deep into logs to find what happened.

pranay01

5 months ago

thanks for the detailed note

> My first "sniff test" for observability platforms is a tool to quickly jump to a given trace/span by ID.

You should be able to do this in SigNoz https://www.loom.com/share/71a2a95b76584b3983d9eeebb60ac420?...

> "show this event within the surrounding context"

we have this in the context logs. does this solve your use case or you mean something else? https://www.loom.com/share/9039afd5c4bf45e7b357a22c9943bb32?...

>But you make up for that by having bad search in traces

Did you mean for this to search across all attributes in spans, or when you know which attribute you want to search in? If the latter, then you can do this through our query builder even today.

Your feedback on "Trace View" is fair. We are planning some improvements on that

oulipo2

5 months ago

I've been looking at HyperDX (ClickStack) and SigNoz, but those indeed are coupled

srcreigh

5 months ago

I tried both. SigNoz is pretty sloppily built. For example, the self-hosted option starts a ZK instance with 1 ClickHouse host: no way to disable it, 800MB RAM. The SigNoz log transformation tool is broken and confusing.

HyperDX is just a lot better, sure a few papercuts but they got all the important stuff right imo.

pranay01

5 months ago

hey, SigNoz maintainer here.

Can you share which version of SigNoz you tried, or in what time frame? We recently made a lot of improvements in how you can host SigNoz, including support for Postgres and better docs for self-hosting correctly - https://signoz.io/docs/collection-agents/get-started/

PeterZaitsev

5 months ago

You can also consider https://coroot.com - it supports integration both with Collector and without as well as simulated eBPF traces for applications which are not Otel enabled.

sdairs

5 months ago

Sounds like you should take a look at ClickStack (HyperDX) to me

ndhandala

5 months ago

OneUptime does this with otel. Happy to help! Feel free to reach out at nawazdhandala [at] oneuptime [dot] com

smarx007

5 months ago

Seq?

CuriouslyC

5 months ago

This is my architecture, but with NATS. Can confirm, works well.

gm678

5 months ago

I know you can export directly to the backend, but the collector typically uses less than 50MB of RAM in my experience (even when handling lots of traces) and it's pretty easy to add sidecars to however you deploy your backends nowadays. With Grafana SaaS, metrics could look a little spiky or generally weirder without the collector, but normal with it, so I just default to using it now.

It's a shame the docs on it are still quite bad. The example config in the article here does look almost identical to the one we use everywhere, just without the redact, and should probably be pasted somewhere into the official docs.

Every provider seems to produce their own soft fork of the collector for branding (eg Alloy, ADOT, etc) and slightly changes the configuration, which doesn't help.

theletterf

5 months ago

The docs are open to contributions! Anybody can add better examples.

ejs

5 months ago

Otel stuff always seems overly complicated to me, but it must just be the types of projects I generally work with. Feels like observability meets Java.

I've dabbled in building a project that collects metrics from the logs for smaller projects. Everyone tells me it's a bad idea, but it seems to work well for me.
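For readers wondering what deriving metrics from logs can look like at small scale, here is a minimal sketch, not the commenter's project. It assumes a made-up log format (`<timestamp> <LEVEL> <message> duration_ms=<int>`); the function name `metrics_from_logs` and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumed log format (hypothetical): "<timestamp> <LEVEL> <message> duration_ms=<int>"
LINE_RE = re.compile(r"^\S+ (?P<level>[A-Z]+) .*?duration_ms=(?P<ms>\d+)")


def metrics_from_logs(lines):
    """Derive per-level counts and an average duration from log lines."""
    level_counts = Counter()
    total_ms = 0
    matched = 0
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that do not fit the assumed format
        level_counts[m.group("level")] += 1
        total_ms += int(m.group("ms"))
        matched += 1
    avg_ms = total_ms / matched if matched else 0.0
    return {"by_level": dict(level_counts), "avg_duration_ms": avg_ms}


sample = [
    "2024-01-01T00:00:00Z INFO GET /health duration_ms=4",
    "2024-01-01T00:00:01Z ERROR GET /upload duration_ms=96",
    "not a structured line",
]
result = metrics_from_logs(sample)
```

The usual objection is that the metric silently breaks when the log format changes, which is why unmatched lines are worth counting in anything beyond a toy.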

jiggawatts

5 months ago

Open Telemetry is the XML of Observability, and is providing the same value XML did when it was first introduced: interoperability.

Eventually it'll have successors that are better in some way, more efficient, or whatever, but right now there are no alternatives at all. Open Telemetry is the first common standard that multiple vendors have signed up to.

k_bx

5 months ago

I'm evaluating Greptimedb in prod and while I hate to have a needless component like OTEL-Collector in general, it serves as a read-only gate between the database and the user, so that greptime keeps listening to localhost only, and OTEL-Collector guarantees nobody will write to the database directly.

If it were to give more fine-grained control over write-only access -- would probably just write directly and let it handle the load.

killme2008

5 months ago

Thank you for evaluating GreptimeDB.

We agree that fine-grained access control is important. A read-only user role will be available in the next major release.

kjuulh

5 months ago

I had a brief look at greptime db. And I'd like to give a little bit of feedback on your funnel. It is clear that your product marketing is targeting business folks rather than developers. That 3 minute vid on the frontpage was next to useless for me. Also very clearly AI.

Having stats is nice but i am not choosing your product because of stats. I actually think greptimedb is exactly what I am looking for, I.e. a humio / falcon logscale alternative. But I had to do some digging to actually infer that.

Your material doesn't highlight what sets you apart from the competition. That matters if you want to target developers, which you might not; I don't know.

I want to debug issues using freetext search, i want to be able to aggregate stats i care about on demand.

k_bx

5 months ago

I understand that it's mostly for enterprise users, but please also add the ability to limit which databases a user can see, and add "write-only" access.

Or maybe I'll contribute this piece myself when I have time :)

p.s.: btw, I love Greptime so far, thank you for the product!

pewpewp

5 months ago

I did not like working with OpenTelemetry; made me miss the good old days (monolith).

CharlieDigital

5 months ago

OTEL is still useful in a monolith because it allows you to correlate things together, attach nested span, attach events, etc.

cyberax

5 months ago

Just always use a collector. It really simplifies your life. Your app then just always talks with localhost without any authentication. And during the development, you can just set up local Uptrace to help you with debugging with just a different collector config.

And while all the tracing providers speak the OTEL protocol, the way you do auth is not the same. Sometimes you need to specify it in a header, sometimes it's a part of the URL.
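A minimal sketch of why that difference is easier to absorb in one place: the app always targets a local endpoint, and only the forwarding side knows how a given provider wants its credentials. Everything here is an assumption for illustration: the function `exporter_settings`, the `Authorization: Bearer` header, the `token` query parameter, and the `backend.example` URL are hypothetical, not any specific provider's scheme. The local port 4318 is the usual OTLP/HTTP default.

```python
from urllib.parse import urlencode

# The app side never changes: it talks to a local collector without auth.
APP_ENDPOINT = "http://localhost:4318/v1/traces"


def exporter_settings(base_url, token, auth_style):
    """Return (endpoint, headers) for a hypothetical OTLP/HTTP backend."""
    if auth_style == "header":
        # Credential travels in a request header (header name is an assumption).
        return base_url, {"Authorization": f"Bearer {token}"}
    if auth_style == "url":
        # Credential is part of the URL (parameter name is an assumption).
        return f"{base_url}?{urlencode({'token': token})}", {}
    raise ValueError(f"unknown auth style: {auth_style}")


header_endpoint, header_headers = exporter_settings(
    "https://backend.example/v1/traces", "s3cr3t", "header"
)
url_endpoint, url_headers = exporter_settings(
    "https://backend.example/v1/traces", "s3cr3t", "url"
)
```

Swapping providers then means changing the forwarding settings only, while `APP_ENDPOINT` in the application stays the same.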

TechIsCool

5 months ago

This collector is one of my favorites to ask Copilot Agent to use for validation when the stack is missing tests. You give the agent a couple of well-written prompts describing what you expect to happen, and since the app has distributed tracing enabled, all logs flow to text and are consumable by the agent.
