stego-tech
a day ago
Excellent critique of the state of observability, especially for us IT folks. We’re often the first - and last, until the bills come - line of defense for observability in orgs lacking a dedicated team. SNMP traps get us 99% of the way there with anything operating in a standard way, but OTel/Prometheus/New Relic/etc all want to get “in the action” in a sense, and hoover up as many data points as possible.
Which, sure, if you’re willing to pay for it, I’m happy to let you make your life miserable. But I’m still going to be the Marie Kondo of IT and ask if that specific data point brings you joy. Does having per-second interval data points actually improve response times and diagnostics for your internal tooling, or does it just make you feel big and important while checking off a box somewhere?
Observability is a lot like imaging or patching: a necessary process to be sure, but do you really need a Cadillac Escalade (New Relic/Datadog/etc) to go to the grocery store when a Honda Accord (self-hosted Grafana + OTel) will do the same job more efficiently for less money?
Honestly regret not picking the brain of the Observability lead at BigCo when I had the chance. What little he showed me (self-hosted Grafana for $90/mo in AWS ECS for the corporate infrastructure of a Fortune 50? With OTel agents consuming 1/3 to 1/2 the resources of New Relic agents? Man, I wish I had jumped down that specific rabbit hole) was amazingly efficient and informative. Observability done right.
jsight
a day ago
>Observability is a lot like imaging or patching: a necessary process to be sure, but do you really need a Cadillac Escalade (New Relic/Datadog/etc) to go to the grocery store when a Honda Accord (self-hosted Grafana + OTel) will do the same job more efficiently for less money?
The way that I've seen it play out is something like this:
1. We should self host something like Grafana and otel.
2. Oh no, the teams don't want to host individual instances of that, we should centralize it!
(2b - optional, but common: a random team gets saddled with this job)
3. Oh no, the centralized team is struggling with scaling issues and the service isn't very reliable. We should outsource it for 10x the cost!
This will happen even if they have a really nice set of deployment infrastructure and patterns that could have allowed them to host observability at the team level. It turns out, most teams really don't need the Escalade, they just need some basic graphs and alerts. Self-hosting needs to be more common within organizations.
baby_souffle
a day ago
Another variant of step 2: some individual with a little bit of political capital sees something new and shiny and figures out how to be the first project internally to use InfluxDB, for example, over Prometheus... And now you have a patchwork of dashboards, each broken in their own unique way...
rbanffy
a day ago
> But I’m still going to be the Marie Kondo of IT and ask if that specific data point brings you joy.
There seems to be a strong "instrument everything" culture that, I think, misses the point. You want simple metrics (machine and service) for everything, but if your service gets an error every million requests or so, it might be overkill to trace every request. And, for the errors, you usually get a nice stack dump telling you where everything went wrong (and giving you a good idea of what was wrong).
At that point - and only at that point - I'd say it's worth TEMPORARILY adding increased logging and tracing. And yes, it's OK to add those and redeploy TO PRODUCTION.
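A minimal sketch of what "temporary" can look like in practice, assuming a plain Python service using the standard logging module and a made-up TRACE_SAMPLE_RATE knob (all names here are mine, not from any particular stack): verbosity is driven by environment variables, so a redeploy can crank it up while you chase the bug and dial it back afterwards.

    import logging
    import os
    import random

    # Read verbosity knobs from the environment so a redeploy (or config
    # change) can raise them temporarily and dial them back later.
    LOG_LEVEL = os.getenv("LOG_LEVEL", "WARNING")  # e.g. DEBUG while chasing a bug
    TRACE_SAMPLE_RATE = float(os.getenv("TRACE_SAMPLE_RATE", "0.0"))  # 0.0 = off, 1.0 = trace everything

    logging.basicConfig(level=getattr(logging, LOG_LEVEL.upper(), logging.WARNING))
    log = logging.getLogger("my_service")

    def should_trace() -> bool:
        # Cheap per-request decision: only trace the configured fraction of requests.
        return random.random() < TRACE_SAMPLE_RATE

    def handle_request(payload: dict) -> dict:
        if should_trace():
            log.debug("tracing request: %r", payload)
        # ... normal request handling ...
        return {"ok": True}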
prymitive
a day ago
> There seems to be a strong "instrument everything" culture
Metrics are the easiest way to expose your application's internal state, and then, as a maintainer of that service, you're in nirvana. And even if you don't go that far, you're likely an engineer writing code, and when it comes time to add some metrics, why wouldn't you add more rather than less? And once you have all of them, why not add all possible labels? Meanwhile your Prometheus server is in a crash loop because it ran out of RAM, but that's not a problem visible to you. Unfortunately there's a big gap in understanding between writing instrumentation code in an editor and its effect on resource usage at the other end of your observability pipeline.
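To make the label-cardinality point concrete, a small sketch using the Python prometheus_client library (metric names are made up for illustration): every distinct label combination becomes its own time series, so a bounded label set stays cheap while an unbounded one, like a user ID, multiplies series counts and eventually memory on the Prometheus side.

    from prometheus_client import Counter

    # Bounded labels: a handful of methods x a handful of status codes
    # stays at a few dozen time series.
    REQUESTS = Counter(
        "app_requests_total",
        "Requests handled",
        ["method", "status"],
    )
    REQUESTS.labels(method="GET", status="200").inc()

    # Unbounded label: one new time series per distinct user ID.
    # This is what quietly eats the RAM of the Prometheus server downstream.
    REQUESTS_BY_USER = Counter(
        "app_requests_by_user_total",
        "Requests handled, per user (cardinality trap)",
        ["user_id"],
    )
    REQUESTS_BY_USER.labels(user_id="u-93284711").inc()  # don't do this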
sshine
a day ago
I can only say, I tried to add massive amounts of data points to a fleet of battery systems once; 750 cells per system, 8 metrics per cell, one cell every 20 ms. It became megabits per second, so we only enabled it when engaging the batteries. But the data was worth it, because we could do data modelling on live events in retrospect when we were initially too busy fixing things. Observability is a super power.
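Back-of-the-envelope on why that adds up, assuming each of the 750 cells is sampled every 20 ms and roughly 4 bytes per value (both assumptions on my part, ignoring protocol overhead):

    # Rough throughput estimate for the battery telemetry described above.
    cells = 750
    metrics_per_cell = 8
    sample_interval_s = 0.020   # one sample per cell every 20 ms (assumed)
    bytes_per_value = 4         # assumed float32 payload, no overhead

    samples_per_second = cells * metrics_per_cell / sample_interval_s   # 300,000
    bits_per_second = samples_per_second * bytes_per_value * 8          # ~9.6 Mbit/s

    print(f"{samples_per_second:,.0f} samples/s is roughly {bits_per_second / 1e6:.1f} Mbit/s")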
baby_souffle
a day ago
This right here! Don't be afraid to over-instrument. You can always downsample or do basic statistical sampling before you actually commit your measurements to a time-series database.
As annoying as that may sound, it's a hell of a lot harder to go back in time to observe that bizarre intermittent issue...
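A minimal sketch of what "downsample before you commit" can look like (pure Python; the 1-second bucket and mean aggregation are arbitrary choices of mine): keep the raw firehose briefly, write only one aggregated point per metric per bucket to the TSDB.

    from collections import defaultdict
    from statistics import mean

    def downsample(samples, bucket_s=1.0):
        """Collapse (timestamp, metric, value) samples into one mean value
        per metric per time bucket before writing to the time-series database."""
        buckets = defaultdict(list)
        for ts, metric, value in samples:
            bucket = int(ts // bucket_s) * bucket_s
            buckets[(bucket, metric)].append(value)
        return [
            (bucket, metric, mean(values))
            for (bucket, metric), values in sorted(buckets.items())
        ]

    # 50 raw points per second per cell becomes 1 point per second per cell.
    raw = [(0.00, "cell_42_voltage", 3.71), (0.02, "cell_42_voltage", 3.70),
           (0.04, "cell_42_voltage", 3.69), (1.01, "cell_42_voltage", 3.68)]
    print(downsample(raw))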
rbanffy
13 hours ago
This is very true for systems and components you don’t thoroughly understand.
By all means instrument batteries, motors, gearboxes, positioning systems, turbines, and so on. If your system is mostly a CRUD app or some business logic automation, then the need to over-instrument it is much smaller. Databases are well understood, after all.
mping
a day ago
On paper this looks smart, but when you hit a bug that triggers under very specific conditions (weird bugs happen more often as you scale), you are gonna wish you had tracing for that.
The ideal setup is that you trace as much as possible within some given time frame; if your stack supports compression and tiered storage, it becomes cheaper.
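One common way to get "trace a lot, but keep it affordable" is tail-style sampling: keep every trace that contains an error, keep only a fraction of the healthy ones, and let older data age into cheaper compressed/tiered storage. A rough Python sketch of just the keep/drop decision (the names and the 10% rate are mine, not any particular vendor's):

    import random

    KEEP_HEALTHY_FRACTION = 0.10   # assumed: retain 10% of error-free traces

    def keep_trace(spans) -> bool:
        """Decide after the trace is complete whether to retain it."""
        has_error = any(span.get("status") == "ERROR" for span in spans)
        if has_error:
            return True                          # always keep the weird ones
        return random.random() < KEEP_HEALTHY_FRACTION

    # An all-OK trace is usually dropped; an erroring one is always kept.
    ok_trace = [{"name": "GET /cart", "status": "OK"}]
    bad_trace = [{"name": "GET /cart", "status": "OK"},
                 {"name": "db.query", "status": "ERROR"}]
    print(keep_trace(ok_trace), keep_trace(bad_trace))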
rbanffy
13 hours ago
> you are gonna wish you had tracing for that
It’s never too late to add data collection and redeploy the affected service. If the bug is bad, you’ll be able to catch it then.
Nextgrid
a day ago
> but do you really need a Cadillac Escalade (New Relic/Datadog/etc) to go to the grocery store
Depends on whether your objective is to go to the grocery store or merely to show off going to the grocery store.
During the ZIRP era there was a financial incentive for everyone to over-engineer things to justify VC funding rounds and appear "cool". Business profitability/cost-efficiency was never a concern (a lot of those businesses were never viable and their only purpose was to grift VC money and enjoy the "startup founder" lifestyle).
Now ZIRP is over, but the people who started their career back then are still here and a lot of them still didn't get the memo.
stego-tech
a day ago
> During the ZIRP era there was a financial incentive for everyone to over-engineer things to justify VC funding rounds and appear "cool".
Yep, and what’s worse is that…
> Now ZIRP is over, but the people who started their career back then are still here and a lot of them still didn't get the memo.
…folks let go from BigTech are filtering into smaller orgs, and the copy-pasters and “startup lyfers” are bringing this attitude with them. I guess I got lucky enough to develop my interest in tech before the dotcom crash, start my career just before the 2008 crash, and finish my BigTech tenure just after COVID (and before the likely AI crash), and thus I’m always weighing the costs versus the benefits and trying to be objective.
Nextgrid
a day ago
> folks let go from BigTech are filtering into smaller orgs, and the copy-pasters and “startup lyfers” are bringing this attitude with them
Problem is, not all of them are even doing this intentionally. A lot actually started their career during that clown show, so for them this is normal and they don't know any other way.
stego-tech
a day ago
Yeah, very true, and those of us with more life and career experience (it me) have a societal contract of sorts to teach and lead them out of bad habits or design choices. If we don’t show them a better path forward, they’ll have to suffer to seek it out just like we had to.