mnahkies
2 months ago
That was difficult to read and smelt very AI assisted, though the message was worthwhile; it could've been shorter and more to the point.
A few things I've been thinking about recently:
- we have authentication everywhere in our stack, so I've started including the user id on every log line. This makes getting a holistic view of what a user experienced much easier (rough sketch after this list).
- logging an error as a separate log line from the request log is a pain. You can filter for the trace, but it makes it hard to surface "show me all the logs for 5xx requests and the associated error" - it's doable, but it's more difficult than filtering on the status code of the request log
- it's not enough to just start including that context; you have to educate your coworkers that it's now present. I've seen people making life hard for themselves because they didn't realize we'd added this context
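A rough sketch of the user-id-on-every-line idea, assuming Python's standard logging module (the get_current_user_id helper and the log format are made up; in practice they'd come from whatever your auth stack exposes):

    import logging

    def get_current_user_id() -> str:
        # Hypothetical helper - resolve the authenticated user however your stack does it.
        return "user-123"

    class UserIdFilter(logging.Filter):
        """Attach the authenticated user id to every log record."""
        def filter(self, record: logging.LogRecord) -> bool:
            record.user_id = get_current_user_id()
            return True  # never drop records, just enrich them

    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s user_id=%(user_id)s %(message)s"
    ))
    handler.addFilter(UserIdFilter())

    logger = logging.getLogger("app")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    logger.info("fetched profile")  # -> ... INFO user_id=user-123 fetched profile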
xmprt
2 months ago
On the other hand, investing in better tracing tools unlocks a whole other level of logging and debugging capabilities that aren't feasible with just request logs. It's like what you mentioned about using the user id as a "trace" in your first message, but on steroids.
dexwiz
2 months ago
These tools tend to be very expensive in my experience unless you are running your own monitoring cloud. Either you end up sampling traces at low rates to save on costs, or your observability bill is more than your infrastructure bill.
jonasdegendt
2 months ago
We self host Grafana Tempo and whilst the cost isn’t negligible (at 50k spans per second), the money saved in developer time when debugging an error, compared to having to sift through and connect logs, is easily an order of magnitude higher.
dietr1ch
2 months ago
Doing stuff like turning on tracing for clients that saw errors in the last 2 minutes, or for requests that were retried, should only gather a small portion of your data. Maybe you can include other sessions/requests at random if you want a baseline to compare against.
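A minimal sketch of that kind of decision logic (the client-id map and thresholds are invented for illustration; a real setup would push this into the tracing pipeline as tail-based or rule-based sampling):

    import random
    import time

    ERROR_WINDOW_SECONDS = 120   # "saw an error in the last 2 minutes"
    BASELINE_SAMPLE_RATE = 0.01  # random baseline to compare against

    # Hypothetical in-memory map of client id -> timestamp of last error.
    last_error_at: dict[str, float] = {}

    def record_error(client_id: str) -> None:
        """Call this when a request for the client fails."""
        last_error_at[client_id] = time.time()

    def should_trace(client_id: str, is_retry: bool) -> bool:
        """Decide whether to record a full trace for this request."""
        recently_errored = (
            time.time() - last_error_at.get(client_id, 0.0) < ERROR_WINDOW_SECONDS
        )
        if recently_errored or is_retry:
            return True
        # Keep a small random sample of healthy traffic as a baseline.
        return random.random() < BASELINE_SAMPLE_RATE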
valyala
2 months ago
Try open-source databases specially designed for traces, such as Grafana Tempo or VictoriaTraces. They can handle ingestion rates of hundreds of thousands of trace spans per second on a regular laptop.
sahilagarwal
a month ago
I like to write them on my own in every company I'm in, using bash. So I have a local set of bash commands to help me figure out logs and colorize the items I want to.
Takes some time and it's a pain in the ass initially, but once I've matured them, work becomes so much easier. It reduces dependence on other people / teams / access as well.
Edit: Thinking about this, they won't work in other use cases. I'm a data engineer, so my jobs are mostly sequential.
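A rough Python equivalent of the idea, purely illustrative (the actual scripts are bash; the patterns and colours here are arbitrary examples):

    import re
    import sys

    # Highlight rules: regex -> ANSI colour code. Adjust to taste.
    RULES = [
        (re.compile(r"\bERROR\b"), "\033[31m"),       # red
        (re.compile(r"\bWARN(ING)?\b"), "\033[33m"),  # yellow
        (re.compile(r"\buser_id=\S+"), "\033[36m"),   # cyan
    ]
    RESET = "\033[0m"

    for line in sys.stdin:
        for pattern, colour in RULES:
            line = pattern.sub(lambda m, c=colour: f"{c}{m.group(0)}{RESET}", line)
        sys.stdout.write(line)

Used as a pipe, e.g. tail -f app.log | python colorize.py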
oulipo2
2 months ago
I've tried HyperDX and SigNoz; they seem easy to self-host and decent enough
spike021
2 months ago
If your codebase has the concept of a request ID, you could also feasibly use that to trace what a user has been doing with more specificity.
ivan_gammel
2 months ago
…and the same ID can be displayed to the user on an HTTP 500 along with the support contact, making everyone's life much easier.
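A rough sketch of the pattern, assuming a Flask app (the X-Request-ID header name and the wording are just conventions, not anything stack-specific):

    import logging
    import uuid

    from flask import Flask, g, request

    app = Flask(__name__)
    log = logging.getLogger("app")

    @app.before_request
    def assign_request_id():
        # Reuse an upstream id if one was sent, otherwise mint a new one.
        g.request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))

    @app.errorhandler(Exception)
    def internal_error(exc):
        request_id = getattr(g, "request_id", "unknown")
        # Full detail goes to the logs, keyed by the same id the user sees.
        log.error("unhandled error request_id=%s", request_id, exc_info=exc)
        body = ("Something went wrong. Please contact support and "
                f"quote reference {request_id}.")
        return body, 500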
dexwiz
2 months ago
I have seen pushback on this kind of behavior because "users don't like error codes" or other such nonsense. UX and Product like to pretend nothing will ever break, and when it does they want some funny little image, not useful output.
A good compromise is to log whenever a user would see the error code, and treat those events with very high priority.
spockz
2 months ago
We put the error code behind a kind of message/dialog that invites the user to contact us if the problem persists and then report that code.
It’s my long-standing wish to be able to link traces/errors automatically to callers when they call the helpdesk. We have all the required information. It’s just that the helpdesk actually has very little use for this level of detail, so they can only attach it to the ticket so that the actual application teams don’t have to search for it.
inkyoto
2 months ago
> I have seen pushback on this kind of behavior because "users don't like error codes" or other such nonsense […]
There are two dimensions to it: UX and security.
Displaying excessive technical information on an end-user interface will complicate support and likely reveal too much about the internal system design, making it vulnerable to external attacks.
The latter is particularly concerning for any design facing the public internet. A frequently recommended approach is exception shielding. It involves producing two messages upon encountering a problem: a nondescript user-facing message (potentially including a reference ID pinpointing the problem in space and time) and a detailed internal log message with the problem’s details and context for L3 support / engineering.
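A minimal sketch of exception shielding (names are made up; the point is the two messages, detailed internally and nondescript externally):

    import logging
    import uuid

    log = logging.getLogger("internal")

    def shielded(operation, *args, **kwargs):
        """Run an operation; on failure, log the details internally and
        return only a nondescript message plus a reference id."""
        try:
            return {"ok": True, "result": operation(*args, **kwargs)}
        except Exception:
            ref = uuid.uuid4().hex[:12]  # reference id pinpointing the incident in space and time
            # Detailed internal message: stack trace, operation name, reference id.
            log.exception("operation=%s ref=%s failed", operation.__name__, ref)
            # Nondescript user-facing message: nothing about the internals leaks out.
            return {"ok": False, "message": f"Something went wrong. Reference: {ref}"}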
dist1ll
2 months ago
Sorry for the OT response, I was curious about this comment[0] you made a while back. How did you measure memory transfer speed?
inkyoto
2 months ago
I used «powermetrics» bundled with macOS with «bandwidth» as one of the samplers (--samplers / -s set to «cpu_power,gpu_power,thermal,bandwidth»).
Unfortunately, Apple has taken out the «bandwidth» sampler from «powermetrics», and it is no longer possible to measure the memory bandwidth as easily.
KronisLV
2 months ago
> UX and Product like to pretend nothing will ever break, and when it does they want some funny little image, not useful output.
Just ignore them, or provide appeasement insofar as it doesn’t mess with your ability to maintain the system.
(cat picture or something)
Oh no, something went wrong.
Please don’t hesitate to reach out to our support: (details)
This code will better help us understand what happened: (request or trace ID)
ivan_gammel
2 months ago
Nah, that’s an easy problem to solve with UX copy. „Something went wrong. Try again or contact support. Your support request number is XXXX XXXX“ (base58 version of a UUID).
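A minimal sketch of the base58 part (a full 128-bit UUID comes out at roughly 22 characters with the usual Bitcoin-style alphabet, so a code as short as XXXX XXXX implies a smaller id):

    import uuid

    # Base58 alphabet: no 0/O or I/l, which helps when a user reads the code to support.
    ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

    def uuid_to_base58(u: uuid.UUID) -> str:
        n = int.from_bytes(u.bytes, "big")
        out = []
        while n:
            n, rem = divmod(n, 58)
            out.append(ALPHABET[rem])
        return "".join(reversed(out)) or ALPHABET[0]

    print(uuid_to_base58(uuid.uuid4()))  # prints a ~22-character code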
mnahkies
2 months ago
We do have both a span id and a trace id, but I personally find this more cumbersome than filtering on a user id. YMMV - if you're interested in a single trace then you'd filter for that, but I find you often also care what happened "around" a trace
kulahan
2 months ago
If you care about this more than anything else (e.g. if you care about audits a LOT and need them perfect), you can simply code the app via action paths, rather than for modularity. It makes changes harder down the road, but for codebases that don’t change much, this can be a viable tradeoff to significantly improve tracing and logging.
nine_k
2 months ago
...if it does not, you should add it. A request ID, trace ID, correlation key, whatever you call it, you should thread it through every remote call, if you value your sanity.
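A rough sketch of threading it through, using contextvars plus requests in Python (the X-Correlation-ID header name is just a common convention; the W3C traceparent header is the standardized equivalent):

    import contextvars
    import uuid

    import requests

    # Holds the correlation id for the request currently being handled.
    correlation_id = contextvars.ContextVar("correlation_id", default="unset")

    def start_request(incoming_headers: dict) -> None:
        """At the edge: reuse the caller's id if present, otherwise mint one."""
        correlation_id.set(incoming_headers.get("X-Correlation-ID") or str(uuid.uuid4()))

    def call_downstream(url: str) -> requests.Response:
        """Every outgoing call carries the same id so logs line up across services."""
        return requests.get(url, headers={"X-Correlation-ID": correlation_id.get()})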
giancarlostoro
2 months ago
TIDs are good here too. If you generate one and enforce it across all your services, spanning various teams and APIs, anyone on any team can grab a TID you provide and get the full end-to-end view of one transaction.
valtism
2 months ago
Wow, I didn't think this was badly written at all! I certainly don't think it smells like AI. Are you conflating lists with AI written prose?
weebull
a month ago
> - we have authentication everywhere in our stack, so I've started including the user id on every log line. This makes getting a holistic view of what a user experienced much easier.
Depends on the service, but tracking everything a user does may not be an option under data retention laws
khazhoux
2 months ago
> That was difficult to read and smelt very AI assisted, though the message was worthwhile...
It won’t be long before ad computem comments like this are frowned upon.
bccdee
2 months ago
Why? "This was written badly" is a perfectly normal thing to say; "this was written badly because you didn't put in the effort of writing it yourself" doubly so.
0xbadcafebee
2 months ago
Say they used AI to write it, it came out bad, and they published it anyway. They had the opportunity to "make it better" before publishing, but didn't. The only conclusion is that they just aren't good at writing. So whether or not AI is used, it'll suck either way, and there's no need to complain about the AI.
It's like complaining that somebody typed a crappy letter rather than hand-wrote it. Either way the letter's gonna suck, so why complain that it was typed?
minitech
2 months ago
Compared to human bad writing, AI writing tends to suck more verbosely and in exciting new ways (e.g. by introducing factual errors).
0xbadcafebee
2 months ago
> AI writing tends to suck more verbosely
So, it's the style you oppose, the way a grammar nazi complains about "improper" English
> and in exciting new ways (e.g. by introducing factual errors).
Because factually incorrect comments didn't exist before AI?
Your concern is that you read something you don't like, so you pick the lowest-effort criterion to complain about. Speaks more about you than the original commenter.
damentz
2 months ago
I'm pretty sure "verbose" here means the realization that you've wasted precious time reading AI bloat, time you'll never get back. On top of that, you now need to reread the text for hallucinations, or just take a loss and ignore any conclusions at the risk that they came from bad data.
Dylan16807
2 months ago
> The only conclusion for this is, they just aren't good at writing.
Not true. It's likely an effort issue in that situation.
And that kind of effort issue is good to call out, because it compounds the low quality.
0xbadcafebee
2 months ago
I don't know if you're new to the internet, but low-effort comments have existed before AI, and will continue to exist regardless of AI.
alwa
2 months ago
I read it as a more-or-less kind comment: “even though you’ll notice that they let an AI make the writing terrible, the underlying point is good enough to be worth struggling through that and discussing”
mnahkies
2 months ago
I felt unsure whether to include that particular comment, but landed on including it because I think it's a real danger. I've got no problem with people using AI and do use it for some things myself.
However, I don't think you should outsource understanding to LLMs, and I also think that shifting the effort from the writer to the reader is a poor strategy (and disrespectful to the reader).
edit: in case it's unclear, I'm not accusing the author of having outsourced their understanding to AI, but I think it's a real risk people can fall into; the value is in the thinking people put into things, not the mechanics of typing it out