rwmj
7 months ago
By far the largest issue with errno is that we don't record where inside the kernel the error gets set (or "raised" if this were a non-C language). We had a real customer case recently where a write call was returning ENOSPC, even though the filesystem did not seem to have run out of space, and searching for the place where that error got raised was a multi-week journey.
In Linux it'd be tough to implement this because errors are usually raised as a side effect of returning some negative value, but also because you have code like:
err = -EIO;
... nothing else sets err here ...
return err;
But instrumenting every function that returns a negative int would be impossible (and wrong). And there are also cases where the error is saved in (e.g.) a bottom half and returned in the next available system call.
koverstreet
7 months ago
This is such a debugging timesink that I've added two things to bcachefs to address it.
- private error codes; error codes are unique, corresponding to a specific source location. They're mapped to standard error codes when we exit from the bcachefs module; in bcachefs they make for much more useful error messages.
- error_throw tracepoint: any time we throw an error, we invoke a standard tracepoint. Extremely useful for debugging wonky-but-not-broken behavior.
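For anyone who hasn't seen the pattern, here is a minimal userspace sketch of the two ideas above. It is not the actual bcachefs code: every name here (PRIVATE_ERRCODES, throw_err, to_std_errno, the individual codes, the 2048 base) is made up for illustration, and fprintf stands in for a real tracepoint.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical base, chosen to sit above the standard errno values. */
#define PRIVATE_ERR_START 2048

/* One entry per failure site: x(standard errno class, unique private name). */
#define PRIVATE_ERRCODES()                     \
    x(ENOSPC, ENOSPC_disk_reservation)         \
    x(ENOSPC, ENOSPC_journal_full)             \
    x(EIO,    EIO_checksum_mismatch)

enum private_err {
    PRIVATE_ERR_BASE = PRIVATE_ERR_START - 1,
#define x(std, name) ERR_##name,
    PRIVATE_ERRCODES()
#undef x
    PRIVATE_ERR_MAX
};

/* Standard errno that each private code maps to. */
static const int private_err_class[] = {
#define x(std, name) std,
    PRIVATE_ERRCODES()
#undef x
};

/* Human-readable name of each private code, for messages. */
static const char *const private_err_name[] = {
#define x(std, name) #name,
    PRIVATE_ERRCODES()
#undef x
};

/* Stand-in for a tracepoint: fires at the exact place the error is raised. */
static int error_throw(int err, const char *file, int line)
{
    fprintf(stderr, "error_throw: %s at %s:%d\n",
            private_err_name[err - PRIVATE_ERR_START], file, line);
    return -err;
}

#define throw_err(name) error_throw(ERR_##name, __FILE__, __LINE__)

/* Called at the module boundary: callers outside only see standard codes. */
static int to_std_errno(int err)
{
    int code = -err;

    if (code >= PRIVATE_ERR_START && code < PRIVATE_ERR_MAX)
        return -private_err_class[code - PRIVATE_ERR_START];
    return err; /* already a standard code (or success) */
}

/* Example failure site (hypothetical). */
static int reserve_blocks(long want, long available)
{
    if (want > available)
        return throw_err(ENOSPC_disk_reservation);
    return 0;
}
```

Inside the module, reserve_blocks() returns a code that identifies one specific source location; to_std_errno() collapses it to plain -ENOSPC only on the way out.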
saagarjha
7 months ago
Yeah, I've had to kernel debug my way through this several times. It sucks greatly.
mort96
7 months ago
Same. I usually end up adding a ton of pr_warn statements to kernel code and progressively dig deeper and deeper as I figure out which path every function ends up taking, with a kernel recompile + re-flashing + rebooting in between every step. I sorely wish the kernel had better instrumentation for figuring out where error returns originate from.
There are debug prints in some places which can be enabled (if you're lucky, your kernel is even compiled to let you enable those at runtime, without recompiling!), however most of the kernel clearly does not have a culture of debug logging whenever it returns an error.
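To illustrate the kind of instrumentation I mean, here is a rough userspace sketch where every error return goes through one macro that logs the function, file and line. All names (RETURN_ERR, log_err_return, err_log_enabled, check_quota, do_write) are hypothetical, and in kernel code the fprintf would be a pr_warn or pr_debug instead.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical runtime switch, so the logging can stay compiled in. */
static int err_log_enabled = 1;

/* Log where a negative return value is leaving a function, then pass it on. */
static int log_err_return(int err, const char *func, const char *file, int line)
{
    if (err < 0 && err_log_enabled)
        fprintf(stderr, "%s:%d %s() returning %d\n", file, line, func, err);
    return err;
}

#define RETURN_ERR(err) \
    return log_err_return((err), __func__, __FILE__, __LINE__)

/* Innermost function: this is where the error originates. */
static int check_quota(long used, long limit)
{
    if (used > limit)
        RETURN_ERR(-ENOSPC);
    return 0;
}

/* Caller: propagating the error logs a second line, giving the return path. */
static int do_write(long used, long limit)
{
    int ret = check_quota(used, limit);

    if (ret)
        RETURN_ERR(ret);
    return 0;
}
```

With this, a failing do_write() leaves one log line per level of the call chain, which is the information I otherwise reconstruct by hand with a recompile per step.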
saagarjha
7 months ago
In theory I guess you can eBPF your way to this but I haven't really tried it much.
__turbobrew__
7 months ago
> We had a real customer case recently where a write call was returning ENOSPC, even though the filesystem did not seem to have run out of space
ext4 reserved blocks?