raggi
6 hours ago
There are a number of concrete problems:
- syscall interfaces are a mess: the primitive APIs are too slow for regular-sized packets (~1500 bytes) because the per-call overhead is too high. GSO helps, but it's a horrible API, and it's been buggy even recently due to complexity and poor code standards.
- syscall costs got even higher with the Spectre mitigations - and that story likely isn't over. We need a replacement for the BSD sockets / POSIX APIs; they're terrible this decade. Yes, uring is fancy, but there's a tutorial-level API middle ground possible that should be safe and have 10x less overhead without resorting to uring-level complexity.
- system UDP buffers are far too small by default - much smaller than their TCP siblings. Essentially no one but experts has been touching them, and experts just retune things themselves.
- UDP stack optimizations are possible (such as reusing route lookups without connect(2)); GSO demonstrates this, though as noted above GSO is highly fallible, quite expensive itself, and wholly unnecessarily intricate in design for what we need, particularly as we want to do this safely from unprivileged userspace.
- several optimizations currently available only work at low/mid scale, such as connect(2) binding to (potentially) avoid route lookups, or GSO only paying off on a socket without heavy peer competition (competing peers produce short offload chains due to single-peer constraints, eroding the overhead wins).
Despite all this, you can implement GSO and get substantial performance improvements - we (Tailscale) have on Linux; a rough sketch of the send path follows. At some point platforms will need to increase platform-side buffer sizes for lower-end systems, high load/concurrency, BDP, and so on. Buffers and congestion control are a highly complex and sometimes quite sensitive topic, but when many applications are doing this (the presumed future state), the need will be there.
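For illustration, a minimal sketch of the Linux UDP GSO send path (assumes Linux 4.18+ with the UDP_SEGMENT option; names and sizes here are illustrative, error handling elided):

```c
#include <netinet/in.h>
#include <netinet/udp.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103 /* from linux/udp.h, kernel >= 4.18 */
#endif

/* Hand the kernel one large buffer; it (or the NIC) splits it into
 * gso_size-byte UDP datagrams, amortizing one syscall over many packets. */
static ssize_t send_gso(int fd, const struct sockaddr_in *dst,
                        const void *buf, size_t len)
{
    uint16_t gso_size = 1200; /* payload bytes per resulting datagram */
    char ctrl[CMSG_SPACE(sizeof(gso_size))] = {0};

    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct msghdr msg = {
        .msg_name = (void *)dst, .msg_namelen = sizeof(*dst),
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
    };

    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_UDP;
    cm->cmsg_type = UDP_SEGMENT;
    cm->cmsg_len = CMSG_LEN(sizeof(gso_size));
    memcpy(CMSG_DATA(cm), &gso_size, sizeof(gso_size));

    return sendmsg(fd, &msg, 0);
}
```

The segment size can also be set once per socket with setsockopt(fd, SOL_UDP, UDP_SEGMENT, ...) instead of per call.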
JoshTriplett
6 hours ago
> Yes, uring is fancy, but there’s a tutorial level API middle ground possible that should be safe and 10x less overhead without resorting to uring level complexity.
I don't think io_uring is as complex as its reputation suggests. I don't think we need a substantially simpler low-level API; I think we need more high-level APIs built on top of io_uring. (That will also help with portability: we need APIs that can be most efficiently implemented atop io_uring but that work on non-Linux systems.)
raggi
6 hours ago
> I don't think io_uring is as complex as its reputation suggests.
uring is extremely problematic to integrate into many common application/language runtimes, and it has been demonstrably difficult to integrate into linux safely and correctly as well, with a continual stream of bugs, security and policy control issues.
in principle a shared memory queue is a reasonable basis for improving the IO cost between applications and IO stacks such as the network or filesystem stacks, but this isn't easy to do well, cf. uring bugs and binder bugs.
arghwhat
3 hours ago
Two things:
One, uring is not extremely problematic to integrate: it can be chained into a conventional event loop if you want, or even fit into a conventionally blocking design to get localized syscall benefits. That is, you do not need to convert to a fully uring event-loop design, even if that would be superior - and it can usually be kept entirely within a (slightly modified) event loop abstraction. The reason it has not yet been widely adopted is just priority - most stuff isn't bottlenecked on IOPS.
Two, yes, you could have a middle ground. I assume the syscall overhead you call out is the need to send UDP packets one at a time through sendmsg/sendto, rather than doing one big write for several packets' worth of data on TCP. An API that allowed you to provide a chain of messages, like sendmsg takes an iovec for data, is possible. But it's also possible to do this already as a tiny blocking wrapper around io_uring, saving you new syscalls - see the sketch below.
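A rough sketch of that "tiny blocking wrapper" idea (assumes liburing and an already-initialized ring; error handling and per-message cqe->res checks are elided):

```c
#include <liburing.h>

/* Queue n sendmsg operations, submit them with one syscall, and block
 * until all complete - batching without restructuring the caller. */
static int send_batch(struct io_uring *ring, int fd,
                      struct msghdr *msgs, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        if (!sqe)
            return -1; /* submission queue full */
        io_uring_prep_sendmsg(sqe, fd, &msgs[i], 0);
    }

    /* One syscall submits the whole batch and waits for n completions. */
    if (io_uring_submit_and_wait(ring, n) < 0)
        return -1;

    for (unsigned i = 0; i < n; i++) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(ring, &cqe) < 0)
            return -1;
        /* cqe->res holds the individual sendmsg() result */
        io_uring_cqe_seen(ring, cqe);
    }
    return 0;
}
```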
londons_explore
3 minutes ago
I think you need to look at a common use case and consider how many syscalls you'd like it to take and how many CPU cycles would be reasonable.
Let's take downloading a 1MB jpeg image over QUIC and rendering it on the screen.
I would hope that can be done in about 100k CPU cycles and 20 syscalls, considering that all the jpeg decoding and rendering is going to be hardware accelerated. The decryption is also hardware accelerated.
Unfortunately, no network API allows that right now. The CPU needs to do a substantial amount of processing for every individual packet, in both userspace and kernel space, for receiving the packet and sending the ACK, and there is no 'bulk decrypt' non-blocking API.
Even the data path is troublesome - there should be a way for the data to go straight from the network card to the GPU, with the CPU not even touching it, but we're far from that.
Veserv
2 hours ago
The system call to send multiple UDP packets in a single call has existed since Linux 3.0, over a decade ago [1]: sendmmsg().
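For example, a minimal sketch (a connected UDP socket is assumed; error handling elided):

```c
#define _GNU_SOURCE /* for sendmmsg() */
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Send two datagrams with a single syscall; returns how many were sent. */
static int send_two(int fd, void *p1, size_t l1, void *p2, size_t l2)
{
    struct iovec iov[2] = {
        { .iov_base = p1, .iov_len = l1 },
        { .iov_base = p2, .iov_len = l2 },
    };
    struct mmsghdr msgs[2];
    memset(msgs, 0, sizeof(msgs));
    msgs[0].msg_hdr.msg_iov = &iov[0];
    msgs[0].msg_hdr.msg_iovlen = 1;
    msgs[1].msg_hdr.msg_iov = &iov[1];
    msgs[1].msg_hdr.msg_iovlen = 1;

    /* The kernel fills msgs[i].msg_len with bytes sent per message. */
    return sendmmsg(fd, msgs, 2, 0);
}
```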
arghwhat
2 hours ago
Ah nice, in that case OP's point about syscall overhead is entirely moot. :)
That should really be in the `SEE ALSO` of `man 3 sendmsg`...
JoshTriplett
5 hours ago
> with a continual stream of bugs, security and policy control issues
This has not been true for a long time. There was an early design mistake that made it quite prone to these, but that mistake has been fixed. Unfortunately, the reputational damage will stick around for a while.
raggi
5 hours ago
13 CVEs so far this year afaik
bonzini
5 hours ago
CVE numbers from the Linux CNA are bollocks.
JoshTriplett
5 hours ago
This conversation would be a good one to point them to, to show that their policy is not just harmless point-proving but in fact causes harm.
For context, to the best of my knowledge the current approach of the Linux CNA is, in keeping with long-standing Linux security policy of "every single fix might be a security fix", to assign CVEs regardless of whether something has any security impact or not.
kuschku
3 hours ago
CVE assignment != security issue
CVE numbers are just a way to ensure everyone is talking about the same bug. Not every security issue has a CVE, not every CVE is a security issue.
Often, a regular bug turns out years later to have been a security issue, or something filed as a security issue turns out to have no security impact at all.
If you want a central authority to tell you what to think, just use CVSS instead of the binary "does it have a CVE" metric.
di4na
5 hours ago
I would not call it harm. The use of uring in higher-level languages is definitely prone to errors, bugs and security problems.
JoshTriplett
5 hours ago
See the context I added to that comment; this is not about security issues, it's about the Linux CNA's absurd approach to CVE assignment for things that aren't CVEs.
raggi
5 hours ago
this is a bit of a distraction. sure, the leaks and some of the deadlocks are fairly uninteresting, but the TOCTOU issues, overflows, uid races/confusion and so on are real issues that shouldn't be dismissed as if they don't exist.
jeffparsons
5 hours ago
I find this surprising, given that my initial response to reading the iouring design was:
1. This is pretty clean and straightforward.
2. This is obviously what we need to decouple a bunch of things without the previous downsides.
What has made it so hard to integrate it into common language runtimes? Do you have examples of where there's been an irreconcilable "impedance mismatch"?
raggi
5 hours ago
https://github.com/tailscale/tailscale/pull/2370 was a practical drive toward this; it will not proceed on this path.
much more approachable: boats has written about the challenges of integrating it in Rust: https://without.boats/tags/io-uring/
in the most general form: you need a fairly "loose" memory model to integrate the "best" (performance-wise) parts, and the "best" (ease of use / forward-looking safety) way to integrate requires C library linkage. This is troublesome in most GC'd languages and many managed runtimes. There's also the issue that uring being non-portable means the things it asks you to do (such as pinning a buffer pool, and making APIs like read not immediate-caller-allocates) require a substantially separate API for this platform than for others, or at least substantial reworks of all the existing POSIX-modeled APIs - thus back to what I said originally: we need a replacement for POSIX & BSD here, broadly applied.
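to make the buffer-pinning point concrete, a minimal liburing sketch (illustrative only; error handling elided) - the kernel owns the registered pool, which is exactly what GC'd runtimes struggle to guarantee:

```c
#include <liburing.h>

static char pool[4][4096]; /* must stay valid and unmoved once registered */

static int read_into_pool(struct io_uring *ring, int fd)
{
    struct iovec iovs[4];
    for (int i = 0; i < 4; i++) {
        iovs[i].iov_base = pool[i];
        iovs[i].iov_len = sizeof(pool[i]);
    }
    /* Pin the pool with the kernel; a moving GC would break this. */
    if (io_uring_register_buffers(ring, iovs, 4) < 0)
        return -1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    /* The read lands in pre-registered buffer 0 - not a caller-supplied
     * pointer, which is why POSIX-shaped read() APIs don't map over. */
    io_uring_prep_read_fixed(sqe, fd, pool[0], sizeof(pool[0]), 0, 0);
    return io_uring_submit(ring);
}
```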
lukeh
5 hours ago
async/await io_uring wrappers for languages such as Swift [1] and Rust [2][3] can improve usability considerably. I'm not super familiar with the Rust wrappers, but I've been using IORingSwift for socket, file and serial I/O for some time now.
[1] https://github.com/PADL/IORingSwift [2] https://github.com/bytedance/monoio [3] https://github.com/tokio-rs/tokio-uring
Diggsey
an hour ago
Historically there have been too many constraints on the Linux syscall interface:
- Performance
- Stability
- Convenience
- Security
This differs from e.g. Windows, where the stable interface to the OS is in user-space, not tied to the syscall boundary. The combination has resulted in unfortunate compromises in the design of various pieces of OS functionality.
Thankfully, things like futex and io_uring have dropped the "convenience" constraint from the syscall itself and moved it into user-space. Convenience is still important, but it doesn't need to be a constraint at the lowest level, and it shouldn't compromise the other ideals.
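For instance, futex deliberately ships no convenient libc wrapper - the raw syscall is the stable surface, and locks are built on it in user-space. A minimal sketch (error handling elided):

```c
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sleep while *addr == expected; wake one sleeper. Everything convenient
 * (mutexes, condvars) is layered on these two calls in user-space. */
static long futex_wait(uint32_t *addr, uint32_t expected)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

static long futex_wake_one(uint32_t *addr)
{
    return syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```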
modeless
3 hours ago
Seems to me that the real problem is the 1500 byte MTU that hasn't increased in practice in over 40 years.
j16sdiz
an hour ago
The real problem is the so-called "sysadmins" who drop all ICMP, breaking path MTU discovery.
p_l
an hour ago
For all practical purposes, the internet MTU is lower than ethernet default MTU.
Sometimes, for peace of mind, I end up clamping it to the v6 minimum (1280), just in case.
asmor
3 hours ago
That's on the list, right after we all migrate to IPv6.
SomaticPirate
6 hours ago
What is GSO?
jesperwe
6 hours ago
Generic Segmentation Offload
"GSO gains performance by enabling upper layer applications to process a smaller number of large packets (e.g. MTU size of 64KB), instead of processing higher numbers of small packets (e.g. MTU size of 1500B), thus reducing per-packet overhead."
underdeserver
2 hours ago
This is more the result.
Generally today an Ethernet frame, the basic atomic unit of information over the wire, is limited to 1500 bytes of payload (the MTU, or Maximum Transmission Unit).
If you want to send more - the IP layer allows for 64KB per IP packet - you need to split the data across multiple frames (64KB / 1500, so roughly 44 of them, plus some header overhead). This is called segmentation.
Before GSO the kernel did that segmentation early, which costs buffering and CPU time to assemble all the frame headers. GSO defers it as late as possible, ideally handing one large buffer to the Ethernet hardware, which does the same splitting hardware-accelerated without taking up a CPU core.
throwaway8481
6 hours ago
Generic Segmentation Offload
https://www.kernel.org/doc/html/latest/networking/segmentati...
chaboud
6 hours ago
Likely Generic Segmentation Offload (if memory serves), which is a generalization of TCP segmentation offload.
Basically (hyper simple), the kernel can lump stuff together when working with the network interface, which cuts down on ultra slow hardware interactions.
raggi
6 hours ago
it was originally for the hardware, but it's also valuable on the software side, as the cost of syscalls is far too high for packet-sized transactions
thorncorona
6 hours ago
presumably generic segmentation offloading
USiBqidmOOkAqRb
2 hours ago
Shipping? Government services online? Piedmont airport? Alcoholics anonymous? Obviously not.
Please introduce your initialisms if it's not guaranteed that the first search result will be correct.
cookiengineer
6 hours ago
Say what you want, but I bet we'll see lots of eBPF modules being loaded in the future for the very reason you're describing. An eBPF QUIC module? Why not!
And that scares me, because there's not a single tool that has this on its radar for malware detection/prevention.
raggi
6 hours ago
we can consider ebpf "a solution" when there's even a remote chance you'll be able to do it from an unentitled ios app. that's somewhat hyperbole, but the point is that this problem is a problem for userspace client applications, and bpf isn't a particularly "good" solution for servers either: it's a high cost of authorship for a problem that is easily solvable with a better API to the network stack.
mgaunard
2 hours ago
eBPF is Linux technology; you will never be able to do it from iOS.
quotemstr
4 hours ago
> Yes, uring is fancy, but there’s a tutorial level API middle ground possible that should be safe and 10x less overhead without resorting to uring level complexity.
And the kernel has no business providing this middle-layer API. Why should it? Let people grab whatever they need from the ecosystem. Networking should be like Vulkan: a high-performance, flexible API at the systems level, with "easy to use" as a non-goal - and higher-level facilities built on top.
astrange
27 minutes ago
The kernel provides networking because it doesn't trust userspace to do it. If you provided a low-level networking API, you'd have to verify that everything a client sends is not malicious or pretending to be from another process. And for the same reason, it'd only work for transmission, not receiving.
That, and nobody was able to get performant microkernels working at the time, so we ended up with everything in the monolithic kernel.
If you do trust the client processes then it could be better to just have them read/write IP packets though.