raggi
10 months ago
There are a number of concrete problems:
- syscall interfaces are a mess, the primitive APIs are too slow for regular-sized packets (~1500 bytes), and the overhead is too high. GSO helps, but it's a horrible API, and it has been buggy even recently due to complexity and poor code standards.
- the syscall costs got even higher with Spectre mitigations, and this story likely isn't over. We need a replacement for the BSD sockets / POSIX APIs; they're terrible this decade. Yes, uring is fancy, but there's a tutorial-level API middle ground possible that should be safe and 10x less overhead without resorting to uring-level complexity.
- system UDP buffers are far too small by default; they're much, much smaller than their TCP siblings. Essentially no one but experts has been using them, and experts just retune stuff.
- udp stack optimizations are possible (such as possible route lookup reuse without connect(2)), gso demonstrates this, though as noted above gso is highly fallible, quite expensive itself, and the design is wholly unnecessarily intricate for what we need, particularly as we want to do this safely from unprivileged userspace.
- several optimizations currently available only work at low/mid-scale, such as connect binding to (potentially) avoid route lookups / GSO only being applicable on a socket without high peer-competition (competing peers result in short offload chains due to single-peer constraints, eroding the overhead wins).
Despite all this, you can implement GSO and get substantial performance improvements; we (tailscale) have on Linux (a minimal sketch follows below). At some point platforms will need to increase platform-side buffer sizes for lower-end systems, high load/concurrency, BDP and so on, but buffers and congestion control are a highly complex and sometimes quite sensitive topic. Nonetheless, once many applications are doing this (the presumed future state), the need will be there.
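For the unfamiliar, a minimal sketch of UDP GSO via Linux's `UDP_SEGMENT` socket option (Linux >= 4.18); values are illustrative and error handling is elided:

```c
/* Minimal sketch of UDP GSO via Linux's UDP_SEGMENT socket option.
 * Error handling elided; addresses and sizes are illustrative. */
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103               /* from <linux/udp.h> */
#endif

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    /* Ask the stack to slice each large send into 1200-byte datagrams
     * (a typical QUIC payload size) on the way down. */
    int gso_size = 1200;
    setsockopt(fd, IPPROTO_UDP, UDP_SEGMENT, &gso_size, sizeof(gso_size));

    struct sockaddr_in dst = {
        .sin_family = AF_INET,
        .sin_port = htons(4433),
        .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
    };
    connect(fd, (struct sockaddr *)&dst, sizeof(dst));

    /* One syscall; the kernel (or NIC) emits 50 separate datagrams. */
    char buf[50 * 1200] = { 0 };
    send(fd, buf, sizeof(buf), 0);

    close(fd);
    return 0;
}
```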
JoshTriplett
10 months ago
> Yes, uring is fancy, but there's a tutorial-level API middle ground possible that should be safe and 10x less overhead without resorting to uring-level complexity.
I don't think io_uring is as complex as its reputation suggests. I don't think we need a substantially simpler low-level API; I think we need more high-level APIs built on top of io_uring. (That will also help with portability: we need APIs that can be most efficiently implemented atop io_uring but that work on non-Linux systems.)
raggi
10 months ago
> I don't think io_uring is as complex as its reputation suggests.
uring is extremely problematic to integrate into many common application / language runtimes and it has been demonstrably difficult to integrate into linux safely and correctly as well, with a continual stream of bugs, security and policy control issues.
in principle a shared memory queue is a reasonable basis for improving the IO cost between applications and IO stacks such as the network or filesystem stacks, but this isn't easy to do well, cf. uring bugs and binder bugs.
arghwhat
10 months ago
Two things:
One, uring is not extremely problematic to integrate, as it can be chained into a conventional event loop if you want to, or can even be fit into a conventionally blocking design to get localized syscall benefits. That is, you do not need to convert to a fully uring event loop design, even if that would be superior - and it can usually be kept entirely within a (slightly modified) event loop abstraction. The reason it has not yet been implemented is just priority - most stuff isn't bottlenecked on IOPS.
Two, yes, you could have a middle ground. I assume the syscall overhead you call out is the need to send UDP packets one at a time through sendmsg/sendto, rather than doing one big write for several packets' worth of data on TCP. An API that allowed you to provide a chain of messages, like sendmsg takes an iovec for data, is possible. But it's also possible to do this already as a tiny blocking wrapper around io_uring, saving you new syscalls.
Veserv
10 months ago
The system call to send multiple UDP packets in a single call has existed since Linux 3.0 over a decade ago[1]: sendmmsg().
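For reference, a minimal sketch of batching datagrams with sendmmsg(2); the packet count and size are illustrative and error handling is elided:

```c
/* Sketch: batching several UDP datagrams into one sendmmsg(2) call. */
#define _GNU_SOURCE
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

#define NPKTS 8

int send_batch(int fd, struct sockaddr_in *dst) {
    static char payload[NPKTS][1200];
    struct iovec iov[NPKTS];
    struct mmsghdr msgs[NPKTS];
    memset(msgs, 0, sizeof(msgs));

    for (int i = 0; i < NPKTS; i++) {
        iov[i].iov_base = payload[i];
        iov[i].iov_len = sizeof(payload[i]);
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
        /* On an unconnected socket, msg_name may differ per message. */
        msgs[i].msg_hdr.msg_name = dst;
        msgs[i].msg_hdr.msg_namelen = sizeof(*dst);
    }

    /* One syscall for all NPKTS datagrams; returns the number sent. */
    return sendmmsg(fd, msgs, NPKTS, 0);
}
```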
arghwhat
10 months ago
Ah nice, in that case OP's point about syscall overhead is entirely moot. :)
That should really be in the `SEE ALSO` of `man 2 sendmsg`...
wtarreau
10 months ago
There's still the problem of sending to multiple destinations: OK sendmmsg() can send multiple datagrams, but for a given socket. When you have small windows (thank you, cubic), you'll just send a few datagrams this way and won't save much.
arghwhat
10 months ago
> There's still the problem of sending to multiple destinations: OK sendmmsg() can send multiple datagrams, but for a given socket.
Hmm? sendmsg takes the destination address in the `struct msghdr` structure, and sendmmsg takes an array of those structures.
At the same time, the discussion of efficiency is about UDP vs. TCP. TCP writes are per socket, to the connected peer, and so UDP has the upper hand here. The concern was about how TCP allows giving a large buffer to the kernel in a single write that then gets sliced into smaller packets automatically, vs. having to slice it in userspace and call send more times, which sendmmsg solves.
(You can of course do single-syscall or even zero-syscall "send to many" with io_uring for any socket type, but that's a different discussion.)
wtarreau
10 months ago
> > There's still the problem of sending to multiple destinations: OK sendmmsg() can send multiple datagrams, but for a given socket.
> Hmm? sendmsg takes the destination address in the `struct msghdr` structure, and sendmmsg takes an array of those structures.
But that's still pointless on a connected socket. And if you're not using connected sockets, you're performing destination lookups for each and every datagram you send. It also means you're running with small buffers by default (the 212kB default buffer per socket is shared across all your destinations, no longer per destination). Thus you normally want to use connected sockets when dealing with UDP in environments with performance requirements.
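A minimal sketch of that connected-UDP pattern (one fd per peer; error paths abbreviated):

```c
/* connect(2) on a UDP socket sends no packet; it just fixes the peer
 * so the kernel can cache the route lookup, and buys a per-peer
 * socket buffer as a side effect. */
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int udp_connect(const struct sockaddr_in *peer) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    if (connect(fd, (const struct sockaddr *)peer, sizeof(*peer)) < 0) {
        close(fd);
        return -1;
    }
    /* send(2)/write(2) now work without a per-call address. */
    return fd;
}
```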
evntdrvn
10 months ago
patches welcome :p
johnp_
10 months ago
Looks like Mozilla is currently working on implementing `sendmmsg` and `recvmmsg` use in neqo (Mozilla's QUIC implementation) [1].
justincormack
10 months ago
At one point, if I remember correctly, it didn't actually work: it still just sent one message at a time and returned the length of the first piece of the iovec. Hopefully it got fixed.
londons_explore
10 months ago
I think you need to look at a common use case and consider how many syscalls you'd like it to take and how many CPU cycles would be reasonable.
Let's take downloading a 1MB jpeg image over QUIC and rendering it on the screen.
I would hope that can be done in about 100k CPU cycles and 20 syscalls, considering that all the jpeg decoding and rendering is going to be hardware accelerated. The decryption is also hardware accelerated.
Unfortunately, no network API allows that right now. The CPU needs to do a substantial amount of processing for every individual packet, in both userspace and kernel space, for receiving the packet and sending the ACK, and there is no 'bulk decrypt' non-blocking API.
Even the data path is troublesome - there should be a way for the data to go straight from the network card to the GPU, with the CPU not even touching it, but we're far from that.
arghwhat
10 months ago
There's a few issues here.
1. A 1 MB file is at the very least 64 individually encrypted TLS records (16k max size) sent in sequence, possibly more. So decryption 64 times is the maximum amount of bulk work you can do - this is done to allow streaming verification and decryption in parallel with the download, whereas one big block would have you wait for the very last byte before any processing could start.
2. TLS is still userspace and decryption does not involve the kernel, and thus no syscalls. The benefits of kernel TLS largely focus on servers sending files straight from disk, bypassing userspace for the entire data processing path. This is not really relevant receive-side for something you are actively decoding.
3. JPEG is, to my knowledge, rarely hardware offloaded on desktop, so no syscalls there.
Now, the number of actual syscalls ends up being dictated by the speed of the sender and the tunable receive buffer size. The slower the sender, the more kernel roundtrips you end up with, which allows you to amortize the processing over a longer period so everything is ready when the last packet is. For a fast enough sender with big enough receive buffers, this could be a single kernel roundtrip.
miohtama
10 months ago
JPEG is not a particularly great example. However, most video streams are partially hardware decoded. Usually you still need to decode part of the stream, namely entropy coding and metadata, first on the CPU.
immibis
10 months ago
This system call you're asking for already exists - it's called sendmmsg. There is also recvmmsg.
jeffparsons
10 months ago
I find this surprising, given that my initial response to reading the iouring design was:
1. This is pretty clean and straightforward.
2. This is obviously what we need to decouple a bunch of things without the previous downsides.
What has made it so hard to integrate it into common language runtimes? Do you have examples of where there's been an irreconcilable "impedance mismatch"?
raggi
10 months ago
https://github.com/tailscale/tailscale/pull/2370 was a practical drive toward this; it will not proceed on this path.
much more approachable: boats has written about the challenges of integrating it in Rust: https://without.boats/tags/io-uring/
in the most general form: you need a fairly "loose" memory model to integrate the "best" (performance-wise) parts, and the "best" (ease of use / forward-looking safety) way to integrate requires C library linkage. This is troublesome in most GC languages and many managed runtimes. There's also the issue that uring being non-portable means that the things it suggests you must do (such as pinning a buffer pool and making APIs like read not immediate caller-allocates) require a substantially separate API for this platform than for others, or at least substantial reworks of all the existing POSIX-modeled APIs - thus back to what I said originally, we need a replacement for POSIX & BSD here, broadly applied.
gpderetta
10 months ago
I can see how a zero-copy API would be hard to implement in some languages, but you could still implement something on top of io_uring with POSIX buffer-copy semantics, while using batching to decrease syscall overhead.
Zero-copy APIs will necessarily be tricky to implement and use, especially in memory-safe languages.
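A sketch of that idea, assuming liburing's public helpers: the caller keeps plain owned buffers, but N sends get queued and submitted with a single syscall. One caveat: even with copy semantics at the API surface, each buffer must stay valid until its completion is reaped.

```c
/* Batched sends on one socket via liburing; not a complete event loop. */
#include <liburing.h>

int send_batch(struct io_uring *ring, int fd, char bufs[][1200], int n) {
    int queued = 0;
    for (int i = 0; i < n; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        if (!sqe)
            break;                      /* submission queue full */
        io_uring_prep_send(sqe, fd, bufs[i], 1200, 0);
        queued++;
    }
    /* One io_uring_enter(2) for the whole batch. */
    return io_uring_submit(ring) == queued ? queued : -1;
}
```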
gmokki
10 months ago
I think most GC languages support native/pinned memory (at least Java and C# do) to support talking to the kernel or native libraries. The APIs are even quite nice.
neonsunset
10 months ago
Java's off-heap memory and memory segment API is quite dreadful and on the slower side. C# otoh gives you easy and cheap object pinning, malloc/free and stack-allocated buffers.
asveikau
10 months ago
I read the oldest of those blog posts the closest.
Seems like the author points out two things:
1. The lack of support in Rust futures for manual cancellation. That doesn't seem like an inevitable choice by Rust.
2. Sharing buffers with kernel mode. This is probably a bigger topic.
withoutboats3
10 months ago
Rust's async model can support io-uring fine, it just has to be a different API based on ownership instead of references. (That's the conclusion of my posts you link to.)
JoshTriplett
10 months ago
> with a continual stream of bugs, security and policy control issues
This has not been true for a long time. There was an early design mistake that made it quite prone to these, but that mistake has been fixed. Unfortunately, the reputational damage will stick around for a while.
raggi
10 months ago
13 CVEs so far this year afaik
bonzini
10 months ago
CVE numbers from the Linux CNA are bollocks.
JoshTriplett
10 months ago
This conversation would be a good one to point them to, to show that their policy is not just harmless point-proving, but in fact does cause harm.
For context, to the best of my knowledge the current approach of the Linux CNA is, in keeping with long-standing Linux security policy of "every single fix might be a security fix", to assign CVEs regardless of whether something has any security impact or not.
kuschku
10 months ago
CVE assignment != security issue
CVE numbers are just a way to ensure everyone is talking about the same bug. Not every security issue has a CVE, not every CVE is a security issue.
Often, a regular bug turns out years later to have been a security issue, or a security issue turns out to have no security impact at all.
If you want a central authority to tell you what to think, just use CVSS instead of the binary "does it have a CVE" metric.
simiones
10 months ago
This is completely false. The CVE website defines these very clearly:
> The mission of the CVE® Program is to identify, define, and catalog publicly disclosed cybersecurity vulnerabilities [emphasis mine].
In fact, CVE stands for "Common Vulnerabilities and Exposures", again showing that CVE == security issue.
It's of course true that just because your code has an unpatched CVE doesn't automatically mean that your system is vulnerable - other mitigations can be in place to protect it.
kuschku
10 months ago
That's the modern definition, which is rewriting history. Let's look at the actual, original definition:
> The CVE list aspires to describe and name all publicly known facts about computer systems that could allow somebody to violate a reasonable security policy for that system
There's also a decision from the editorial board on this, which said:
> Discussions on the Editorial Board mailing list and during the CVE Review meetings indicate that there is no definition for a "vulnerability" that is acceptable to the entire community. At least two different definitions of vulnerability have arisen and been discussed. There appears to be a universally accepted, historically grounded, "core" definition which deals primarily with specific flaws that directly allow some compromise of the system (a "universal" definition). A broader definition includes problems that don't directly allow compromise, but could be an important component of a successful attack, and are a violation of some security policies (a "contingent" definition).
> In accordance with the original stated requirements for the CVE, the CVE should remain independent of multiple perspectives. Since the definition of "vulnerability" varies so widely depending on context and policy, the CVE should avoid imposing an overly restrictive perspective on the vulnerability definition itself.
For more details, see https://web.archive.org/web/20000526190637fw_/http://www.cve... and https://web.archive.org/web/20020617142755/http://cve.mitre....
Under this definition, any kernel bug that could lead to user-space software acting differently is a CVE. Similarly, all memory management bugs in the kernel justify a CVE, as they could be used as part of an exploit.
simiones
10 months ago
Those two links say that CVEs can be one of two categories: universal vulnerabilities or exposures. But the examples of exposures are not, in any way, "any bug in the kernel". They give specific examples of things which are known to make a system more vulnerable to attack, even if not everyone would agree that they are a problem.
So yes, any CVE is supposed to be a security problem, and it has always been so. Maybe not for your specific system or for your specific security posture, but for someone's.
Extending this to any bugfix is a serious misunderstanding of what an "exposure" means, and it is a serious difference from other CNAs. Linux CNA-assigned CVEs just can't be taken as seriously as normal CNAs.
wtarreau
10 months ago
Nowadays the vast majority of CVEs have nothing to do with security; they're just Curriculum Vitae Enhancers, i.e. a student finding that "with my discovery, if A, B, C and D were granted, I could possibly gain some privileges", despite A/B/C/D being mutually exclusive. Sorting out that garbage is everyday work for security people. So what the kernel does is not worse at all.
skywhopper
10 months ago
That’s definitely not the understanding that literally anyone outside the Linux team has of what a CVE is, including the people who came up with them and run the database. Overloading a well-established mechanism for communicating security issues to be just a registry of Linux bugs is an abuse of an important shared resource. Sure, "anything could be a security issue," but in practice most bugs aren't, and putting meaningless bugs into the international security issue database is just a waste of everyone's time and energy to make a very stupid point.
kuschku
10 months ago
> including the people who came up with them
How do you figure that? The original definition of CVE is exactly the same as how Linux approaches it.
Sure, in recent years security consultants have been overloading CVE to mean something else, but that's something to fix, not to keep.
frankjr
10 months ago
Can you post the original definition?
vel0city
10 months ago
Common Vulnerabilities and Exposures
frankjr
10 months ago
Right but I was hoping for a definition which supports OP's claim that "CVE assignment != security issue".
kuschku
10 months ago
Then check out these definitions, from 2000, defined by the CVE editorial board:
> The CVE list aspires to describe and name all publicly known facts about computer systems that could allow somebody to violate a reasonable security policy for that system
As well as:
> Discussions on the Editorial Board mailing list and during the CVE Review meetings indicate that there is no definition for a "vulnerability" that is acceptable to the entire community. At least two different definitions of vulnerability have arisen and been discussed. There appears to be a universally accepted, historically grounded, "core" definition which deals primarily with specific flaws that directly allow some compromise of the system (a "universal" definition). A broader definition includes problems that don't directly allow compromise, but could be an important component of a successful attack, and are a violation of some security policies (a "contingent" definition).
> In accordance with the original stated requirements for the CVE, the CVE should remain independent of multiple perspectives. Since the definition of "vulnerability" varies so widely depending on context and policy, the CVE should avoid imposing an overly restrictive perspective on the vulnerability definition itself.
Under this definition, any kernel bug that could lead to user-space software acting differently is a CVE. Similarly, all memory management bugs in the kernel justify a CVE, as they could be used as part of an exploit.
frankjr
10 months ago
> to violate a reasonable security policy for that system
> with specific flaws that directly allow some compromise of the system
> important component of a successful attack, and are a violation of some security policies
All of these are talking about security issues, not "acting differently".
kuschku
10 months ago
> important component of a successful attack, and are a violation of some security policies
If the kernel returned random values from gettime, that'd lead to TLS certificate validation not being reliable anymore. As a result, any bug in gettime is certainly worthy of a CVE.
If the kernel shuffled filenames so they'd be returned backwards, apparmor and selinux profiles would break. As a result, that'd be worthy of a CVE.
If the kernel has a memory corruption, use after free, use of uninitialized memory or refcounting issue, that's obviously a violation of security best practices and can be used as component in an exploit chain.
Can you now see how almost every kernel bug can and most certainly will be turned into a security issue at some point?
josefx
10 months ago
> All of these are talking about security issues, not "acting differently".
Because no system has ever been taken down by code that behaved differently from what it was expected to do? Right? Like HTTP desync attacks, SQL escape bypasses, ... Absolutely no security issue is going to be caused by a very minor and by itself very secure difference in behavior.
cryptonector
10 months ago
> that could allow somebody to violate a reasonable security policy for that system
That's "security bug". Please stop saying it's not.
kuschku
10 months ago
As detailed in my sibling reply, by definition that includes any bug in gettime (as that'd affect TLS certificate validation), any bug in a filesystem (as that'd affect loading of selinux/apparmor profiles), any bug in eBPF (as that'd affect network filtering), etc.
Additionally, any security bug in the kernel itself, so any use after free, any refcounting bug, any use of uninitialized memory.
Can you now see why pretty much every kernel bug fulfills that definition?
di4na
10 months ago
I would not call it harm. The use of uring in higher level languages is definitely prone to errors, bugs and security problems
JoshTriplett
10 months ago
See the context I added to that comment; this is not about security issues, it's about the Linux CNA's absurd approach to CVE assignment for things that aren't CVEs.
tialaramex
10 months ago
I don't agree that it's absurd. I would say it reflects a proper understanding of their situation.
You've doubtless heard Tony Hoare's "There are two ways to write code: write code so simple there are obviously no bugs in it, or write code so complex that there are no obvious bugs in it." Linux is definitely in the latter category; it's now such a sprawling system that determining whether a bug "really" has security implications is no longer a reasonable task compared to just fixing the bug.
The other reason is that Linux is so widely used that almost no assumption made to simplify that above task is definitely correct.
JoshTriplett
10 months ago
That's fine, except that it is thus no longer meaningful to compare CVE count.
hifromwork
10 months ago
I like CVEs, I think the Linux approach to CVEs is stupid, but also it was never meaningful to compare CVE counts. But I guess it's hard to make people stop doing that, and that's the reason Linux does what it does out of spite.
immibis
10 months ago
As I understand it, they adopted this policy because the other policy was also causing harm.
They are right, by the way. When CVEs were used for things like Heartbleed they made sense - you could point to Heartbleed's CVE number and query various information systems about vulnerable systems. When every single possible security fix gets one, AND automated systems are checking that you've patched every single one or else you fail the audit (even ones completely irrelevant to the system, like RCE on an embedded device with no internet access), the system is not doing anything useful - it's deleting value from the world and must be repaired or destroyed.
hifromwork
10 months ago
The problem here are the automated systems and braindead auditors, not the CVE system itself.
immibis
10 months ago
Well, the CVE system itself is only about assigning identifiers, and assigning identifiers unnecessarily couldn't possibly hurt anyone who isn't misusing the system, unless they're running out of identifiers.
raggi
10 months ago
this is a bit of a distraction; sure, the leaks and some of the deadlocks are fairly uninteresting, but the TOCTOU bugs, overflows, uid race/confusion and so on are real issues that shouldn't be dismissed as if they don't exist.
anarazel
10 months ago
FWIW, the biggest problem I've seen with efficiently using io_uring for networking is that none of the popular TLS libraries have a buffer ownership model that really is suitable for asynchronous network IO.
What you'd want is the ability to control the buffer for the "raw network side", so that asynchronous network IO can be performed without having to copy between a raw network buffer and buffers owned by the TLS library.
It also would really help if TLS libraries supported processing multiple TLS records in a batched fashion. Doing roundtrips between app <-> tls library <-> userspace network buffer <-> kernel <-> HW for every 16kB isn't exactly efficient.
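For illustration, the closest workaround today in OpenSSL terms is a memory-BIO pair, which at least puts the ciphertext in application-owned buffers that asynchronous IO can drive, though it keeps exactly the extra copy being complained about (a sketch; error handling elided):

```c
/* Detach TLS from the socket with a memory-BIO pair so the app owns
 * the raw network side and can feed it from io_uring completions. */
#include <openssl/bio.h>
#include <openssl/ssl.h>

int detach_tls(SSL *ssl, BIO **net) {
    BIO *internal;
    if (!BIO_new_bio_pair(&internal, 0 /* default bufsize */, net, 0))
        return -1;
    SSL_set_bio(ssl, internal, internal);  /* SSL now owns `internal` */

    /* The event loop then shuttles bytes per TLS record:
     *   ciphertext from socket -> BIO_write(*net, rxbuf, n)
     *   plaintext out          <- SSL_read(ssl, pbuf, m)
     *   ciphertext to socket   <- BIO_read(*net, txbuf, k)
     */
    return 0;
}
```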
lukeh
10 months ago
async/await io_uring wrappers for languages such as Swift [1] and Rust [2] [3] can improve usability considerably. I'm not super familiar with the Rust wrappers, but I've been using IORingSwift for socket, file and serial I/O for some time now.
[1] https://github.com/PADL/IORingSwift [2] https://github.com/bytedance/monoio [3] https://github.com/tokio-rs/tokio-uring
amluto
10 months ago
Hi, Tailscale person! If you want a fairly straightforward improvement you could make: Tailscale, by default, uses SOCK_RAW. And having any raw socket listening at all hurts receive performance systemwide:
https://lore.kernel.org/all/CALCETrVJqj1JJmHJhMoZ3Fuj685Unf=...
It shouldn’t be particularly hard to port over the optimization that prevents this problem for SOCK_PACKET. I’ll get to it eventually (might be quite a while), but I only care about this because of Tailscale, and I don’t have a ton of bandwidth.
bradfitz
10 months ago
BTW, that code changed just recently:
https://github.com/tailscale/tailscale/commit/1c972bc7cbebfc...
It's now an AF_PACKET/SOCK_DGRAM fd, as it was originally meant to be.
raggi
10 months ago
Very interesting, thank you. We’ll take a look at this!
Diggsey
10 months ago
Historically there have been too many constraints on the Linux syscall interface:
- Performance
- Stability
- Convenience
- Security
This differs from e.g. Windows because on Windows the stable interface to the OS is in user-space, not tied to the syscall boundary. This has resulted in unfortunate compromises in the design of various pieces of OS functionality.
Thankfully things like futex and io-uring have dropped the "convenience" constraint from the syscall itself and moved it into user-space. Convenience is still important, but it doesn't need to be a constraint at the lowest level, and shouldn't compromise the other ideals.
modeless
10 months ago
Seems to me that the real problem is the 1500 byte MTU that hasn't increased in practice in over 40 years.
throw0101c
10 months ago
> Seems to me that the real problem is the 1500 byte MTU that hasn't increased in practice in over 40 years.
As per a sibling comment, 1500 is just for Ethernet (the default, jumbo frames being able to go to (at least) 9000). But the Internet is more than just Ethernet.
If you're on DSL, then RFC 2516 states that PPPoE's MTU is 1492 (and you probably want an MSS of 1452). The PPP, L2TP, and ATM AAL5 standards all have 16-bit length fields allowing for packets up 64k in length. GPON ONT MTU is 2000. The default MTU for LTE is 1428. If you're on an HPC cluster, there's a good chance you're using Infiniband, which goes to 4096.
What size do you suggest everyone on the planet go to? Who exactly is going to get everyone to switch to the new value?
fallingsquirrel
10 months ago
> What size do you suggest everyone on the planet go to?
65536
> Who exactly is going to get everyone to switch to the new value?
The same people who got everyone to switch to IPv6. It's a missed opportunity that these migrations weren't done at the same time imho.
It'll take a few decades, sure, but that's how big migrations go. What's the alternative? Making no improvements at all, forever?
0xbadcafebee
10 months ago
> got everyone to switch to IPv6
I have some bad news...
> What's the alternative? Making no improvements at all, forever?
No, sadly. The alternative is what the entire tech world has been doing for the past 15 years: shove "improvements" inside whatever crap we already have because nobody wants to replace the crap.
If IPv6 were made today, it would be tunneled inside an HTTP connection. All the new apps would adopt it, the legacy apps would be abandoned or have shims made, and the whole thing would be inefficient and buggy, but adopted. Since poking my head outside of the tech world and into the wider world, it turns out this is how most of the world works.
MerManMaid
10 months ago
>If IPv6 were made today, it would be tunneled inside an HTTP connection. All the new apps would adopt it, the legacy apps would be abandoned or have shims made, and the whole thing would be inefficient and buggy, but adopted. Since poking my head outside of the tech world and into the wider world, it turns out this is how most of the world works.
What you're suggesting here wouldn't work, wrapping all the addressing information inside HTTP which relies on IP for delivery does not work. It would be the equivalent of sealing all the addressing information for a letter you'd like to send inside the envelope.
throw0101c
10 months ago
> If IPv6 were made today, it would be tunneled inside an HTTP connection.
Given that one of the primary goals of IPv6 was increased address space, how would putting IPv6 in an HTTP connection riding over IPv4 solve that?
0xbadcafebee
10 months ago
Providers would just do Carrier-grade NAT (as they do today) or another wonky solution with a tunnel into different networks as needed. IPv6 is still useful in different circumstances, particularly creating larger private networks. They could basically reimplement WireGuard, with the VPN software doubling as IPv6 router and interface provider. I'm not saying this is a great idea, but it is definitely what someone today would have done (with HTTP as the transport method) if IPv6 didn't exist.
Hikikomori
10 months ago
The internet is mostly ethernet these days (ISP core/edge), last mile connections like DSL and cable already handle a smaller MTU so should be fine with a bigger one.
throw0101c
10 months ago
> The internet is mostly ethernet these days […]
Except for the bajillion mobile devices in people's pockets/purses.
cesarb
10 months ago
> The internet is mostly ethernet these days (ISP core/edge),
A lot of that ISP edge is CPEs with WiFi, which AFAIK limits the MTU to 2304 bytes.
asmor
10 months ago
That's on the list right after we all migrate to IPv6.
j16sdiz
10 months ago
The real problem is some so-called "sysadmins" dropping all ICMP, breaking path MTU discovery.
icedchai
10 months ago
The most secure network is one that doesn't pass any traffic at all. ;)
cryptonector
10 months ago
That's why PMTUD (P for Passive) exists.
p_l
10 months ago
For all practical purposes, the internet MTU is lower than ethernet default MTU.
Sometimes for ease of mind I end up clamping it to the v6 minimum (1280), just in case.
quotemstr
10 months ago
> Yes, uring is fancy, but there's a tutorial-level API middle ground possible that should be safe and 10x less overhead without resorting to uring-level complexity.
And the kernel has no business providing this middle-layer API. Why should it? Let people grab whatever they need from the ecosystem. Networking should be like Vulkan: it should have a high-performance, flexible API at the systems level, with "easy to use" being a non-goal, and higher-level facilities on top.
astrange
10 months ago
The kernel provides networking because it doesn't trust userspace to do it. If you provided a low level networking API you'd have to verify everything a client sends is not malicious or pretending to be from another process. And for the same reason, it'd only work for transmission, not receiving.
That and nobody was able to get performant microkernels working at the time, so we ended up with everything in the monokernel.
If you do trust the client processes then it could be better to just have them read/write IP packets though.
namibj
10 months ago
Also, it is really easy to implement the normal IO "syscall wrappers" on top of io_uring instead, even exposing a very simple async/await variant that splits the "block on completion (after which, just like normal IO, the data buffer has been copied into kernel space)" step from the rest of the normal IO syscall, which allows pipelining and coalescing of requests.
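A minimal sketch of such a wrapper, assuming liburing: a blocking read(2) lookalike that submits one SQE and waits for its CQE (a real implementation would batch and reuse the ring across calls):

```c
#include <liburing.h>
#include <sys/types.h>

ssize_t uring_read(struct io_uring *ring, int fd, void *buf, unsigned len) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (!sqe)
        return -1;
    io_uring_prep_read(sqe, fd, buf, len, -1 /* use current offset */);

    struct io_uring_cqe *cqe;
    io_uring_submit_and_wait(ring, 1);    /* submit + wait, one syscall */
    if (io_uring_wait_cqe(ring, &cqe))
        return -1;
    ssize_t res = cqe->res;                /* byte count or -errno */
    io_uring_cqe_seen(ring, cqe);
    return res;
}
```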
SomaticPirate
10 months ago
What is GSO?
jesperwe
10 months ago
Generic Segmentation Offload
"GSO gains performance by enabling upper layer applications to process a smaller number of large packets (e.g. MTU size of 64KB), instead of processing higher numbers of small packets (e.g. MTU size of 1500B), thus reducing per-packet overhead."
underdeserver
10 months ago
This is more the result.
Generally today an Ethernet frame, which is the basic atomic unit of information over the wire, is limited to 1500 bytes (the MTU, or Maximum Transmission Unit).
If you want to send more - the IP layer allows for 64k bytes per IP packet - you need to split the IP packet into multiple (64k / 1500 plus some header overhead) frames. This is called segmentation.
Before GSO, the kernel would do that, which takes buffering and CPU time to assemble the frame headers. GSO moves this to the Ethernet hardware, which does essentially the same thing, only hardware-accelerated and without taking up a CPU core.
wtarreau
10 months ago
What you're describing is for TCP. On TCP you can perform a write(64kB) and see the stack send it as 1460-byte segments. On UDP, if you write(64kB) you'll get a single 64kB packet composed of 45 fragments. Needless to say, it suffices for any one of them to be lost in a buffer somewhere for the whole packet to never be received and all of them to have to be retransmitted by the application layer.
GSO on UDP allows the application to send a large chunk of data, indicating the MTU to be applied, and lets the kernel pass it down the stack as-is, until the lowest layer that can split it (network stack, driver or hardware). In this case they will make packets, not fragments. On the wire there will really be independent datagrams with different IP IDs. In this case, if any of them is lost, the other ones are still received and the application can focus on retransmitting only the missing one(s). In terms of route lookups, it's as efficient as fragmentation (since there's a single lookup) but it will ensure that what is sent over the wire is usable all along the chain, at a much lower cost than it would be to make the application send all of them individually.
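A sketch of that per-send form, passing the segment size as a `UDP_SEGMENT` control message on a single sendmsg(2) so each call can use a different "MTU" (Linux-specific; error handling elided):

```c
#define _GNU_SOURCE
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_UDP
#define SOL_UDP 17                 /* from <linux/udp.h> */
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103            /* from <linux/udp.h> */
#endif

ssize_t send_gso(int fd, void *buf, size_t len, uint16_t seg_size) {
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    union {                         /* cmsg buffer, properly aligned */
        char buf[CMSG_SPACE(sizeof(uint16_t))];
        struct cmsghdr align;
    } ctrl;
    memset(&ctrl, 0, sizeof(ctrl));

    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
    };
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_UDP;
    cm->cmsg_type = UDP_SEGMENT;
    cm->cmsg_len = CMSG_LEN(sizeof(seg_size));
    memcpy(CMSG_DATA(cm), &seg_size, sizeof(seg_size));

    /* The stack emits ceil(len / seg_size) independent datagrams. */
    return sendmsg(fd, &msg, 0);
}
```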
chaboud
10 months ago
Likely Generic Segmentation Offload (if memory serves), which is a generalization of TCP segmentation offload.
Basically (hyper simple), the kernel can lump stuff together when working with the network interface, which cuts down on ultra slow hardware interactions.
raggi
10 months ago
it was originally for the hardware, but it's also valuable on the software side as the cost of syscalls is far too high for packet sized transactions
throwaway8481
10 months ago
Generic Segmentation Offload
https://www.kernel.org/doc/html/latest/networking/segmentati...
thorncorona
10 months ago
presumably generic segmentation offloading
USiBqidmOOkAqRb
10 months ago
Shipping? Government services online? Piedmont airport? Alcoholics anonymous? Obviously not.
Please introduce your initialisms if it's not guaranteed that the first result in a search will be correct.
mh-
10 months ago
> first result in a search will be correct
Searching for GSO network gives you the correct answer in the first result. I'd consider that condition met.
cryptonector
10 months ago
Of these the hardest one to deal with is route lookup caching and reuse w/o connect(2). Obviously the UDP connected TCB can cache that, but if you don't want a "connected" socket fd... then there's nowhere else to cache it except ancillary data, so ancillary data it would have to be. But getting return-to-sender ancillary data on every read (so as to be able to copy it to any sends back to the same peer) adds overhead, so that's not good.
A system call to get that ancillary data adds overhead that can be amortized by having the application cache it, so that's probably the right design, and if it could be combined with sending (so a new flavor of sendto(2)) that would be even better, and it all has to be uring-friendly.
wtarreau
10 months ago
The default UDP buffers of 212kB are indeed a big problem for every client at the moment. You can optimize your server as much as you want; all your clients will experience losses if they pause for half a millisecond to redraw a tab or update an image, just because the UDP buffers can only store so few packets. That's among the things that must urgently change if we want UDP to start to work well on end-user devices.
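A sketch of the usual client-side mitigation, requesting a larger receive buffer. The kernel silently caps the request at `net.core.rmem_max`, so deployments typically raise that sysctl too; sizes here are illustrative:

```c
#include <stdio.h>
#include <sys/socket.h>

void grow_rcvbuf(int fd) {
    int want = 4 * 1024 * 1024;   /* 4 MB of datagram headroom */
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want));

    int got = 0;
    socklen_t len = sizeof(got);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len);
    /* The kernel reports double the requested value to account for
     * its bookkeeping overhead; don't be surprised by the factor of 2. */
    printf("effective SO_RCVBUF: %d\n", got);
}
```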
cookiengineer
10 months ago
Say what you want but I bet we'll see lots of eBPF modules being loaded in the future for the very reason you're describing. An ebpf quic module? Why not!
And that scares me, because there's not a single tool that has this on its radar for malware detection/prevention.
raggi
10 months ago
we can consider ebpf "a solution" when there's even a remote chance you'll be able to do it from an unentitled iOS app. somewhat hyperbole, but the point is: this problem is a problem for userspace client applications, and bpf isn't a particularly "good" solution for servers either; it's a high cost of authorship for a problem that is easily solvable with a better API to the network stack.
mgaunard
10 months ago
eBPF is Linux technology; you will never be able to do it from iOS.
nly
10 months ago
Anyone who cares about performance is already using NIC-accelerated APIs like Onload or VFI.
leshow
10 months ago
which UDP settings do you usually tune?