hardwaresofton
9 months ago
Been doing some IPC experiments recently following the 3tilley post[0], because there just isn't enough definitive information (even if it's a snapshot in time) out there.
Shared memory is crazy fast, and I'm surprised that there aren't more things that take advantage of it. Super odd that gRPC doesn't do shared memory, and basically never plans to?[1].
All that said, the constructive criticism I can offer for this post is that in mass-consumption announcements like this one for your project, you should include:
- RPC throughput (with the usual caveats/disclaimers)
- A comparison (ideally graphed) to an alternative approach (ex. domain sockets)
- Your best/most concise & expressive usage snippet
100ns is great to know, but I would really like to know how many RPC/s this translates to without doing the math, or to see it with realistic de-serialization on the other end.
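(Back-of-the-envelope: 100ns per call works out to 1s / 100ns = ~10 million calls per second on a single busy core as a theoretical ceiling -- the realistic number with deserialization and actual work on the receiving end will be a good deal lower, which is exactly why the measured figure matters.)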
a_t48
9 months ago
In my experience shared memory is really hard to implement well and manage:
1. Unless you're using either fixed sized or specially allocated structures, you end up paying for serialization anyhow (zero copy is actually one copy).
2. There's no way to reference count the shared memory - if a reader crashes, it holds on to the memory it was reading. You can get around this with some form of watchdog process, or by other schemes with a side channel, but it's not "easy".
3. Similar to 2, if a writer crashes, it will leave behind junk in whatever filesystem you are using to hold the shared memory (one possible cleanup sweep is sketched after this list).
4. There's other separate questions around how to manage the shared memory segments you are using (one big ring buffer? a segment per message?), and how to communicate between processes that different segments are in use and that new messages are available for subscribers. Doable, but also not simple.
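For (2) and (3), the least-bad scheme I've seen on Linux is a watchdog that sweeps /dev/shm by naming convention. A rough sketch, assuming segments are named "<prefix><pid>" -- the naming scheme and watchdog here are hypothetical, not any particular library's approach:

    use std::fs;
    use std::path::Path;

    // Hypothetical watchdog sweep: assumes writers create segments named
    // "<prefix><pid>" in /dev/shm, so a stale segment can be detected by
    // checking whether its owning process is still alive. Removing the
    // file here is equivalent to shm_unlink on Linux.
    fn sweep_stale_segments(prefix: &str) -> std::io::Result<()> {
        for entry in fs::read_dir("/dev/shm")? {
            let entry = entry?;
            let name = entry.file_name().to_string_lossy().into_owned();
            if let Some(pid_str) = name.strip_prefix(prefix) {
                if let Ok(pid) = pid_str.parse::<u32>() {
                    // /proc/<pid> exists only while the process is alive.
                    if !Path::new(&format!("/proc/{pid}")).exists() {
                        fs::remove_file(entry.path())?;
                    }
                }
            }
        }
        Ok(())
    }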
It's a tough pill to swallow - you're taking on a lot of complexity in exchange for that low latency. If you can, it's better to put things in the same process space - you can use smart pointers and a queue and go just as fast, with less complexity. Anything CUDA will want to be single process anyhow (ignoring CUDA IPC). The number of places where you need (a) ultra low latency, (b) high bandwidth/message size, (c) can't put everything in the same process, (d) are using data structures suited to shared memory, and finally (e) are okay with taking on a bunch of complexity just isn't that high. (It's totally possible I'm missing a Linux feature that makes things easy, though.)
I plan on integrating iceoryx into a message passing framework I'm working on now (users will ask for SHM), but honestly either "shared pointers and a queue" or "TCP/UDS" is usually a better fit.
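By "shared pointers and a queue" I mean roughly this in-process pattern (a minimal sketch):

    use std::sync::{mpsc, Arc};

    // In-process alternative: share ownership of a large buffer via Arc
    // and only send the pointer over a channel - no copy, no serialization.
    fn main() {
        let (tx, rx) = mpsc::channel::<Arc<Vec<u8>>>();
        let frame = Arc::new(vec![0u8; 8 * 1024 * 1024]); // e.g. a camera frame
        tx.send(Arc::clone(&frame)).unwrap();
        let received = rx.recv().unwrap();
        assert_eq!(received.len(), frame.len()); // same allocation, refcounted
    }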
elBoberido
9 months ago
> In my experience shared memory is really hard to implement well and manage:
I second that. It took us quite some time to get the architecture right. After all, iceoryx2 is the third incarnation of this piece of software, with elfepiff and me working on the last two.
> 1. Unless you're using either fixed sized or specially allocated structures, you end up paying for serialization anyhow (zero copy is actually one copy).
Indeed, we are using fixed size structures with a bucket allocator. We have ideas on how to enable usage with types that support custom allocators, and even with raw pointers, but that is just a crazy idea which might not pan out.
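To make "fixed size structures" concrete, a shared-memory-compatible payload looks roughly like this -- a plain #[repr(C)] struct with inline arrays instead of heap-backed containers (an illustrative type, not an actual iceoryx2 one):

    // Hypothetical shared-memory-compatible message: fixed size, no heap
    // pointers, so it can live directly in the mapped segment and be
    // handed to a subscriber without serialization.
    #[repr(C)]
    pub struct LidarScan {
        pub timestamp_ns: u64,
        pub point_count: u32,
        pub points: [[f32; 3]; 1024], // fixed capacity instead of Vec<[f32; 3]>
    }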
> 2. There's no way to reference count the shared memory - if a reader crashes, it holds on to the memory it was reading. You can get around this with some form of watchdog process, or by other schemes with a side channel, but it's not "easy".
> 3. Similar to 2, if a writer crashes, it will leave behind junk in whatever filesystem you are using to hold the shared memory.
Indeed, this is a complicated topic and support from the OS would be appreciated. We found a few ways to make this feasible, though.
The origins of iceoryx are in automotive, where it is required to split functionality up into multiple processes. When one process goes down, the system can still operate in a degraded mode or just restart the faulty process. With this, one needs an efficient and low-latency solution, or else the CPU spends more time copying data than doing real work.
Of course there are issues like the producer mutating data after delivery, but there are also solutions for this. It will of course affect the latency but should still be better than using e.g. unix domain sockets.
Fun fact: for iceoryx1 we only supported memory chunks of up to 4GB, and some time ago someone came and asked if we could lift this limitation since he wanted to transfer a 92GB large language model via shared memory.
hardwaresofton
9 months ago
Thanks for sharing here -- yeah, these are definitely huge issues that make shared memory hard -- the when-things-go-wrong case is quite hairy.
I wonder if it would work well as a sort of opt-in specialization? Start with TCP/UDS/STDIN/whatever, and then maybe graduate, and if anything goes wrong, report errors via the fallback?
I do agree it's rarely worth it (and same-machine UDS is probably good enough), but with the essentially 10x gain, I'm quite surprised.
One thing I've also found that actually performed very well is ipc-channel[0]. I tried it because I wanted to see how something I might actually use would perform, and it was basically 1/10th the perf of shared memory.
a_t48
9 months ago
The other thing is that a 10x improvement on basically nothing is quite small. Whatever time it takes for a message to be processed is going to be dominated by actually consuming the message. If you have a great abstraction, cool - use it anyhow, but it's probably not worth developing a shared memory library yourself.
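(To put rough, purely illustrative numbers on it: if deserializing and handling a message takes ~50μs, cutting transport from ~10μs to ~1μs moves the end-to-end time from ~60μs to ~51μs -- about a 15% win, not 10x.)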
hardwaresofton
9 months ago
Agree, but it's also a question of where you start from -- 10x is a lot to give up, and knowing you're giving it up is pretty important.
That said, the people who built iceoryx2 obviously believe it's worth investing in, which is interesting.
a_t48
9 months ago
I'm glad the team is putting in the work to get it working for the rest of us.
abhirag
9 months ago
At $work we are evaluating different IPC strategies in Rust. My colleague expanded upon 3tilley's work; they have updated benchmarks with iceoryx2 included here[0]. I suppose the current release should perform even better.
nh2
9 months ago
Interesting that on Linux Unix Domain Sockets are not faster than TCP.
People often say that the TCP stack overhead is high but this benchmark does not confirm that.
jcelerier
9 months ago
I'm curious about the benchmark. In my own benchmarks for another network IPC library (https://GitHub.com/ossia/libossia), Unix sockets were consistently faster than the alternatives when sending the same payloads.
billywhizz
9 months ago
the linux results are for a vm running on macos. not sure how useful that is. i certainly wouldn't draw any wider conclusions from them without trying to reproduce yourself. pretty sure they will be very different on bare metal.
billywhizz
9 months ago
i couldn't resist reproducing on bare metal linux (8th gen core i5, ubuntu 22.04):
cargo run --release -- -n 1000000 --method unixstream
cargo run --release -- -n 1000000 --method tcp
~9μs/op for unixstream, ~14μs/op for tcp. unixstream utilizes two cores at ~78% each core, tcp only utilizes ~58% of each core. so there is also something wrong in the benchmarks where blocking is happening and cores are not being fully utilized.
pranitha_m
9 months ago
Our prod env consists of cloud VMs, so I tried to replicate that. I have some benchmarks from the prod env and will share those.
hardwaresofton
9 months ago
Excellent writeup! I performed just about the same test, but I didn't see 13M rps in my testing; shared memory went up to about 1M.
That said, I made sure to include serialization/deserialization (and JSON at that) to see what a realistic workload might be like.
pranitha_m
9 months ago
Thanks! I am updating the benchmarks to take into account the different payload sizes as we speak, maybe we can discuss the difference in our methodologies and results then? I'll drop you a mail once updated.
elBoberido
9 months ago
Sweet. Can we link to your benchmark from the main iceoryx2 readme?
pranitha_m
9 months ago
Alright, I am updating the benchmarks for the newest release, I'll open a PR once done.
elBoberido
9 months ago
Thanks, really appreciate it.
sischoel
9 months ago
I was looking a bit at the code for the shared memory implementation in https://github.com/3tilley/rust-experiments/tree/master/ipc and the dependency https://github.com/elast0ny/raw_sync-rs.
My last systems programming class was already a few years ago and I am a bit rusty, so I got some questions:
1. Looking at the code in https://github.com/elast0ny/raw_sync-rs/blob/master/src/even... it looks like we are using a userspace spinlock. Aren't these really bad because they mess with the process scheduler and might unnecessarily trigger the scaling governor to increase the cpu frequency? I think at least on linux one could use a semaphore to inform the consumer that new data has been produced.
2. What kind of memory guarantees do we have on modern computer architectures such as x86-64 and ARM? If the producer does two writes (I imagine first the data and then the release of the lock), is it guaranteed that when the consumer reads the second value, the first value has also been synchronized?
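To make question 2 concrete, the pattern I have in mind is the usual release/acquire publication, sketched here with std::sync::atomic (my mental model only, not the actual raw_sync code):

    use std::cell::UnsafeCell;
    use std::sync::atomic::{AtomicBool, Ordering};

    // One-slot mailbox as it might be laid out in a shared mapping.
    struct Slot {
        data: UnsafeCell<u64>,
        ready: AtomicBool,
    }
    // Needed so the slot can be shared; soundness rests on the protocol below.
    unsafe impl Sync for Slot {}

    // Producer: write the payload, then publish with Release.
    unsafe fn produce(slot: &Slot, value: u64) {
        *slot.data.get() = value;                  // (1) plain write
        slot.ready.store(true, Ordering::Release); // (2) publish
    }

    // Consumer: Acquire load. If it observes `true`, the write in (1)
    // is guaranteed to be visible, on x86-64 and ARM alike.
    unsafe fn consume(slot: &Slot) -> Option<u64> {
        if slot.ready.load(Ordering::Acquire) {
            Some(*slot.data.get())
        } else {
            None
        }
    }

So the question is whether something equivalent to that Release store / Acquire load pair is actually issued, or whether the crate relies on weaker orderings.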
elBoberido
9 months ago
I'm not sure I fully understand what you mean? Do you assume we implemented the same approach for shared memory communication as described in the blog post?
If that's the case, I want to reassure you that we don't use locks. Quite the contrary, we use lock-free[1] algorithms to implement the queues. We cannot use locks for the reason you mentioned and also for cases when an application dies while holding the lock. This would result in a deadlock, so locks cannot be used in a safety-critical environment. Btw, there are already cars out there which are using a predecessor of iceoryx to distribute camera data in an ECU.
For hard realtime systems we have a wait-free queue. This gives even more guarantees. Lock-free algorithms often have a CAS loop (compare and swap), which in theory can lead to starvation but it's practically unlikely as long as your system does not run at 100% CPU utilization all the time. As a young company, we cannot open source everything immediately, so the wait-free queue will be part of a commercial support package, together with more sophisticated tooling, like teased in the blog post.
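For readers who haven't seen the term, a CAS loop has roughly this shape -- a generic sketch, not our actual queue code:

    use std::sync::atomic::{AtomicUsize, Ordering};

    // Lock-free: some thread always makes progress, but an individual
    // thread can keep losing the race and retry indefinitely (starvation).
    fn fetch_increment(counter: &AtomicUsize) -> usize {
        loop {
            let current = counter.load(Ordering::Relaxed);
            match counter.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(previous) => return previous,
                Err(_) => continue, // lost the race, retry
            }
        }
    }

A wait-free algorithm instead bounds the number of steps every operation takes, which is why it gives the stronger guarantees needed for hard realtime.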
Regarding memory guarantees: there are essentially the same guarantees as when sharing an Arc<T> via a Rust channel. After publishing, the producer releases ownership to the subscribers, and they have read-only access for as long as they hold the sample. When the sample is dropped by all subscribers, it is released back to the shared memory allocator.
Btw, we also have an event signalling mechanism to avoid polling the queue and instead wait until the producer signals that new data is available. But this requires a context switch, and it is up to the user to decide whether that is desired.
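The general shape of such a wake-up side channel on Linux looks something like this -- a sketch using an eventfd and the libc crate, purely illustrative and not the iceoryx2 API; it just shows where the extra context switch comes from:

    use std::io;

    // The payload stays in shared memory; the eventfd only carries the
    // "new data available" signal. In practice the fd would be inherited
    // or passed to the other process, e.g. over a unix socket.
    fn notify(event_fd: libc::c_int) -> io::Result<()> {
        let one: u64 = 1;
        let n = unsafe { libc::write(event_fd, &one as *const u64 as *const libc::c_void, 8) };
        if n == 8 { Ok(()) } else { Err(io::Error::last_os_error()) }
    }

    // Blocks until notified - this is the context switch mentioned above.
    fn wait_for_data(event_fd: libc::c_int) -> io::Result<u64> {
        let mut count: u64 = 0;
        let n = unsafe { libc::read(event_fd, &mut count as *mut u64 as *mut libc::c_void, 8) };
        if n == 8 { Ok(count) } else { Err(io::Error::last_os_error()) }
    }

    fn main() -> io::Result<()> {
        let fd = unsafe { libc::eventfd(0, 0) };
        if fd < 0 {
            return Err(io::Error::last_os_error());
        }
        notify(fd)?;
        println!("woken, pending notifications: {}", wait_for_data(fd)?);
        Ok(())
    }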
elBoberido
9 months ago
Thanks for the tips. We have a comparison with message queues and unix domain sockets [1] on the repo on github [2].
~~It's nice to see that independent benchmarks are in the same ballpark as the ones we perform.~~ Edit: sorry, I confused your link with another one which also has ping-pong in its title.
We provide data types which are shared memory compatible, which means one does not have to serialize/deserialize. For image or lidar data, one also does not have to serialize, and this is where copying large data really hurts. But you are right, if your data structures are not shared memory compatible, one has to serialize the data first, and this has its cost, depending on what serialization format one uses. iceoryx is agnostic to this, though, and one can select what's best for a given use case.
[1]: https://raw.githubusercontent.com/eclipse-iceoryx/iceoryx2/r...
[2]: https://github.com/eclipse-iceoryx/iceoryx2
pjmlp
9 months ago
Yeah, I think it is about time we re-focus on multi-processing as an extension mechanism, given the hardware available nowadays.
Loading in-process plugins was a great idea 20 - 30 years ago; however, it has proven not to be such a great idea with regard to host stability and exposure to possible security exploits.
And shared memory is a good compromise between both models.
elBoberido
9 months ago
Indeed. That's our goal :)