pizlonator
6 hours ago
Always cool to see new mutex implementations and shootouts between them, but I don’t like how this one is benchmarked. Looks like a microbenchmark.
Most of us who ship fast locks use very large multithreaded programs as our primary way of testing performance. The things that make a mutex fast or slow seem to be different for complex workloads with varied critical section length, varied numbers of threads contending, and varying levels of contention.
(Source: I wrote the fast locks that WebKit uses, I’m the person who invented the ParkingLot abstraction for lock impls (now also used in Rust and Unreal Engine), and I previously did research on fast locks for Java and have a paper about that.)
PaulDavisThe1st
4 hours ago
To add to this: as the original/lead author of a desktop app that frequently runs with many tens of threads, I'd like to see numbers for the non-heavily-contended cases. As a real-time (audio) programmer, I am more concerned with, for example, the cost of taking the mutex when it is not already locked (the overwhelmingly common situation in our app). Likewise, I want to know the cost of a try-lock operation that will fail, not what happens when N threads are contending.
Of course, with Cosmopolitan being open source and all, I could do these measurements myself, but still ...
pizlonator
4 hours ago
Totally!
Pro tip: if you really do know that contention is unlikely, and uncontended acquisition is super important, then it's theoretically impossible to do better than a spinlock.
Reason: locks that have the ability to put the thread to sleep on a queue must do a compare-and-swap (or at least an atomic RMW) on `unlock`. But spinlocks can get away with just doing a store-release (or just a plain store with a compiler fence on x86) to `unlock`.
Spinlocks also have excellent throughput under most contention scenarios, though at the cost of power and of being unkind to other apps on the system. If you want your spinlock to be hella fast under contention, just make sure you `sched_yield` before each retry (`SwitchToThread` on Windows; on Darwin you can do a bit better with `thread_switch(MACH_PORT_NULL, SWITCH_OPTION_DEPRESS, 1)`).
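Concretely, the shape I mean (a minimal C11 sketch; all names here are made up, not WebKit's actual code):

    #include <stdatomic.h>
    #include <sched.h>

    typedef struct { atomic_int held; } spinlock;

    static void spin_lock(spinlock *l)
    {
        /* One atomic RMW when uncontended; yield before each retry. */
        while (atomic_exchange_explicit(&l->held, 1, memory_order_acquire))
            sched_yield();
    }

    static void spin_unlock(spinlock *l)
    {
        /* The payoff: unlock is a plain store-release, no CAS. */
        atomic_store_explicit(&l->held, 0, memory_order_release);
    }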
Animats
30 minutes ago
> just make sure you `sched_yield` before each retry
Assuming `sched_yield` does something.
There's a futex congestion problem inside Wine's memory allocator. There are several levels of locks. If you're growing a buffer, in the sense of C's "realloc", and no free buffer is available, memory allocation stays locked during the allocation of a bigger buffer, the copying of the contents, and the release of the old buffer. "Push"-type operations can force this. Performance drops of two orders of magnitude ensue when multi-threaded programs contend for that lock.[1]
Inside one of the lock loops is a call to `YieldProcessor`:

    static void spin_lock( LONG *lock )
    {
        while (InterlockedCompareExchange( lock, -1, 0 ))
            YieldProcessor();
    }
But `YieldProcessor` never yields to the scheduler; on x86 it's just a PAUSE instruction (`rep; nop`), a spin-wait hint to the CPU:[2]

    static FORCEINLINE void YieldProcessor(void)
    {
    #ifdef __GNUC__
    #if defined(__i386__) || defined(__x86_64__)
        __asm__ __volatile__( "rep; nop" : : : "memory" );
    #elif defined(__arm__) || defined(__aarch64__)
        __asm__ __volatile__( "dmb ishst\n\tyield" : : : "memory" );
    #else
        __asm__ __volatile__( "" : : : "memory" );
    #endif
    #endif
    }

Wine devs are aware of this, but the mess is bad enough that no one has tackled it. This is down in the core of what "malloc" calls, so changes there could have unexpected effects on many programs. It needs attention from someone really into mutexes.
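A sketch of the kind of change that might help (illustrative only, not a tested patch): bound the pause-spin, then genuinely cede the CPU via ntdll's own yield call:

    static void spin_lock( LONG *lock )
    {
        unsigned spins = 0;
        while (InterlockedCompareExchange( lock, -1, 0 ))
        {
            if (++spins < 64)
                YieldProcessor();      /* CPU pause hint only */
            else
                NtYieldExecution();    /* actually let another thread run */
        }
    }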
[1] https://forum.winehq.org/viewtopic.php?t=37688
[2] https://gitlab.winehq.org/wine/wine/-/blob/HEAD/include/winn...
lilyball
3 hours ago
On Darwin, a pure spinlock can produce a priority-inversion deadlock, because Darwin has a quality-of-service (QoS) implementation in the kernel that differs from how everyone else handles thread priority. In other kernels, a low-priority thread is still eventually guaranteed a CPU slice, so if it's holding a spinlock, it will eventually make progress and unlock. On Darwin with QoS, higher-QoS threads can preempt lower-QoS threads indefinitely.
For this reason, on Darwin you want to avoid spinlocks unless you know that all threads touching the spinlock always run at the same QoS. Instead, your go-to low-overhead lock there is os_unfair_lock, which records its owner so the kernel can resolve the inversion by donating the priority of blocked threads to the thread holding the lock.
pizlonator
3 hours ago
I’ve shipped code on Darwin that spinlocks and gets away with it without any noticeable cases of this happening.
I know it can happen in theory. But theory and practice ain’t the same.
I worked for Apple when I shipped this too lmao
PaulDavisThe1st
4 hours ago
We use spinlocks where appropriate. I recall that in the '90s the general rule of thumb was: if the lock is held for less than ~10x the context-switch time, spinlocks are generally the better choice. Not sure if that's still true on contemporary architectures.
The more common pattern in RT/audio code is "try to take the lock, but have an alternate code path if that fails" (sketched below). It's not that there is never going to be contention, but it will be extremely rare, and when it occurs, it probably matters. RWLocks are also a common pattern, with the RT thread(s) being read-only (and still required to fall back on an alternate code path if the read lock cannot be taken).
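A minimal sketch of that try-lock pattern (pthreads; the names here are made up):

    #include <pthread.h>

    static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;

    /* RT callback: must never block. */
    void process_block(float *buf, int nframes)
    {
        if (pthread_mutex_trylock(&state_lock) == 0) {
            /* Uncontended - the overwhelmingly common case:
               pick up whatever the GUI thread changed. */
            /* ... copy in new parameters ... */
            pthread_mutex_unlock(&state_lock);
        } else {
            /* Contended (rare): fall back to last cycle's state. */
        }
        /* ... render nframes of audio into buf ... */
    }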
pizlonator
3 hours ago
These days, fast lock implementations use the following rough idiom, or some idiom that is demonstrably no slower even for short critical sections:

    if (LIKELY(CAS(&lock, UNLOCKED, LOCKED))) return;
    for (unsigned i = 0; i < 40; ++i) {
        /* spin for a bounded amount of time */
        if (LIKELY(CAS(&lock, UNLOCKED, LOCKED))) return;
        sched_yield();
    }
    ... /* actual full lock algo goes here */

So, the reason to use spinlocks isn't that they are faster for short critical sections, but that they don't have to CAS on unlock - and so they are faster especially in the uncontended case (and in all other cases too).
In other words, the modern rule of thumb is something like: if you're going to grab the lock so frequently that the uncontended lock/unlock time shows up as a significant percentage of your execution time, then use a spinlock. That probably implies you're holding the lock for <100x or even <1000x the context-switch time, or something around there.
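The elided slow path is roughly the classic futex scheme (a sketch in the spirit of Drepper's "Futexes Are Tricky", not any particular production lock): mark the lock contended, then park in the kernel:

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    enum { UNLOCKED = 0, LOCKED = 1, LOCKED_WAITERS = 2 };

    static void lock_slow(atomic_int *lock)
    {
        /* Mark the lock contended and sleep until woken; re-check on
           every wakeup because futex waits can return spuriously. */
        while (atomic_exchange_explicit(lock, LOCKED_WAITERS,
                                        memory_order_acquire) != UNLOCKED)
            syscall(SYS_futex, lock, FUTEX_WAIT_PRIVATE,
                    LOCKED_WAITERS, NULL, NULL, 0);
    }

The flip side is that `unlock` now has to atomically find out whether anyone is parked - which is exactly the CAS-on-unlock cost mentioned above.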
adastra22
an hour ago
Huh, interesting. TIL. Thanks for that!
ot
an hour ago
> Reason: locks that have the ability to put the thread to sleep on a queue must do a compare-and-swap (or at least an atomic RMW) on `unlock`. But spinlocks can get away with just doing a store-release (or just a plain store with a compiler fence on x86) to `unlock`.
This is something I've been thinking about a lot over time: the CAS is only there to atomically determine whether there are any sleeping waiters on unlock, so you know whether you have to futex_wake. I would really like some way to get away with non-fenced operations (at least on x86), but I don't know if it's just that nobody has figured out how, or whether there is a fundamental reason it's not possible.
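Concretely, the two unlock paths look something like this (sketch; `futex_wake` stands in for a raw FUTEX_WAKE syscall wrapper):

    /* Sleep-capable lock: must atomically learn whether anyone is parked. */
    if (atomic_exchange_explicit(lock, UNLOCKED, memory_order_release)
            == LOCKED_WAITERS)
        futex_wake(lock, 1);   /* someone was parked; wake one */

    /* Spinlock: a plain store-release suffices. */
    atomic_store_explicit(lock, UNLOCKED, memory_order_release);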
pizlonator
39 minutes ago
You do need a fence in the unlock path though (at least a release fence).
I think the issue is that if you ask the CPU to just store something (like in a spinlock), whether or not there's a fence, it's an operation with limited data flow dependencies, so it's easy for the CPU to execute. Even the fence can be handled using wacky speculation tricks.
But if you want to do something like, “store this value but only if the old value satisfies some predicate”, then there’s a load and the whole thing is dependent on the load. So you’re asking the CPU to load, then run a predicate, then store, and for that to be fenced, and atomic.
Strictly more work. I don’t think there’s any trick to make it faster than just the store release.
ataylor284_
4 hours ago
The writeup suggests this implementation is optimized for the not-locked case:
> uses an optimistic CAS (compare and swap) immediately, so that locking happens quickly when there's no contention
ngoldbaum
4 hours ago
This style of mutex will also power PyMutex in Python 3.13. I have real-world benchmarks showing how much faster PyMutex is than the old PyThread_type_lock that was available before 3.13.
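For the curious, usage from C is about as minimal as it gets (public C API in 3.13+; a sketch):

    #include <Python.h>      /* CPython 3.13+ */

    static PyMutex mutex;    /* zero-initialized; no init/destroy calls */

    static void critical(void)
    {
        PyMutex_Lock(&mutex);    /* optimistic fast path, parks on contention */
        /* ... critical section ... */
        PyMutex_Unlock(&mutex);
    }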
0xDEADFED5
2 hours ago
Can I use PyMutex from my own Python code?
miohtama
3 hours ago
Any rough summary?
ngoldbaum
3 hours ago
https://github.com/numpy/numpy/issues/26510#issuecomment-229...
And now that I look at that again I realize I forgot to finish that up!
uvdn7
4 hours ago
I was thinking the same. There are many mutexes out there, and some are better at certain workloads than others. DistributedMutex and SharedMutex come to mind (https://github.com/facebook/folly/blob/main/folly/synchroniz..., https://github.com/facebook/folly/blob/main/folly/SharedMute...). Just like with hashmaps, it's rarely the case that a single implementation is best under _all_ possible workloads.
pizlonator
4 hours ago
Yeah.
I should say, though, that if you're on Windows then I have yet to find a real workload where SRWLock isn't the fastest (provided you're fine with no recursion and with a lock that is word-sized). That lock has made some kind of deal with the devil AFAICT.
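For reference, the usage shape (all real Win32 calls; the lock is word-sized and statically initializable):

    #include <windows.h>

    static SRWLOCK srw = SRWLOCK_INIT;   /* no destroy call needed */

    void writer(void)
    {
        AcquireSRWLockExclusive(&srw);
        /* ... exclusive critical section ... */
        ReleaseSRWLockExclusive(&srw);
    }

    void reader(void)
    {
        AcquireSRWLockShared(&srw);
        /* ... shared (read-only) critical section ... */
        ReleaseSRWLockShared(&srw);
    }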
ot
an hour ago
Yeah, that specific benchmark is actually likely to reward undesirable behaviors, for example pathological unfairness: the optimal scheduling of those threads clearly runs all the increments from the first thread, then all of the second thread, etc., because this minimizes inter-processor traffic.
A mutex that sleeps for a fixed amount (for example 100us) on failed lock acquisition will probably get very close to that behavior (since it almost always batches one thread's operations together), and "win" the benchmark. Still, that would be a terrible mutex for any practical application with any contention.
This is not to say that this mutex is not good (or that pthread mutexes are not bad), just that the microbenchmark in question does not measure anything that predicts performance in a real application.
pizlonator
an hour ago
Yeah! For all we know this performs great on real programs.
And for all we know it’s absolute trash on real programs.
intelVISA
3 hours ago
Fast locks are an oxymoron vs optimized CAS