hackernews client

Restartable Sequences

178 pointsposted 11 hours ago

51 Comments

GlenTheMachine

10 hours ago

If you had no idea what a restorable sequence is the takeaway is about halfway down the OP:

“This is why Linux now provides rseq() which is a much more enlightened solution. With restartable sequences, you actually can get rid of both the mutex and atomics, while the OS continues to fully abstract scheduling. The way it works is you advise the kernel whenever your program enters a critical section of code that you don't want interrupted. It's probably going to be maybe 10 assembly instructions tops. The first assembly opcode should be a move instruction that sets the rseq_cs field. The last instruction needs to be the thing that makes the modification to your global data structure. Think of it sort of like a really tiny database transaction. What makes it go fast, is that the bidirectional communication with the kernel happens via shared memory.”

jstimpfle

2 hours ago

I think it wasn't explained in a very accessible way. If I got the gist right, this essentially brings "per-CPU" synchronization to userland. It's typical in the kernel to have per-cpu data, while per-thread data is rare and typically impractical. There is a high number of threads managed by the kernel, most of which probably belong to a userland process, most of which do not participate in any given synchronisation scheme. Also threads are often too much of an abstraction for parallel programming needs, given that they are hiding for example cache effects. So it's natural to want to use per-cpu data instead of thread_local data in a userland process, I know I've been wishing for that many times.

With rseq, we can allocate in any userland process one instance of a given synchronisation data structure per each CPU. It's important to understand that userland code accessing per-cpu data structures cannot prevent being scheduled away from a CPU and being replaced by another thread (kernel code can block scheduler for short critical sections). Such a replacement thread may subsequently corrupt that same data that was still in the middle of the transaction. But we can make a subset of transactions safe at least: If a transaction gets committed in a single (final) atomic instruction, and we get kernel support for this transaction to be restarted in case there has been a schedule mid-way, this is a guarantee that at the time of commit, the entire transaction hasn't been interrupted by the scheduler. I.e. a kind of "mutual exclusion" guarantee.

Did I get that right?

bryanlarsen

8 hours ago

That doesn't really explain it though, IMO. IIUC, it's a sequence of instructions that either runs to completion atomically or doesn't. If it is interrupted by anything the kernel jumps you to the abort/retry vector you set with a guarantee that the last instruction in the sequence was not executed.

(Based on my reading of the LWN article rwmj posted).

crote

6 hours ago

> it's a sequence of instructions that either runs to completion atomically or doesn't

The way I read it, it either runs to completion in one go, or gets restarted from the beginning. This means the sequence as a whole isn't executed atomically, as the already-executed instructions during an interrupt aren't rolled back.

It can be used to build atomic actions, but it is up to the developer to create a sequence of instructions where the very last instruction "commits" the entire operation, with the side-effects of partial execution being harmless.

bryanlarsen

4 hours ago

Yes, it's either atomic or the last instruction is guaranteed not to have run. I made this a little harder to read by inserting another clause in the sentence.

khuey

8 hours ago

Yes, the API contract isn't "don't interrupt me during this critical section" it's "if you have to interrupt me during this critical section, go to this recovery/restart code".

There is a time-slice extension feature in the works that's roughly "please let me finish this critical section before you interrupt me". But a hard guarantee that userspace code won't be interrupted is probably untenable in a preemptive multitasking system.

leoc

6 hours ago

Something not completely dissimilar to https://en.wikipedia.org/wiki/PCLSRing then?

rwmj

10 hours ago

LWN has a good article: https://lwn.net/Articles/1033955/

manoDev

10 hours ago

That’s clever — am I right to think it’s the intermediate solution between locks and full STM, implemented at the kernel level, and with zero abstraction cost?

khuey

10 hours ago

It's in some sense a light form of STM. The key insight behind rseq(2) is that if the data is local to a given CPU the only way to get a race is if the kernel deschedules your program from that CPU at an inopportune time. If your operation can be aborted and restarted and the kernel has a mechanism to notify you when that needs to happen you can dispense with the overhead of "real" synchronization and just use a couple mov instructions to enter and exit the critical section.

senderista

8 hours ago

I'm surprised there was no reference to the librseq library, maintained by the rseq implementer:

https://github.com/compudj/librseq

This has helpers for common use cases like counters and linked lists. You shouldn't need to write assembly at all to use rseq in most applications.

wmf

3 hours ago

Justine is writing her own libc and her own malloc so I'm not surprised she wants to use rseq from scratch.

senderista

2 hours ago

That's fine, but I think an article claiming to give an introduction to a technology should at least mention that an essential library exists, and that writing assembly is no longer usually required.

khuey

10 hours ago

Maybe I'm just getting old but the "if you don't spend $20,000 on a workstation you're going to be left behind like a dinosaur" at the top of this article is a huge turn off to reading any further. And I say that as someone who owns a workstation with more cores than the author's.

loeg

3 hours ago

I wouldn't read it too literally. I read it as jokingly rationalizing an obviously overkill self-indulgent purchase.

Avicebron

7 hours ago

The author was also asking for money to buy a house in SF and travel on private planes like a few days ago..the donation must have really showed up if they are using 20k machines at home.

wasabi991011

6 hours ago

She bought the workstation at a discount (see bottom of TFA).

Also, it was pre-2025 (before she got her job at Gradient Canopy), so long before asking for donations.

Finally, if you read the actual donation request, you can see she is trying to make a living doing open source, and is being honest about what the money is going to. Why is that an issue?

crote

5 hours ago

There's quite a bit of a spectrum between "trying to make a living doing open source" and "asking for people to pay for a house in one of the most expensive cities in the country - plus a private jet. It's also quite grating to see it written like we should be grateful that we are even allowed to donate to her.

And if she's even half the genius she's claiming to be, why aren't the big tech companies in a bidding war over who get to pay her a million-dollar salary?

From what I've read of her in the past she seems to be a pretty damn good developer. But in the open source world those are a dime a dozen. If you want to make a living off of it you've got to market yourself, and this... isn't how you do that.

TimTheTinker

4 hours ago

> From what I've read of her in the past she seems to be a pretty damn good developer. But in the open source world those are a dime a dozen

Not exactly. Very few people in recent decades have achieved anything comparable to αcτµαlly pδrταblε εxεcµταblε and Cosmopolitan libc - they're in the category of "that should not even be possible". Of course, Tunney's work doesn't touch Fabrice Bellard in terms of sheer breadth and impact, but they're arguably in the same category.

blauditore

7 hours ago

Do you have a link?

insamniac

7 hours ago

https://web.archive.org/web/20260529122658/https://justine.l...

nick__m

3 hours ago

That's reads like an egotic cocaine fueled rant mixed with a PBS pledges drive.

I understand why the author deleted it !

blauditore

5 hours ago

I was wondering if the author is joking, but after reading a bit more about the attribution drama, it seems rather a lack of reality check and reflection. If you plagiarize work, get called out on it, and then call this "harrassment", I don't know...

nutjob2

7 hours ago

> I have nothing against this person.

Your bad faith reading of that article says otherwise. It's bleedingly obvious that it's satire. Do you think people seriously ask for donations to fund a private airplane?

$20K of gear isn't that much if you're an independent developer, and if you're working for others as such in the US, and you're not a financial basket case, it's doable. She even says "It put me in the poor house for a few months" so she made sacrifices to get there. You can too, if you want to. Why the envy?

To the contrary, so you really seem to have a problem with her.

[references https://web.archive.org/web/20260529122658/https://justine.l... ]

aleph_minus_one

6 hours ago

> $20K of gear isn't that much if you're an independent developer

Perhaps in the USA if you have a well-paid job at a MAMAA company ...

kitd

6 hours ago

Go to the end. Those machines were provided by the companies discounted so she could continue her llm research. She's not saying "anyone who's anyone" has a 1024-core workstation these days.

toast0

10 hours ago

If we're being overly generous, they're saying you need at least a raspberry pi? You can see a 3x improvement there, which shows the pattern works, and that's good enough for a dinosaur (this interpretation is easier to justify if you just skim the article... Which I did the first time)

But agreeing with you, I've done big optimization stuff for multicore servers (not as many cores, but same kind of work) and my workstation was something small with not even the same os. I don't need the big machine on my desk to understand the concepts. I just need the big machine to check my work. For me, that's always been a production machine, sometimes a production machine taken out of rotation for pre-validation before running on production load. I guess I should mention, I work on applications specifically, and libraries and kernels as it relates to making whatever my application is work better. I also don't have a problem with pinning threads to cpus... but my applications are usually one big program that fills the system. Someone writing a general purpose library has a harder time.

Of course, if you want to do this kind of work and you don't have your own production load, you're going to have to borrow, rent, or buy a big machine. It doesn't need to be your workstation though. I hate working with cloud nonsense, but if your tests are short, and you do the upfront work to make your images start fast, you can probably save a lot of money by renting spot instances when testing ... I don't know if you can do spot instances of bare metal though, so you're probably stuck with vm overhead.

khuey

9 hours ago

Yeah, you can rent an equivalent workstation from AWS for under $10/hour (and that's the on demand price) so I don't think cost is a huge barrier to doing this sort of work. The language and listing the prices of the workstations down to the penny just strikes me as a rather unprofessional way to communicate.

lpapez

8 hours ago

Except that is not what the article says and you clearly missed the sarcasm.

nutjob2

7 hours ago

She's clearly making a point about taking advantage of the optimization/algorithm she's pitching, and doesn't seem very serious. Alternatively, someone reading that as a serious claim is rather naive.

dan_sbl

8 hours ago

Love how the 512 GB of RAM documented in one of his setups as costing $2,776 now sells for $18,299. https://justine.lol/rseq/threadripper.html

brcmthrowaway

8 hours ago

His?

riffraff

6 hours ago

to be empathetic with GP, I keep reading the name as Justin everytime I see it, even though I know it's Justine.

Veserv

10 hours ago

Restartable windows, or more generically introspection windows, are a really useful technique you can apply in any situation where you understand or control the sources of preemption. The earliest uses of this technique in operating systems that I am aware of are ~25 years old.

The key insight is that the preempter can introspect the program counter of the code being preempted (which is now stable since it was preempted) and act accordingly. The simplest mechanism is to reset their program counter if in a critical section. The more generic mechanism is to jump them to a supplied address. This allows you to do things like hard abort and more.

You can further remove the need for the preempter to understand the preempted code by having the preempted code create a self-introspection code snippet and supplying that with the program counter at preemption. So the preempter just vectors them to their own code which knows how to interpret its own state at any preemption point.

senderista

8 hours ago

There is a paper from Sun that anticipated tcmalloc's development of rseq by over a decade:

https://dl.acm.org/doi/abs/10.1145/512429.512451

Veserv

3 hours ago

Yep, it is a fairly old technique with a lot of of general applicability beyond just allowing mutex elision for usage of per-core data structures amidst potential core migration. But apparently using your own expert knowledge and actually explaining things and describing generalizations is worthy of flagging these days.

senderista

3 hours ago

I have no idea why your comment was flagged :(

yubblegum

6 hours ago

> chances are the CPU's internal mutexes aren't as good as the ones you've implemented in userspace

Anyone with an informed opinion on this statement? It's seems counter intuitive (npi).

khuey

6 hours ago

The author is referring to false sharing (https://en.wikipedia.org/wiki/False_sharing). CPU caches operate at cache line granularity (typically 64 bytes) so writes to one part of the cache line can require synchronization with writes to non-overlapping parts of the same cache line. This can dramatically reduce performance when there are a large number of cores operating on the same cache line.

If you remove the 64 byte alignment (which forces each counter variable onto a separate cache line) from hitcounter-shard.c you ought to be able to see the performance difference for yourself.

yubblegum

5 hours ago

Thanks for the effort but the q wasn't "what is false sharing",

The Q is: is it true the CPU mutexes are actually slower than those implemented in userspace?

khuey

5 hours ago

The "CPU mutex" is just the cache coherency mechanism. If you shard your data to avoid triggering it as suggested, then yes, it's much faster.

EDIT: or maybe you're asking if introducing an explicit userspace mutex is better than a lockless algorithm with false sharing issues. The answer is that it's workload dependent but it definitely can be.

yubblegum

5 hours ago

Let's try this again:

OP > The issue is this will likely go just as slow if not slower. The mere act of sharing the same 64-byte region of memory (a.k.a. cacheline) between multiple cores, causes the CPU internally to basically use a mutex, and chances are the CPU's internal mutexes aren't as good as the ones you've implemented in userspace.

The claim by OP is that "chances are" that userspace mutexes are better than CPU's internal mutexes. So either h/w guys are (for a first) lagging s/w folks and using outdated approaches to creating a mutex in hardware, OR, we somehow must use an inferior approach when implementing a mutex in a CPU, OR, ..

How is it possible that a hardware implementation of an algorithm could be slower than its software variant, and that in "userspace" and not even the kernel.

mota7

3 hours ago

If you have two threads on different cores that write to the same cacheline, the CPU has to enforce write ordering. The way it does this is for one of the cores to acquire a write lock on the cacheline. The actual implementation of this is via some variant of the MESI protocol that results in the cacheline on one core going to an Exclusive state, while copies of the cacheline on every other core becomes Invalid.

MESI takes a non-trivial amount of time to run. It's usually mediated at the L3 cache, which is frequently clocked slower than the main core, so it's a non-trivial number of cycles for a MESI state transition to happen (think at least ~40 cycles).

Compare that with a cacheline that's already Exclusive on a core: Writing to this cacheline is a pure L1 cache write at ~1 cycle, no MESI involved, no L3 cache involved.

Note that the typical userspace space mutex (some memory location that's modified by compare-exchange) is implicitly relying on MESI whenever two cores race for the lock, or whenever a lock is release on one core and then taken on another, etc etc.

So if your userspace coherency can be handled via some sharding that avoids needing MESI to run, it can go much faster than relying on the CPU's "internal mutex" aka MESI.

To directly address "is it possible that a hardware implementation of an algorithm could be slower than its software variant": The point is that with RSEQ, you can write a _different_ algorithm that's faster than an implementation relying on hardware mediated MESI. Instead of having memory that's accessed for a bunch of different cores, there's a chunk of memory per core, and each core almost always writes only to it's allocated chunk of memory.

khuey

3 hours ago

> The point is that with RSEQ, you can write a _different_ algorithm that's faster than an implementation relying on hardware mediated MESI.

Even without rseq(2) you're looking at a "different algorithm". For one, a proper userspace mutex allows the excluded threads to be descheduled and the core can be repurposed to run other tasks, rather than hammering away at the contended cache line.

yubblegum

3 hours ago

Thank you.

Taniwha

3 hours ago

The cache coherency protocols that sit between the CPUs and DRAM always essentially "use a mutex": when a cpu wants to write it broadcasts to all the other CPUs and either gets the latest copy from whoever wrote it last and shoots down any read-only copies in other CPUs or reads it from DRAM (or converts a read only copy to be writeable)

This happens on every memory access, so the thing you want to avoid is ping ponging writeable cache lines between CPUs (especially before you have a chance to actually write it) - LL/SC instructions sit on top of these protocols and allow instructions to tell is a cache line had been "stolen" before you have a chance to write it

za_creature

4 hours ago

I'm not a systems guy, but you're probably right. A mutex requires an atomic variable which triggers the same cache coherency mechanism under contention.

What Justine probably meant was an apples to oranges comparison: in userspace, you can sometimes add constraints that allow for a faster algorithm whereas the cpu generally has to assume the worst.

That being said, if you look through this thread you'll find similar issues with OP's phrasing. A fine engineer / hacker, but too much 4chan troll in my book.

matheusmoreira

2 hours ago

This is amazing. I'll definitely use this in my projects.

squirrellous

2 hours ago

IIUC rseq is similar to thread-local data with the additional benefit that it scales with number of CPU cores, not threads. However if you are an application developer and is able to control all the threads in an application, then rseq isn’t that superior.

I fully agree that rseq should be more easily available to Linux developers, though.

keyle

5 hours ago

   ... bidirectional communication with the kernel happens via shared memory.

What could possibly go wrong?

loeg

3 hours ago

Elaborate? Kernel shared memory interfaces are often reasonable (vdso, io_uring).