ot
25 days ago
You can go even faster, about 8ns (almost an additional 10x improvement), by using software perf events: PERF_COUNT_SW_TASK_CLOCK is thread CPU time, and it can be read through a shared page (so no syscall; see perf_event_mmap_page). You then add the delta since the last context switch with a single rdtsc call inside a seqlock.
Unfortunately this is not well documented, and I'm not aware of any open-source implementations.
EDIT: Or maybe not, I'm not sure whether PERF_COUNT_SW_TASK_CLOCK allows selecting only user time. The kernel can definitely do it, but I don't know if the wiring is there. However, this definitely works for overall thread CPU time.
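For anyone who wants to try this, here is a rough sketch of the read path, not a tested production implementation. The seqlock loop and the cycles-to-nanoseconds formula follow the reader loop documented in perf_event_open(2). Treating time_enabled plus the delta as the thread's CPU time since the event was opened is my assumption about how a per-thread PERF_COUNT_SW_TASK_CLOCK event behaves. The helper names (read_cycles, cycles_to_delta_ns, task_clock_open, task_clock_read_ns) are made up for illustration. The open can fail depending on perf_event_paranoid and privileges, and newer capability bits such as cap_user_time_short are ignored here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* perf_event_open(2) uses a plain compiler barrier in its reader loop. */
#define barrier() __asm__ volatile("" ::: "memory")

/* Raw hardware cycle counter (hypothetical helper name). */
static inline uint64_t read_cycles(void) {
#if defined(__x86_64__) || defined(__i386__)
    return __builtin_ia32_rdtsc();
#elif defined(__aarch64__)
    uint64_t v;
    __asm__ volatile("mrs %0, cntvct_el0" : "=r"(v));
    return v;
#else
#error "this sketch only covers x86 and aarch64"
#endif
}

/* Cycles -> nanosecond delta, formula from perf_event_open(2). */
static uint64_t cycles_to_delta_ns(uint64_t cyc, uint64_t time_offset,
                                   uint32_t time_mult, uint16_t time_shift) {
    uint64_t quot = cyc >> time_shift;
    uint64_t rem = cyc & (((uint64_t)1 << time_shift) - 1);
    return time_offset + quot * time_mult + ((rem * time_mult) >> time_shift);
}

/* Open a task-clock event for the calling thread and map its metadata page.
 * Returns NULL on failure. The fd is deliberately left open for the
 * lifetime of the thread in this sketch. */
static struct perf_event_mmap_page *task_clock_open(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_SOFTWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_SW_TASK_CLOCK;
    /* pid = 0, cpu = -1: the calling thread, on whatever CPU it runs. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE), PROT_READ,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    return (struct perf_event_mmap_page *)p;
}

/* Seqlock reader: retry if the kernel rewrote the page while we read it.
 * ASSUMPTION: time_enabled + delta is the thread CPU time (in ns) since
 * the event was opened. Without cap_user_time the value is only as fresh
 * as the kernel's last update of the page. */
static uint64_t task_clock_read_ns(const volatile struct perf_event_mmap_page *pc) {
    uint32_t seq;
    uint64_t enabled, delta;
    do {
        seq = pc->lock;
        barrier();
        enabled = pc->time_enabled;
        delta = 0;
        if (pc->cap_user_time)
            delta = cycles_to_delta_ns(read_cycles(), pc->time_offset,
                                       pc->time_mult, pc->time_shift);
        barrier();
    } while (pc->lock != seq);
    return enabled + delta;
}
```

Usage would be one task_clock_open() per thread, cached lazily in thread-local state, followed by task_clock_read_ns() on the hot path, with a fallback to clock_gettime when the open returns NULL.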
jerrinot
25 days ago
That's a brilliant trick. The setup overhead and permission requirements for perf_event might be heavy for arbitrary threads, but for long-lived threads it looks pretty awesome! Thanks for sharing!
ot
25 days ago
Yes, you need some lazy setup in thread-local state to use this. And short-lived threads should be avoided anyway :)
catlifeonmars
25 days ago
I guess if you need the concurrency/throughput you should use a userspace green thread implementation. I'm guessing most implementations of green threads multiplex onto long-running OS threads anyway.
jerrinot
25 days ago
In a system with green threads, you typically want the CPU time of the fiber or tasklet rather than the carrier thread. In that case, you have to ask the scheduler, not the kernel.
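To make "ask the scheduler" concrete, here is a minimal sketch of per-fiber accounting, assuming a scheduler that calls a hook whenever it switches a fiber onto or off a carrier thread. The struct and function names (fiber, carrier_cpu_ns, fiber_switch_in, fiber_switch_out) are hypothetical and not taken from any particular runtime, and the carrier's clock is read with plain clock_gettime for simplicity.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical per-fiber bookkeeping kept by a user-space scheduler. */
struct fiber {
    uint64_t cpu_ns;      /* CPU time accumulated over completed runs */
    uint64_t switched_in; /* carrier CPU time when the current run began */
};

/* CPU time (user + system) of the carrier OS thread, in nanoseconds. */
static uint64_t carrier_cpu_ns(void) {
    struct timespec ts;
    if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) != 0)
        return 0;
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Called by the scheduler just before it resumes the fiber. */
static void fiber_switch_in(struct fiber *f) {
    f->switched_in = carrier_cpu_ns();
}

/* Called by the scheduler right after the fiber yields or blocks. */
static void fiber_switch_out(struct fiber *f) {
    f->cpu_ns += carrier_cpu_ns() - f->switched_in;
}
```

The fiber's CPU time is then just f->cpu_ns, regardless of how many carrier threads it has run on, which is information the kernel does not have.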
nly
25 days ago
Why do you need a seqlock? To make sure you're not context switched out between the read of the page value and the rdtsc?
Presumably you mean you just double check the page value after the rdtsc to make sure it hasn't changed and retry if it has?
Tbh I thought clock_gettime was a vDSO-based virtual syscall anyway.
ot
23 days ago
> Presumably you mean you just double check the page value after the rdtsc to make sure it hasn't changed and retry if it has?
Yes, that's exactly what a seqlock (reader) is.
mgaunard
24 days ago
clock_gettime is not doing a syscall, it's using the vDSO.
jerrinot
24 days ago
clock_gettime() goes through the vDSO shim, but whether it avoids a syscall depends on the clock ID and (in some cases) the clock source. For thread-specific CPU user time, the vDSO shim cannot resolve the request in user space and must transit into the kernel. In this specific case, there is absolutely a syscall.
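A small sketch to see the difference for yourself: the same call with two clock IDs. Note the assumptions: CLOCK_THREAD_CPUTIME_ID measures user plus system time, so it is only a stand-in for the user-time-only case discussed here, and the helper name clock_ns is made up. Running the program under strace should show a clock_gettime syscall for the thread CPU clock, while CLOCK_MONOTONIC typically produces none on a machine with a usable clock source.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read any POSIX clock as nanoseconds; returns -1 on failure. */
static int64_t clock_ns(clockid_t id) {
    struct timespec ts;
    if (clock_gettime(id, &ts) != 0)
        return -1;
    return (int64_t)ts.tv_sec * 1000000000 + (int64_t)ts.tv_nsec;
}

/* Usually answered inside the vDSO without entering the kernel. */
static int64_t wall_ns(void) {
    return clock_ns(CLOCK_MONOTONIC);
}

/* Per-thread CPU clock: the vDSO hands this one to the kernel. */
static int64_t thread_cpu_ns(void) {
    return clock_ns(CLOCK_THREAD_CPUTIME_ID);
}
```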