BeeOnRope
9 months ago
To answer a question implied in the article, per-lookup timing with rdtscp hurts the hash more than the btree for the same reason the hash is hurt by the data-depending chaining: rdtscp is an execution barrier which prevents successive lookups from overlapping. rdtsc (no p) isn't, and would probably produce quite different timings.
That the btree doesn't benefit from overlapping adjacent lookup/inserts is intereting.
I suppose it is because btree access (here) involves data-dependent branches, and so with random access you'll get about L mispredicts per lookup in an L-deep tree, so adjacent lookups are separated by at least one mispredict: so adjacent lookup can overlap execution, but the overlapping is useless since everything beyond the next mispredict is useless as it is on the bad path.
That's probably at least true for the small map regime. For the larger maps, the next iteration is actually very useful, even if on a mispredicted path, because the date accesses are at the right location so it serves to bring in all the nodes for the next iteration. This matters a lot outside of L2. At 5 instructions per comparison and 32-element nodes, however, there are just so many instructions in the window for 1 lookup it's hard to make it to the next iteration.
So b-trees benefit a lot from a tight linear seach (e.g. 2 instructions per check, macro-fused to 1 op), or a branch-free linear search, or far better than those for big nodes, a vectorized branch-free search.
dwattttt
9 months ago
> the overlapping is useless since everything beyond the next mispredict is useless as it is on the bad path
Is this a consequence of Spectre et al mitigations?
BeeOnRope
9 months ago
No, just a consequence of how mispredicts work: all execution after a mispredict is thrown away: though some traces remain in the cache, which can be very important for performance (and also, of course, Spectre).
jnordwick
9 months ago
Can you be more specific about "all execution after a mispredict is thrown away". Are you saying even non-dependant instructions? I thought the CPU just ignored the registers resulting from the wrong instruction path and started calculating the correct path - that the way the register file worked allowed it be to much better than just junking everything.
A little less verbosely: "is non-dependant ILP work also tossed on a mispredict even though that won't change?
BeeOnRope
9 months ago
> Can you be more specific about "all execution after a mispredict is thrown away". Are you saying even non-dependant instructions?
To clarify, the misprediction happens at some point in the linear instruction stream, and all instructions and results before that in the linear stream are kept and everything after is thrown out. Everything after is wrong because the stream of instructions didn't go down the expected path, so it's not even really about instruction dependencies at that point: the wrong stream of instructions executed in the first place.
In some cases the wrong and good paths join up very quickly, e.g.:
cmp rax, rax
jl over
inc rax
over:
...
In this case the jump is over a single instruction so even if mispredicted the right stream will vary only in the `inc` instruction, so in principle it seems possible to save some of the further work which doesn't depend on rax, but this is very difficult in practice and no CPU does it in a general way that I know of. I believe some POWER arches could actually internally covert the pattern a above into the moral equivalent of a predicated operation, removing the branch completely, but that's a different thing entirely with a different set of tradeoffs.jnordwick
9 months ago
Does this mean that a compiler should strive to move everything before the branch that isn't affected by it?
eg, if there was:
over:
inc rbx
it should move that before the cmp? Is this always easy to do and done by the compiler?dwattttt
9 months ago
That's the part I was curious about; whether there would've been a helpful cache impact, if not for modern Spectre prevention.
starspangled
9 months ago
Spectre mitigations don't change that, some of them do require adding speculation barriers or otherwise turn off branch prediction for cases where unprivileged state can be used to direct mis-predicted privileged branches into gadgets which can create a side-band to privileged data with speculative state.
But in general execution (i.e., no privilege domain crossings), this mechanism is not required.
Performance effects of executing mispredicted branches (called something like "wrong-path execution" or "mispredicted path ..." in literature) is interesting and it has been studied. I don't know what the state of the art is, although I've seen results showing both speedups and slowdowns (as you would expect with any cache / BP kind of topic :P).
BeeOnRope
9 months ago
> Spectre mitigations don't change that, ...
Yes, exactly. To the first order I think Spectre didn't really change the performance of existing userspace-only code. What slowed down was system calls, kernel code and some things which were recompiled or otherwise adjusted to mitigate some aspects of Spectre. There might be a rare exception, e.g., IIRC `lfence` slowed down on AMD in order to make it more useful as a speculation barrier on AMD but this is hardly an instruction that saw much use before.
> I don't know what the state of the art is, although I've seen results showing both speedups and slowdowns
Yeah. This seems like a pretty cut and dry case where you'd get a speedup from wrong-path misses, since the independent next search will be correctly predicted from the start and access exactly the right nodes, so it serves as highly accurate prefetching: it only gets thrown out because of a mispredict at the end of the _prior_ search.
Something like the misses within a single binary search are more ambiguous: for random input the accuracy drops off like 0.5^n as you predict n levels deep, but that still adds up to ~double MLP compared to not speculating, so in a microbenchmark it tends to look good. In the real world with 1 lookup mixed in with a lot of other code, the many cache lines brought in on the bad path may be overall worse than inserting a speculation barrier yourself.
That's the cool part: we can choose whether we want speculation or not if we know up front if it's harmful.