stinkbeetle
10 hours ago
Data speculation is a CPU technique too, which Apple CPUs are known to implement. Apparently they can do stride detection when predicting address values.
Someone with an M2 or newer might try the code and find no speedup from the "improved" version, because the plain loop already iterates faster than the L1 load-to-use latency.
adrian_b
4 hours ago
Value prediction and address prediction are very different things.
Address prediction for loads and stores, by detecting various kinds of strides and access patterns, is done by most CPUs designed during the last 25 years and it is used for prefetching the corresponding data. This is as important for loads and stores as branch prediction for branches.
On the other hand, value prediction for loads is done by very few CPUs and only for very restricted use cases, because in general it is too costly in comparison with its meager benefits. Unlike branch direction prediction and branch target prediction, where the set from which the predicted value must be chosen is small, the set from which to choose the value that a load will return is huge, except in very specific applications, e.g. ones that repeatedly load values from a small table.
The application from the parent article is exactly such a special case: the value returned by the load can be computed without loading it, except in exceptional cases, which are detected when the loaded value differs from the pre-computed value.
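For the linked-list loop being discussed, that pattern looks roughly like this (a minimal C sketch with my own names, not the article's exact code; it assumes nodes are usually allocated contiguously):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct node {
    struct node *next;
    uint64_t value;
} node;

/* Software value speculation: guess that the next node sits directly
 * after this one in memory (n + 1).  The real pointer is still loaded,
 * but only to verify the guess; when the guess is right -- the common
 * case for a contiguously allocated list -- the branch is correctly
 * predicted and later iterations need not wait on the load. */
uint64_t sum_speculative(node *n) {
    uint64_t sum = 0;
    node *next = NULL;
    for (; n != NULL; n = next) {
        sum += n->value;
        next = n + 1;         /* pre-computed (predicted) next pointer */
        if (n->next != next)  /* exceptional case: prediction was wrong */
            next = n->next;   /* fall back to the value actually loaded */
    }
    return sum;
}
```

One caveat: a compiler may collapse that `if` into a conditional move, which reinstates the data dependency, so getting the intended codegen can take some coaxing.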
stinkbeetle
2 hours ago
> Value prediction and address prediction are very different things.
Both are classes of data prediction, and Apple CPUs do both.
> Address prediction for loads and stores, by detecting various kinds of strides and access patterns, is done by most CPUs designed during the last 25 years and it is used for prefetching the corresponding data. This is as important for loads and stores as branch prediction for branches.
That is not what is known as load/store address prediction. That is cache prefetching, which of course has to predict addresses in some manner too.
> On the other hand, value prediction for loads is done by very few CPUs and for very restricted use cases, because in general it is too costly in comparison with meager benefits. Unlike for branch direction prediction and branch target prediction, where the set from which the predicted value must be chosen is small, the set from which to choose the value that will be returned by a load is huge, except for very specific applications, e.g. which repeatedly load values from a small table.
I'm talking about load address prediction specifically. Apple has both, but load value prediction would not trigger here: as far as I know it doesn't do pattern/stride detection the way the load address predictor does, it's purely value-based, so you'd have to keep seeing the same values come back from the load. Their load address predictor does detect strides, though.
I don't know whether it needs cache misses or other long-latency sources to kick in and start training, so I'm not entirely sure it would capture this pattern. But it can capture similar ones for sure. I have an M4 somewhere, I should dig it out and try.
> The application from the parent article is such a very special case, because the value returned by the load can be computed without loading it, except for exceptional cases, which are detected when the loaded value is different from the pre-computed value.
bjornsing
10 hours ago
But that works on a different level, right? As I understand it, hardware data speculation is about prefetching from memory into cache, whereas this trick uses the branch predictor as an ultra-fast ”L0” cache, you could say.
stinkbeetle
7 hours ago
This is doing value speculation in software, using the branch predictor. Hardware does do it too, just with different tables for deriving the predicted value, and a misprediction is detected and flushed in a slightly different way.
But the effect on the main sequence of instructions in the backend is quite similar. In neither case is it a "prefetch" as such: the load actually executes with the predicted value, the result is consumed by other instructions, and address generation is decoupled from the previous load's result.
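For contrast, here is the dependency chain being decoupled (my sketch, with hypothetical names): in a plain traversal, the next iteration's address is the previous load's result, so the loop is serialized on load-to-use latency.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct node {
    struct node *next;
    uint64_t value;
} node;

/* Plain traversal: the address used by iteration i+1 IS the value
 * loaded in iteration i, so the loads form a serial chain and the
 * loop cannot go faster than one load-to-use latency per node.
 * The speculated version replaces this with cheap address arithmetic
 * (n + 1) and uses the loaded pointer only as a check, moving the
 * load off the critical path. */
uint64_t sum_plain(node *n) {
    uint64_t sum = 0;
    while (n != NULL) {
        sum += n->value;
        n = n->next;  /* next iteration's address depends on this load */
    }
    return sum;
}
```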
bjornsing
4 hours ago
Yeah that’s sort of how I understand the OP too: The CPU will execute speculatively on the assumption that the next element in the linked list is consecutive in memory, so it doesn’t have to wait for L1 cache. It needs to check the real value in L1 of course, but not synchronously.