jasonthorsness
4 days ago
"We achieve 19.8 GB/s prefix sum throughput—1.8x faster than a naive implementation and 2.6x faster than FastPFoR"
"FastPFoR is well-established in both industry and academia. However, on our target platform (Graviton4, SIMDe-compiled) it benchmarks at only ~7.7 GB/s, beneath a naive scalar loop at ~10.8 GB/s."
I thought the first quote contained a typo, but it's correct: on this platform the naive scalar loop (~10.8 GB/s) really was faster than the "better" method, FastPFoR (~7.7 GB/s). Another demonstration of why benchmarking on the actual target platform matters!
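For anyone who wants to check this kind of claim on their own hardware, here is a rough sketch of what a "naive scalar loop" baseline could look like. It is not the article's code: the function names, the use of 32-bit integers, and the `clock()`-based timing are all my assumptions, and a serious benchmark would also need warm-up runs, pinned cores, and buffers larger than the caches.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Naive in-place inclusive prefix sum: data[i] becomes data[0] + ... + data[i].
   Unsigned arithmetic wraps on overflow, which is well-defined in C. */
void prefix_sum_scalar(uint32_t *data, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += data[i];
        data[i] = acc;
    }
}

/* Hypothetical helper (my naming): runs `reps` passes over the buffer and
   returns throughput in GB/s based on CPU time from clock().
   Returns 0.0 if the run was too short for the timer to resolve. */
double measure_gb_per_s(uint32_t *data, size_t n, int reps) {
    clock_t start = clock();
    for (int r = 0; r < reps; r++) {
        prefix_sum_scalar(data, n);
    }
    clock_t end = clock();

    double secs = (double)(end - start) / CLOCKS_PER_SEC;
    if (secs <= 0.0) {
        return 0.0;
    }
    double bytes = (double)n * sizeof(uint32_t) * (double)reps;
    return bytes / secs / 1e9;
}
```

Running the same harness against each candidate implementation on the machine you actually deploy to is the point; numbers from a different CPU or compiler may not carry over.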