Building a compile-time SIMD optimized smoothing filter

71 points, posted 11 hours ago
by PaulHoule

17 Comments

keldaris

10 hours ago

This looks like a nice case study for when you're already using Rust for other reasons and just want to make a bit of numerical code go fast. However, as someone mostly writing C++ and Julia, this does not look promising at all - it's clear that the Julia implementation is both more elegant and faster, and it seems much easier to reproduce that result in C++ (which has no issues with compile time float constants, SIMD, GPU support, etc.) than Rust.

I've written very little Rust myself, but when I've tried, I've always come away with a similar impression that it's just not a good fit for performant numerical computing, with seemingly basic things (like proper SIMD support, const generics without weird restrictions, etc.) treated as afterthoughts. For those more up to speed on Rust development, is this impression accurate, or have I missed something and should I reconsider my view?

gauge_field

9 hours ago

In terms of speed, Rust is up there with C/C++. See e.g. https://benchmarksgame-team.pages.debian.net/benchmarksgame/... I have also ported several algorithms from C, and they matched the original performance.

Regarding SIMD support, the only things missing are stable support for avx512 and some of the more exotic feature extensions for deep learning, e.g. avx_vnni. Those are implemented and waiting to be included in upcoming stable versions.

GPU support: this is still an issue because not enough people are working on it, but there are projects trying to improve it: see https://github.com/tracel-ai/cubecl.

Const generics: yeah, there are a few annoying issues: they are limited to a small set of types. For instance, you can't use an enum as a const generic parameter. Also, you can't use generic parameters in const expressions on stable Rust: see the unstable feature generic_const_exprs. (A minimal sketch of that restriction is below.)
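
A minimal sketch (mine, not the commenter's) of that stable-Rust restriction: an array length can be a const generic parameter directly, but an expression computed from that parameter is rejected without the unstable `generic_const_exprs` feature.

```
struct Window<const N: usize> {
    samples: [f64; N], // fine: N itself is the const generic parameter
}

// Rejected on stable Rust ("generic parameters may not be used in const
// operations") because `2 * N + 1` is an expression over N; it requires
// the unstable `generic_const_exprs` feature:
//
// struct Padded<const N: usize> {
//     samples: [f64; 2 * N + 1],
// }
```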

My main reasons for using Rust in numerical computing:

- The type system. Some find it weird; I find it explicit and easier to understand.

- Cargo (nicer cross-platform defaults, since I tend to develop on both Windows and Linux).

- Unconditional code generation with #[target_feature(enable = "feature_list")]. This means I don't have to set a different set of flags for each compilation unit when building; it is enough to put the attribute on top of the function making use of SIMD (see the sketch after this list).
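
As an illustration (my sketch, not gauge_field's code, x86_64 only), the pattern looks roughly like this: the attribute enables AVX2 code generation for that one function, and the caller guards it with a runtime feature check instead of per-unit compiler flags.

```
// Compiled with AVX2 enabled regardless of the crate-wide target flags,
// so the compiler may auto-vectorize the loop body with AVX2 instructions.
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

fn sum(xs: &[f32]) -> f32 {
    // Only call the AVX2 version if the CPU actually supports it.
    if is_x86_feature_detected!("avx2") {
        unsafe { sum_avx2(xs) }
    } else {
        xs.iter().sum()
    }
}
```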

I agree that if you want to be fast and exploratory when developing algorithms and you can sacrifice a little bit of performance, Julia is a better choice.

bobmcnamara

7 hours ago

TBH, Intel ISAs have never been very stable; the mixed AVX flavors are just the latest example.

jonathrg

9 hours ago

Tight loops of SIMD operations seem like something that might be more convenient to just implement directly in assembly, so you don't need to babysit the compiler like this.

anonymoushn

7 hours ago

I'm happy to use compilers for register allocation, even though they aren't particularly good at it

adgjlsfhk1

7 hours ago

Counterpoint: that makes you rewrite it for every architecture and every datatype. A high-level language makes it a lot easier to get something more readable, runnable everywhere, and datatype-generic.

brrrrrm

7 hours ago

You almost always have to for good perf on non-trivial operations

jonathrg

7 hours ago

For libraries, yes that is a concern. In practice you're often dealing with one or a handful of instances of the problem at hand.

menaerus

7 hours ago

If you're ready to settle for much lower IPC, then yes.

adgjlsfhk1

6 hours ago

The Julia version of this (https://github.com/miguelraz/StagedFilters.jl) is very clearly generating pretty great code. For `smooth!(SavitzkyGolayFilter{4,3}, rand(Float64, 1000), rand(Float64, 1000))`, 36 of the 74 inner-loop instructions are vectorized fmadds (of one kind or another). There are a few register spills that seem plausibly unnecessary, and some dependency chains that I think are an inherent part of the algorithm, but I'm pretty sure there isn't an additional 2x of speed to be had here.

menaerus

6 hours ago

I can write SIMD code that's 2x the speed of the non-vectorized code but I can also rewrite that same SIMD code so that 2x becomes 6x.

The point being that you can usually get not just another 2x on top of the initial 2x SIMD implementation, but much more.

Whether or not you see SIMD in the codegen is not a testament to how good the implementation really is.

IPC is the relevant measure here.

adgjlsfhk1

5 hours ago

IPC looks like 3.5 instructions per clock (and the speed for bigger inputs will be memory-bound rather than compute-bound).

menaerus

7 hours ago

Definitely. And I also found the title quite misleading: it's auto-vectorization that this presentation is trying to cover. Compile-time SIMD, OTOH, would mean something totally different, e.g. computation during compile time using constexpr SIMD intrinsics.

I'd also add that it's not only about babysitting the compiler; you're also leaving a lot of performance on the table. Auto-vectorized code, generally speaking, unfortunately cannot beat manually vectorized code (written either with intrinsics or in asm).

feverzsj

7 hours ago

I think most rust projects still depend on clang's vector extension when SIMD is required.

TinkersW

3 hours ago

I only clicked through the slides and didn't watch the video... but ugh, all I see are scalar instructions in the assembly output (the "ss" suffix means scalar; it would be "ps" if it were vectorized).

And they are apparently relying on the compiler to generate it... just no.

Use intrinsics, it isn't that hard.
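
For reference, a rough sketch (mine, not TinkersW's) of what a packed inner loop looks like with Rust intrinsics: `_mm256_fmadd_ps` shows up in the disassembly as a packed `vfmadd...ps` rather than a scalar `...ss`. It assumes x86_64 with AVX and FMA and a length that is a multiple of 8; a real version would handle the tail and check CPU support at runtime.

```
use std::arch::x86_64::*;

// Packed f32 dot product: 8 lanes per iteration via AVX + FMA.
#[target_feature(enable = "avx,fma")]
unsafe fn dot_avx(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = _mm256_setzero_ps();
    for i in (0..a.len()).step_by(8) {
        let va = _mm256_loadu_ps(a.as_ptr().add(i));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i));
        acc = _mm256_fmadd_ps(va, vb, acc); // packed vfmadd...ps, not a scalar ...ss
    }
    // Horizontal sum of the 8 accumulator lanes.
    let mut lanes = [0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    lanes.iter().sum()
}
```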

jtrueb

6 hours ago

I think scientific computing in Rust is getting too much attention without contribution lately. We don't have many essential language features stabilized. SIMD, generic const exprs, intrinsics, function optimization overrides, and reasonable floating point overrides related to fast math are a long way from being stabilized. In order to get better perf, the code is full of these informal compiler hints to guide it towards an optimization like autovectorization or branch elision. The semantics around strict floating point standards are stifling and intrinsics have become less accessible than they used to be.
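
One concrete example of the kind of informal hint being described (my illustration, using integers so the reduction is legally reorderable): re-slicing both inputs to a common length lets the compiler prove the bounds and drop the per-iteration bounds checks, which is often what stands between an indexed loop and auto-vectorization. For floats, the strict FP semantics mentioned above still prevent reassociating the sum.

```
// Bounds-check elision hint: re-slice to a common, provable length.
fn dot_i32(a: &[i32], b: &[i32]) -> i32 {
    let n = a.len().min(b.len());
    let (a, b) = (&a[..n], &b[..n]); // both slices now provably have length n
    let mut acc = 0;
    for i in 0..n {
        acc += a[i] * b[i]; // no bounds checks; the integer sum can be vectorized
    }
    acc
}
```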

Separately, is Julia hitting a different linear algebra backend? Rust's ndarray with blas-src on Accelerate is pretty fast, but the Rust implementation is a little slower on my MacBook. Below is a benchmark of a dot product.

```

    use divan::Bencher;
    use ndarray::Array;

    const M10: usize = 10_000_000;

    // ndarray dot product (BLAS-backed via blas-src/Accelerate in this setup).
    #[divan::bench]
    fn ndarray_dot32(b: Bencher) {
        b.with_inputs(|| (Array::from_vec(vec![0f32; M10]), Array::from_vec(vec![0f32; M10])))
            .bench_values(|(a, b)| {
                a.dot(&b)
            });
    }

    // Chunked dot product: fixed-size 32-element inner reductions the compiler can unroll.
    #[divan::bench]
    fn chunks_dot32(b: Bencher) {
        b.with_inputs(|| (vec![0f32; M10], vec![0f32; M10]))
            .bench_values(|(a, b)| {
                a.chunks_exact(32)
                    .zip(b.chunks_exact(32))
                    .map(|(a, b)| a.iter().zip(b.iter()).map(|(a, b)| a * b).sum::<f32>())
                    .sum::<f32>()
            });
    }

    // Naive iterator dot product: one long sequential f32 reduction.
    #[divan::bench]
    fn iter_dot32(b: Bencher) {
        b.with_inputs(|| (vec![0f32; M10], vec![0f32; M10]))
            .bench_values(|(a, b)| {
                a.iter().zip(b.iter()).map(|(a, b)| a * b).sum::<f32>()
            });
    }

    fn main() {
        // Register and run the divan benchmarks above.
        divan::main();
    }
    
    ---- Rust ----
    Timer precision: 41 ns (100 samples)
    flops             fast    │ slow    │ median  │ mean
    ├─ chunks_dot32   3.903 ms│ 9.96 ms │ 4.366 ms│ 4.411 ms
    ├─ chunks_dot64   4.697 ms│ 16.29 ms│ 5.472 ms│ 5.516 ms
    ├─ iter_dot32     10.37 ms│ 11.36 ms│ 10.93 ms│ 10.86 ms
    ├─ iter_dot64     11.68 ms│ 13.07 ms│ 12.43 ms│ 12.4 ms
    ├─ ndarray_dot32  1.984 ms│ 2.91 ms │ 2.44 ms │ 2.381 ms
    ╰─ ndarray_dot64  4.021 ms│ 5.718 ms│ 5.141 ms│ 4.965 ms

    ---- Julia ----
    native_dot32:
    Median: 1.623 ms, Mean: 1.633 ms ± 341.705 μs
    Range: 1.275 ms - 12.242 ms

    native_dot64:
    Median: 5.286 ms, Mean: 5.179 ms ± 230.997 μs
    Range: 4.736 ms - 5.617 ms

    simd_dot32:
    Median: 1.818 ms, Mean: 1.830 ms ± 142.826 μs
    Range: 1.558 ms - 2.169 ms

    simd_dot64:
    Median: 3.564 ms, Mean: 3.567 ms ± 586.002 μs
    Range: 3.123 ms - 22.887 ms

    iter_dot32:
    Median: 9.566 ms, Mean: 9.549 ms ± 144.503 μs
    Range: 9.302 ms - 10.941 ms

    iter_dot64:
    Median: 9.666 ms, Mean: 9.640 ms ± 84.481 μs
    Range: 9.310 ms - 9.867 ms

    All: 0 bytes, 0 allocs

```

https://github.com/trueb2/flops-bench