phkahler
3 days ago
OpenMP is one of the easiest ways to make existing code run across CPU cores. In the simplest cases you just add a single #pragma to C code and it runs about N times faster on N cores. This works when you're running a function in a loop with no side effects. Some examples I've done:
1) Ray tracing: looping over all the pixels in an image and tracing rays to determine the color of each pixel. The algorithm and data structures are complex but don't change during rendering, so N cores is about N times as fast.
2) In Solvespace we had a small loop that calls a tessellation function on a bunch of NURBS surfaces. The function was appending triangles to a list, so I made a thread-local list for each call and combined them afterwards to avoid writes to a shared data structure. Again, N times faster with very little effort.
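A minimal sketch of the thread-local list pattern from the second example, not the actual Solvespace code. `IntList`, `list_push`, `list_append` and `tessellate` are hypothetical stand-ins (integers instead of triangles), and this variant keeps one private list per thread rather than one per call:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable list of ints, standing in for a triangle list. */
typedef struct {
    int *data;
    size_t len, cap;
} IntList;

static void list_push(IntList *l, int v) {
    if (l->len == l->cap) {
        size_t cap = l->cap ? l->cap * 2 : 16;
        int *p = realloc(l->data, cap * sizeof *p);
        assert(p != NULL);
        l->data = p;
        l->cap = cap;
    }
    l->data[l->len++] = v;
}

static void list_append(IntList *dst, const IntList *src) {
    for (size_t i = 0; i < src->len; i++)
        list_push(dst, src->data[i]);
}

/* Hypothetical stand-in for tessellating one surface:
   emits (s % 3) + 1 values into `out`. */
static void tessellate(int s, IntList *out) {
    for (int k = 0; k <= s % 3; k++)
        list_push(out, s * 10 + k);
}

/* Caller owns the returned list and frees result.data. */
IntList tessellate_all(int n_surfaces) {
    IntList all = {0};
    #pragma omp parallel
    {
        IntList local = {0};            /* private to each thread */

        #pragma omp for nowait
        for (int s = 0; s < n_surfaces; s++)
            tessellate(s, &local);      /* no writes to shared data here */

        /* Only the merge touches the shared list, one thread at a time. */
        #pragma omp critical
        list_append(&all, &local);

        free(local.data);
    }
    return all;
}
```

The order of elements in the combined list depends on which thread merges first, so this only fits cases where output order doesn't matter.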
The code is also fine to build single threaded without change if you don't have OpenMP. Your compiler will just ignore the #pragmas.
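For illustration, a sketch of the "single #pragma" case. `shade` is a hypothetical side-effect-free per-pixel function standing in for a real ray tracer; each iteration writes only its own pixel, so no synchronisation is needed:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-pixel function: pure, depends only on its arguments. */
static unsigned char shade(int x, int y) {
    return (unsigned char)((x * 31 + y * 17) & 0xFF);
}

void render(unsigned char *image, int width, int height) {
    assert(image != NULL && width > 0 && height > 0);

    /* Rows are split across threads; x and y are declared in the loops,
       so each thread gets its own copies. */
    #pragma omp parallel for
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            image[(size_t)y * width + x] = shade(x, y);
}
```

With GCC or Clang the pragma takes effect when compiling with -fopenmp. Without that flag the same file builds as an ordinary serial loop, though the compiler may warn about the unknown pragma.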
pixelesque
3 days ago
> OpenMP is one of the easiest ways to make existing code run across CPU cores.
True (or with Intel TBB). However, as someone with a lot of experience optimising HPC algorithms for rendering, geometry processing and simulation, I'd add that there are caveats: quite often, existing code that is naively parallelised this way spends a disproportionate amount of CPU time on spinlocks in OpenMP or TBB instead of doing useful work. (I've also noticed the same thing happening with Rayon in Rust.)
Sometimes I've looked at code that colleagues have "parallelised" this way, and they've said "yes, it's using multiple threads". But when you profile it with perf or VTune, it's clearly not doing much *useful* parallel work, and sometimes it's even slower than single-threaded in wall-clock terms. People just didn't check whether it was faster; they looked at the CPU usage and didn't notice the spinlocks.
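A small sketch of the kind of check this suggests. It is not the runtime spinlock case specifically, but an analogous contention example: both functions (hypothetical names) keep every core busy, yet the first funnels all updates through one shared variable while the second accumulates per thread. `wall_seconds` is an assumed helper built on C11 `timespec_get` for timing elapsed time rather than CPU time:

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Wall-clock seconds. Note that clock() reports CPU time instead, which
   goes up with thread count even when nothing finishes sooner. */
double wall_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Contended: every match does an atomic update on one shared counter. */
long count_even_atomic(const int *v, size_t n) {
    assert(v != NULL);
    long count = 0;
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        if (v[i] % 2 == 0) {
            #pragma omp atomic
            count++;
        }
    }
    return count;
}

/* Each thread counts privately; partial counts are combined once. */
long count_even_reduction(const int *v, size_t n) {
    assert(v != NULL);
    long count = 0;
    #pragma omp parallel for reduction(+:count)
    for (long i = 0; i < (long)n; i++) {
        if (v[i] % 2 == 0)
            count++;
    }
    return count;
}
```

Calling `wall_seconds()` before and after each variant on a large array, and comparing against a serial build, shows whether the extra threads actually reduce elapsed time. The results depend on the machine and data, so measure rather than assume.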
CoastalCoder
3 days ago
Here's some reading that I personally have found helpful for optimizing parallel programs:
The best I've found so far:
https://cdn.kernel.org/pub/linux/kernel/people/paulmck/perfb...
And some other good reading:
https://www.amazon.com/Systems-Performance-Brendan-Gregg/dp/...
https://fgiesen.wordpress.com/2014/08/18/atomics-and-content...
https://travisdowns.github.io/blog/2020/07/06/concurrency-co...
ddavis
3 days ago
OpenMP is great. I’ve done something similar to your second case (thread-local objects that are filled in parallel and later combined). In the case of “OpenMP off” (pragmas ignored), is it possible to avoid the overhead of the thread-local object essentially getting copied into the final object (since no OpenMP means only a single thread-local object)? I avoided this by implementing a separate code path, but I’m wondering if there are any tricks I missed that would still allow a single code path.
Jtsummers
3 days ago
Give one of the threads (thread ID 0, for instance) special privileges. Its list is the one everything else is appended to, then there's only concatenation or copying if you have more than one thread.
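One way this could look, as a sketch under assumptions: `IntList`, `list_push` and `produce` are hypothetical, and the `_OPENMP` guard keeps the file compiling when OpenMP is off. Thread 0 writes straight into the final list, so a serial build never copies anything:

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical growable list of ints. */
typedef struct {
    int *data;
    size_t len, cap;
} IntList;

static void list_push(IntList *l, int v) {
    if (l->len == l->cap) {
        size_t cap = l->cap ? l->cap * 2 : 16;
        int *p = realloc(l->data, cap * sizeof *p);
        assert(p != NULL);
        l->data = p;
        l->cap = cap;
    }
    l->data[l->len++] = v;
}

/* Hypothetical work item: emits (i % 3) + 1 values. */
static void produce(int i, IntList *out) {
    for (int k = 0; k <= i % 3; k++)
        list_push(out, i * 10 + k);
}

/* Caller owns the returned list and frees result.data. */
IntList produce_all(int n_items) {
    IntList all = {0};
    #pragma omp parallel
    {
        int tid = 0;                    /* stays 0 when OpenMP is off */
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        IntList local = {0};
        IntList *dst = (tid == 0) ? &all : &local;

        /* No "nowait" here: the implicit barrier at the end of the loop
           ensures thread 0 has finished writing to `all` before any
           other thread starts merging into it. */
        #pragma omp for
        for (int i = 0; i < n_items; i++)
            produce(i, dst);

        if (tid != 0) {
            #pragma omp critical
            {
                for (size_t k = 0; k < local.len; k++)
                    list_push(&all, local.data[k]);
            }
            free(local.data);
        }
    }
    return all;
}
```

The barrier is the part that is easy to get wrong: with `nowait`, another thread could append to the final list while thread 0 is still pushing to it.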
Or, pre-allocate the memory and let each thread write to its own subset of the final collection and avoid the combine step entirely. This works regardless of the number of threads you use so long as you know the maximum amount of memory you might need to allocate. If it has no calculable upper bound, you will need to use other techniques.
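A sketch of the pre-allocation idea, assuming a known upper bound per input item (`MAX_PER_ITEM` and `produce` are hypothetical). This variant gives each item, rather than each thread, its own fixed-size slot in the output, plus a count of how many entries in the slot are valid:

```c
#include <assert.h>
#include <stdlib.h>

enum { MAX_PER_ITEM = 3 };  /* assumed upper bound on outputs per item */

/* Hypothetical work item: writes (item % 3) + 1 values into `slot`
   and returns how many it wrote (never more than MAX_PER_ITEM). */
static int produce(int item, int *slot) {
    int n = item % 3 + 1;
    for (int k = 0; k < n; k++)
        slot[k] = item * 10 + k;
    return n;
}

/* Returns a buffer of n_items * MAX_PER_ITEM ints; counts[i] holds the
   number of valid entries in slot i. Caller frees the buffer. */
int *produce_all(int n_items, int *counts) {
    assert(n_items > 0 && counts != NULL);
    int *out = malloc((size_t)n_items * MAX_PER_ITEM * sizeof *out);
    assert(out != NULL);

    /* Each iteration writes only its own slot and its own counts[i],
       so no locking and no combine step are needed. */
    #pragma omp parallel for
    for (int i = 0; i < n_items; i++)
        counts[i] = produce(i, out + (size_t)i * MAX_PER_ITEM);

    return out;
}
```

The trade-off is unused space in partially filled slots, and the result is identical with or without OpenMP since the layout does not depend on thread scheduling.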
pjmlp
3 days ago
For C, C++ and Fortran users.
So easiest depends on the target audience.