bob1029
a day ago
I look at cross core communication as a 100x latency penalty. Everything follows from there. The dependencies in the workload ultimately determine how it should be spread across the cores (or not!). The real elephant in the room is that oftentimes it's much faster to just do the whole job on a single core even if you have 255 others available. Some workloads do not care what kind of clever scheduler you have in hand. If everything constantly depends on the prior action you will never get any uplift.
You see this most obviously (visually) in places like game engines. In Unity, the difference between non-burst and burst-compiled code is very extreme. The difference between single and multi core for the job system is often irrelevant by comparison. If the amount of cpu time being spent on each job isn't high enough, the benefit of multicore evaporates. Sending a job to be ran on the fleet has a lot of overhead. It has to be worth that one time 100x latency cost both ways.
The GPU is the ultimate example of this. There are some workloads that benefit dramatically from the incredible parallelism. Others are entirely infeasible by comparison. This is at the heart of my problem with the current machine learning research paradigm. Some ML techniques are terrible at running on the GPU, but it seems as if we've convinced ourselves that GPU is a prerequisite for any kind of ML work. It all boils down to the latency of the compute. Getting data in and out of a GPU takes an eternity compared to L1. There are other fundamental problems with GPUs (warp divergence) that preclude clever workarounds.
bsenftner
11 hours ago
Astute points. I've worked on an extremely performant facial recognition system (tens of millions of face compares per second per core) that lives in L1 and does not use the GPU for the FR inference at all, only for the display of the video and the tracked people within. I rarely even bother telling ML/DL/AI people it does not use the GPU, because I'm just tired of the argument that "we're doing it wrong".
zipy124
6 hours ago
How are you doing tens of millions of faces per second per core, first of all assuming a 5ghz processor, that gives you 500 cycles per image if you do ten million a second, that's not nearly enough to do anything image related. Second of all L1 cache is at most in the hundreds of kilobytes, so the faces aren't in L1 but must be retrieved from elsewhere...??
Keyframe
5 hours ago
You can't look at it like _that_. Biometrics has its own "things". I don't know what OP is actually doing, but it's probably not classical image processing. Most probably facial features are going through some "form of LGBPHS binarized and encoded which is then fed into an adaptive bloom filter based transform"[0].
Paper quotes 76,800 bits per template (less compressed) and with 64-bit words it's what, 1200 64-bit bitwise ops. at 4.5 Ghz it's 4.5b ops per second / 1200 ops per per comparison which is ~3.75 million recognitions per second. Give or take some overhead, it's definitely possible.
[0] https://www.christoph-busch.de/files/Gomez-FaceBloomFilter-I...
Cache locality is a thing. Like in raytracing and the old confucian adage that says "Primary rays cache, secondary trash".
reactordev
3 hours ago
Correct, it’s probably distance of a vector or something like that after the bloom. Take the facial points as a vec<T> as you only have a little over a dozen and it’s going to fit nicely in L1.
ekidd
5 hours ago
Back in the old days of "Eigenfaces", you could project faces into 12- or 13-dimensional space using SVD and do k-nearest-neighbor. This fit into cache even back in the 90s, at least if your faces were pre-cropped to (say) 100x100 pixels.
dudeofea
5 hours ago
I don't know the application, but just guessing that you don't need to compare an entire full-resolution camera image, but perhaps some smaller representation like an embedding space or pieces of the image
rdedev
6 hours ago
Could you tell me a bit about how you were able to ensure the model is close to the cache?
izabera
5 hours ago
the secret is to keep things ˢᵐᵒˡ
kcb
5 hours ago
No shot are you doing tens of millions of anything useful per second per core. That's like beyond HFT numbers.
kiitos
9 hours ago
> I look at cross core communication as a 100x latency penalty
if your workload is majority cpu-bound then this is true, sometimes, and at best
most workloads are io (i.e. syscall) bound, and io/syscall overhead is >> cross-core communication overhead
pron
8 hours ago
That's fine, but a work-stealing scheduler doesn't redistribute work willy-nilly. Locally-submitted tasks are likely to remain local, and are generally stolen when stealing does pay off. If everything is more-or-less evenly distributed, you'll get little or no stealing.
That's not to say it's perfect. The problem is in anticipating how much workload is about to arrive and deciding how many worker threads to spawn. If you overestimate and have too many worker threads running, you will get wasteful stealing; if you're overly conservative and slow to respond to growing workload (to avoid over-stealing), you'll wait for threads to spawn and hurt your latencies just as the workload begins to spike.
vlovich123
7 hours ago
There’s secondary costs though - because you might run on any thread you have to sprinkle atomics and/or mutexes all over the place (in Rust parlance the tasks spawned must be Send) which have all sorts of implicit performance costs that stack up even if you never transfer the task.
In other words, you could probably easily do 10m op/s per core on a thread per core design but struggle to get 1m op/s on a work stealing design. And the work stealing will be total throughput for the machine whereas the 10m op/s design will generally continue scaling with the number of CPUs.
pron
4 hours ago
An occasional successful CAS (on an owned cache line) has very little cost, but if you have to sprinkle atomics/mutexes all over the place, then there's something that's clearly not scalable in your design regardless of the concurrency implementation (you're expecting contention in a lot of places).
vlovich123
3 hours ago
An atomic add on a 6ghz high end desktop CPU (13900) is I believe on the order of 4-10ns. If it’s in your hot path your hot path can’t go faster than 50-100 million operations/s - that’s the cost of 1 such instruction in your hotpath (down from the 24 billion non-atomic additions your 6ghz could do otherwise). A CAS brings this down to ~20-50 Mops/s. So it’s quite a meaningful slowdown if you actually want to use the full throughput of your CPU. And if that cache line is cached on another CPU you pay an additional hidden latency that could be anywhere from 40-200ns further reducing your hotpath to a maximum of 5-25MHz (and ignoring secondary effects of slowing down those cores without them even doing anything). God forbid there’s any contention - you’re looking at a variance of 20x between the optimal and worst case of how much of a throughput reduction you see by having a single CAS in your hot loop. And this is just talking about the task scheduler - at least in Rust you’ll need to have thread-safe data structures being accessed within the task itself - that’s what I was referring to as “sprinkled”. If you really want to target something running at 10Mops/s on a single core, I don’t think you can possibly get there with a task stealing approach.
zipy124
6 hours ago
What about using something like https://github.com/judofyr/spice?
vlovich123
5 hours ago
These aren’t task queues as are being discussed here. It’s more like rayon - I have a par_iter and I want that to go as fast as possible on a large number of elements. Slightly different use case than thread per core vs work stealing runtime.
dist-epoch
9 hours ago
The thing with GPUs is that for many problems really dumb and simple algorithms (think bubble sort equivalent) are many times faster than very fancy CPU algorithms (think quick sort equivalent). Your typical non-neural-network GPU algorithm is rarely using more than 50% of it's power, yet still outperforms carefully written CPU algorithms.
anyfoo
8 hours ago
> If everything constantly depends on the prior action you will never get any uplift.
I mean... that's kind of a pathological case, no?
Archit3ch
7 hours ago
> If everything constantly depends on the prior action you will never get any uplift.
Not always. For differential equations with large enough matrices, the independent work each core can do outperforms the communication overhead of core-to-core latency.
jayd16
6 hours ago
If its independent work then it's work that doesn't rely on the prior action... At least not in the way the parent means.
Archit3ch
5 hours ago
State can depend on the previous time point, or even the same time point. I see this misconception often in audio programming "you cannot parallelise work because it depends on the previous sample". As long as you can find parallelism somewhere and it's less than the overhead, you can benefit. Obviously if there's zero parallelism in the problem, no amount of cores will help.