dragontamer
a month ago
Intel needs to see what has happened to their AVX instructions and why NVidia has taken over.
If you just wrote your SIMD in CUDA 15 years ago, NVidia compilers would have given you maximum performance across all NVidia GPUs rather than being forced to write and rewrite in SSE vs AVX vs AVX512.
GPU SIMD is still SIMD. Just... better at it. I think AMD and Intel GPUs can keep up btw. But the software advantage and the long-term benefits of rewriting into CUDA are readily apparent.
Intel ISPC is a great project btw if you need high level code that targets SSE, AVX, AVX512 and even ARM NEON all with one codebase + auto compiling across all the architectures.
-------
Intel's AVX512 is pretty good at a hardware level. But a software methodology to interact with SIMD using GPU-like languages should be a priority.
Intrinsics are good for maximum performance but they are too hard for mainstream programmers.
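To illustrate (a rough sketch I just made up, not code from the article): the same dot product written plainly and then with raw AVX2+FMA intrinsics.

    #include <immintrin.h>
    #include <stddef.h>

    // Plain version: this is all most programmers want to write.
    float dot_scalar(const float* a, const float* b, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; ++i) s += a[i] * b[i];
        return s;
    }

    // Intrinsics version (AVX2 + FMA; assumes n is a multiple of 8).
    float dot_avx2(const float* a, const float* b, size_t n) {
        __m256 acc = _mm256_setzero_ps();
        for (size_t i = 0; i < n; i += 8)
            acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                                  _mm256_loadu_ps(b + i), acc);
        // Horizontal reduction of the 8 partial sums.
        __m128 lo = _mm256_castps256_ps128(acc);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        __m128 s  = _mm_add_ps(lo, hi);
        s = _mm_hadd_ps(s, s);
        s = _mm_hadd_ps(s, s);
        return _mm_cvtss_f32(s);
    }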
jsheard
a month ago
> Intel ISPC is a great project btw if you need high level code that targets SSE, AVX, AVX512 and even ARM NEON
It's pretty funny how NEON ended up in there. A former Intel employee decided to implement it for fun and submitted it as a pull request, which Intel quietly ignored for obvious reasons. But then another former Intel employee who still had commit rights merged the PR, and the optics of publicly reverting it would have been even worse than stonewalling, so Intel begrudgingly let it stand (though they did revoke that dev's commit rights).
adrian_b
a month ago
While there is some truth in what you say, it makes it seem as if writing in the CUDA style is something new and revolutionary invented by NVIDIA, which it is not.
The CUDA style of writing parallel programs is nothing else than the use of the so-called "parallel do" a.k.a. "parallel for" program structure, which was already being discussed in 1963. Notable later evolutions of this concept appeared in "Communicating Sequential Processes" by C.A.R. Hoare (1978-08: "arrays of processes"), then in the programming language Occam, which was designed based on what Hoare had described, then in the OpenMP extension of Fortran (1997-10), then in the OpenMP extension of C and C++ (1998-10).
Programming in CUDA does not bring anything new, except that, in comparison with e.g. OpenMP, some keywords are implicit and others are different, so the equivalence is not immediately obvious.
Programming for CPUs in the much older OpenMP is equivalent to programming in CUDA for GPUs.
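For example, the classic OpenMP spelling of this "parallel for" structure looks like the following (a trivial sketch; compile with -fopenmp):

    // Each iteration is an independent piece of work, which is essentially
    // what a CUDA thread index gives you.
    void saxpy(float a, const float* x, float* y, int n) {
        #pragma omp parallel for simd
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }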
The real innovation of NVIDIA has been the high quality of the NVIDIA CUDA compiler and the CUDA runtime GPU driver, which are able to distribute the work that must be done on the elements of an array over all the available cores, threads and SIMD lanes, in a manner that is transparent to the programmer, so in many cases the programmer is free to ignore the actual structure of the GPU that will run the program.
Previous compilers for OpenMP, or for other such parallel-programming language extensions, have been much less capable of producing efficient parallel programs without the programmer tuning them for each hardware variant.
saagarjha
25 days ago
I'm not sure what you mean. All CUDA code needs to be aware of the programming model the GPU imposes on them, splitting their code manually into threads and warps and blocks and kernels to match. This isn't really transparent at all.
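Even a trivial (made-up) kernel forces you to pick a launch geometry and compute global indices from block/thread IDs by hand:

    // Toy kernel: you compute your own global index and stride over the grid.
    __global__ void scale(float a, const float* x, float* y, int n) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;  // my lane
             i < n;
             i += blockDim.x * gridDim.x)                    // grid-stride loop
            y[i] = a * x[i];
    }

    // Host side: block size (and thus warp/occupancy behavior) is an explicit choice.
    // scale<<<(n + 255) / 256, 256>>>(a, d_x, d_y, n);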
dragontamer
a month ago
Oh all of this was being done in the 1980s by Lisp* programmers.
I'm not calling it new. I'm just saying that the intrinsics style is much much harder than what Lisp*, DirectX HLSL, CUDA, OpenCL (etc. etc) does.
A specialized SIMD language makes writing SIMD easier compared to intrinsic style. Look at any CUDA code today and compare it to the AVX that is in the above article and it becomes readily apparent.
pjmlp
a month ago
It is worse than that, given that AVX is the survivor of Larrabee's great plan to kill GPUs.
Larrabee was going to take it all over; I enjoyed its presentation at GDCE 2009.
Earw0rm
a month ago
And a few years later, Intel said we'd get AVX512 on everything by 2016, and that the instruction encoding supported a future extension to 1024.
And then the Skylake and Cannon Lake debacle..
First they pulled it from the consumer chips a fairly short time before launch. Then the server chips it was present in would downclock aggressively when you did use it, so you could get at best maybe 40% more performance, certainly far from the 2x+ it promised.
Ten years on and the AMD 9950X does a pretty good job with it, however.
Earw0rm
a month ago
Oh, and I neglected to mention the protracted development, and short, miserable life, of Cannon Lake itself.
First announced in 2013, it eventually shipped five years later in only a single, crippled dual-core mobile SKU, which lasted just a year in the market before they killed it off.
"Let's put our only consumer implementation of our highest performing vector architecture on a lame-duck NUC chip.", good move guys.
dragontamer
a month ago
I mean, 288 E-core Xeons are about to ship. Xeon 6900 series, right? (Estimated to ship in Q1 2025.)
So Larrabee lives on for... some reason. These E cores are well known to be modified Intel Atom cores, and those were modified Xeon Phi cores, which were Larrabee-based.
Just with.... AVX512 being disabled. (Lost when Xeon Phi turned into Intel Atoms.)
Intel's technical strategy is completely bonkers. In a bad way. Intel invented all this tech 10 to 20 years ago but fails to have a cohesive strategy to bring it to market. There are clearly smart people there, but somehow all the top-level decisions are just awful.
ashvardanian
a month ago
Yes, a lot of weird decisions were made at Intel.
Ironically, AMD waited a long time to implement AVX-512, but now has it on both server and mobile chips (natively and via 256-bit emulation, respectively). Intel started the whole thing, has a very fragmented stack, and is now preparing those E cores with even more new extensions.
Most importantly for Search and AI, it adds AVX_VNNI, which can be used for faster 8-bit integer dot-products: https://github.com/ashvardanian/SimSIMD/blob/75c426fb190a9d4...
Would be interesting to see how matrix multiplication throughput will differ between AVX-512-capable P cores and a larger quantity of AVX_VNNI-capable E cores!
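For illustration, a rough sketch of such an 8-bit dot product (this is not the SimSIMD code linked above; it uses the AVX-512 VNNI spelling of the intrinsic, _mm256_dpbusd_epi32, while the VEX-encoded AVX_VNNI form on the E cores exposes the same operation, and it ignores the tail where n isn't a multiple of 32):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    // u8 x i8 dot product; dpbusd multiplies four u8*i8 pairs per 32-bit lane
    // and accumulates, so one instruction does 32 multiply-adds per 256-bit vector.
    // Build with AVX512-VNNI+VL (or the AVX_VNNI equivalent) enabled.
    int32_t dot_u8_i8(const uint8_t* a, const int8_t* b, size_t n) {
        __m256i acc = _mm256_setzero_si256();
        for (size_t i = 0; i + 32 <= n; i += 32) {
            __m256i va = _mm256_loadu_si256((const __m256i*)(a + i));
            __m256i vb = _mm256_loadu_si256((const __m256i*)(b + i));
            acc = _mm256_dpbusd_epi32(acc, va, vb);
        }
        int32_t lanes[8];
        _mm256_storeu_si256((__m256i*)lanes, acc);
        int32_t sum = 0;
        for (int k = 0; k < 8; ++k) sum += lanes[k];
        return sum;  // tail elements (n % 32) omitted for brevity
    }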
alfiedotwtf
a month ago
A former Intel CEO even wrote a book where every product was planned 20+ years in advance.
Imagine planning 20 years in advance when Moore’s Law was still going strong. Come to think of it, Moore was also CEO of Intel lol
david-gpu
a month ago
> If you just wrote your SIMD in CUDA 15 years ago, NVidia compilers would have given you maximum performance across all NVidia GPUs rather than being forced to write and rewrite in SSE vs AVX vs AVX512
NVidia compilers would have compiled your code into something functional, but if you want to approach peak performance you need to at least tweak your kernels, and sometimes rewrite them from scratch. See for example the various MMA instructions that were introduced over time.
Edit: I see somebody made a similar comment and you addressed it. Sorry for the churn.
dist-epoch
a month ago
> If you just wrote your SIMD in CUDA 15 years ago, NVidia compilers would have given you maximum performance across all NVidia GPUs
That's not true. For maximum performance you need to tweak the code to a particular GPU model/architecture.
Intel has SSE/AVX/AVX2/AVX512, but CUDA has like 10 iterations of this (increasing capabilities). Code written 15 years ago would not use modern capabilities, like more flexible memory access, atomics.
dragontamer
a month ago
Maximum performance? Okay, you'll have to upgrade to ballot instructions or whatever and rearchitect your algorithms. (Or to other wavefront/voting/etc. instructions that have been invented since, especially those 4x4 matrix multiplication AI instructions.)
But the CUDA -> PTX intermediate code has allowed for significantly more flexibility. For crying out loud, the entire machine code (aka SASS) of NVidia GPUs has been cycled out at least 4 times in the past decade (128-bit bundles, changes to instruction formats, acquire/release semantics, etc. etc.).
It's amazing what backwards compatibility NVidia has achieved in the past 15 years thanks to this architecture. SASS changes so dramatically from generation to generation but the PTX intermediate code has stayed highly competitive.
dist-epoch
a month ago
Intel code from 15 years ago also runs today. But it will not use AVX512.
Which is the same with PTX, right? If you didn't use the tensor core instructions or wavefront voting in the CUDA code, the PTX generated from it will not use them either, and NVIDIA will not magically add those capabilities in when compiling to SASS.
Maybe it remains competitive because the code is inherently parallel anyway, so it will naturally scale to fill the extra execution units of the GPU, which is where most of the improvement is generation to generation.
While AVX code can't automatically scale to use the AVX512 units.
dragontamer
a month ago
It's not the same. AVX2 instructions haven't changed and never will change.
In contrast, NVidia can go from 64-bit instruction bundles to 128-bit machine code (96-bit instruction + 32-bit control information) between Pascal (compute capability 6) and Volta (compute capability 7), and all the old PTX code just recompiles to the new instruction format and takes advantage of all the new memory barriers added in Volta.
Having a PTX translation layer is a MAJOR advantage for the NVidia workflow.
ashvardanian
a month ago
There is still a lot of similarity between CPU and GPU programming - between AVX and PTX. Different generations of CPU cores handle the same AVX2 instructions differently. The microcode changes and the schedulers change, but the process is transparent to the user, similar to PTX.
mmoskal
a month ago
I imagine there is an order of magnitude of difference between how much you can translate in software, with a large memory and a significant time budget to work with, and what you can do in microcode.
dragontamer
a month ago
Most CPU instructions are 1-to-1 with their microcode. I dare say microcode is nearly irrelevant; any high-performance instruction (e.g. multiply, add, XOR) decodes to a single micro-op anyway.
Load/store are memory dependent in all architectures, so that's a different story, as CPUs and GPUs have completely different ideas of how caches should work. (CPUs aim for latency; GPUs aim for bandwidth, plus incredibly large register files that hide latency through high occupancy.)
-------------
That being said: reorder buffers on CPUs are well over 400 instructions these days, and super-large cores (like Apple's M4) are apparently on the order of 600 to 800 instructions.
Reorder buffers are _NOT_ translation. They're Tomasulo's algorithm (https://en.wikipedia.org/wiki/Tomasulo%27s_algorithm). If you want to know how CPUs do out-of-order, study that.
I'd say CPUs have small architectural register spaces (16 registers, maybe 32), but large register files of maybe 300 or 400+. Tomasulo's algorithm is what lets the core access those registers out of order.
You should think of instructions like "mov rax, [memory]" as closer to "rax = malloc(register); delayed-load(rax, memory); out-of-order execute all instructions ahead of us in the instruction stream that don't use RAX".
Tomasulo's algorithm means using a ~300-entry register file to _pretend_ to be just 16 architectural registers. Those 300 registers keep the data available out of order and let you keep executing. Registers in modern CPUs are closer to unique_ptr<int> in C++: assigning to one frees the old physical register (via the reorder buffer) and mallocs a new one off the register file.
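A toy sketch of that renaming idea (purely illustrative, nothing like how real hardware or a real simulator is written):

    #include <vector>

    // Toy register renamer: 16 architectural names backed by a big physical
    // register file, so a write to "rax" never waits on older readers of rax.
    struct Renamer {
        std::vector<int> rat;        // register alias table: arch reg -> phys reg
        std::vector<int> free_list;  // physical registers not currently in use

        explicit Renamer(int num_phys) : rat(16) {
            for (int a = 0; a < 16; ++a) rat[a] = a;           // initial mapping
            for (int p = 16; p < num_phys; ++p) free_list.push_back(p);
        }

        // Destination of "mov rax, [mem]": grab a *fresh* physical register
        // (the "malloc"); the old one is freed later, when the write retires.
        // (A real core also stalls when the free list runs dry.)
        int rename_dest(int arch) {
            int phys = free_list.back();
            free_list.pop_back();
            rat[arch] = phys;
            return phys;
        }

        // Sources just read whatever the architectural name currently points at.
        int rename_src(int arch) const { return rat[arch]; }
    };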
janwas
a month ago
I hope people aren't writing directly to AVX2. When using a wrapper such as Highway, you get exactly this kind of update after a recompile, or even just by running your code on a CPU that supports newer instructions.
The cost is that the binary carries around both AVX2 and AVX-512 codepaths, but that is not an issue IMO.
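Conceptually it boils down to something like this (a hand-rolled sketch using GCC/Clang builtins, not Highway's actual dispatch machinery):

    // Both codepaths live in the same binary; the right one is picked at runtime.
    __attribute__((target("avx512f"))) static void kernel_avx512(float* p, int n) { /* ... */ }
    __attribute__((target("avx2")))    static void kernel_avx2(float* p, int n)   { /* ... */ }

    void kernel(float* p, int n) {
        if (__builtin_cpu_supports("avx512f"))
            kernel_avx512(p, n);   // used automatically on newer CPUs
        else
            kernel_avx2(p, n);     // fallback path shipped alongside it
    }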
jandrewrogers
a month ago
Many use cases for SIMD aren't trivially expressible through wrappers and abstractions. It is sometimes cleaner, easier, and produces more optimized codegen to write the intrinsics directly. It isn't ideal but it often produces the best result for the effort involved.
An issue with the abstractions that does not go away is that the optimal code architecture -- well above the level of the SIMD wrappers -- is dependent on the capabilities of the silicon. The wrappers can't solve for that. And if you optimize the code architecture for the silicon architecture, it quickly approximates writing architecture-specific intrinsics with an additional layer of indirection, which significantly reduces any notional benefit from the abstractions.
The wrappers can't abstract enough, and higher level abstractions (written with architecture aware intrinsics) are often too use case specific to reuse widely.
janwas
25 days ago
Wrappers can be zero-overhead, so any claim of better codegen vs the underlying intrinsics sounds dubious. "best result for the [higher] effort involved" also contradicts my experience, so I ask for evidence.
One counterexample: our portable vqsort [1] outperforms AVX-512-specific intrinsics [2].
I agree that high-level design may differ. You seem aware that Highway, and probably also other wrappers, supports specializing code for some target(s), but possibly misunderstand how, given the "additional layer of indirection" claim. Wrappers give you a portable baseline, and remove some of the potholes and ugly syntax, but boil down to inlined wrapper functions.
If you want to specialize, that is supported. And what is the downside? Even if you say the benefit of a wrapper is reduced vs manually written intrinsics (and reinventing all the workarounds for their missing instructions), do you not agree that the benefit is still nonzero?
[1]: https://github.com/google/highway/tree/master/hwy/contrib/so... [2]: https://github.com/Voultapher/sort-research-rs/blob/38f37eef...
saagarjha
25 days ago
The downside is that you write an implementation in Highway, find that it doesn't perform how you want, and then you have to rewrite it.
janwas
25 days ago
Curious - how is/was performance helped by rewriting? Why not reach out to us, to see if it can be fixed in the library - wouldn't that be cheaper than rewriting?
saagarjha
25 days ago
I’ve moved on to other things so I can’t really give details anymore. I understand this is annoying to hear as someone who works on that library, but I want to say that your comment is also annoying, for different reasons which mostly answer your question, so I’ll explain anyway.
Highway is (I feel not very controversially) kind of like a compiler but worse at its job. It’s not meant to be as general and it only targets a limited set of code, namely code that is annotated to vectorize well. But looking at it as a compiler is kind of useful: it’s supposed to make writing faster code easier and more automatic. Sometimes compilers are not able to do this, just as Highway can’t either. Maybe its design lacks the expressiveness to represent the algorithm people want. Perhaps it doesn’t quite lower to the optimal code. Maybe it turns out that so little of the operation maps to the constructs that a huge amount needs to go through the escape hatch that you offer, at which point it’s not really worth using the library anyway. In that situation, given an existing and friendly relationship, I would be happy to reach out. But this is a cost to me, because I need to simplify and generalize the thing I want. Then I hand it to you and you decide how you want to tackle it, if at all. All the while I’m waiting and I have code that needs to be written. This is a cost, and something that as an engineer I weigh against just using the intrinsics directly, which I know do exactly what I need but with higher upfront and maintenance costs. When you see someone write their own assembly instead of letting the compiler do it for them, they’re making their version of the same tradeoff.
janwas
24 days ago
Thank you for sharing your thoughts!
> it’s supposed to make writing faster code easier and more automatic
Agree with this viewpoint. I suppose that makes it compiler-like in spirit, though much simpler.
I also agree that waiting for input/updates is a cost. What still surprises me is that you seem to be able to do something differently with intrinsics, while believing this is not possible as a user of Highway. It is indeed possible to call _mm_fixupimm_pd(v1.raw, v2.raw, v3.raw, imm), and the rest of your code can be portable. I would be surprised if heavy usage were made of such escape hatches, but it's certainly interesting to discuss any cases that arise.
I do respect your decision, and that you make clear that raw intrinsics have higher upfront and maintenance costs. I suppose it's a matter of preference and estimating the return on the investment of learning the Highway vocabulary (=searching x86_128-inl.h for the intrinsic you know).
Personally, I find the proliferation of ISAs makes a clear case against hand-written kernels. But perhaps in your use case, x86 will continue to be the only target of interest. Fair enough.
imtringued
a month ago
Most video encoders and decoders consist of kernels with hand written SIMD instructions/intrinsics.
janwas
a month ago
Agreed. FWIW we demonstrated with JPEG XL (image codec, though also with animation 'video' support) that it is possible to write such kernels using the portable Highway intrinsics.
Remnant44
a month ago
I would wager that most real world SIMD use is with direct intrinsics.
dragontamer
a month ago
> I hope people aren't writing directly to AVX2.
Did you not read the article? It's using AVX intrinsics and NEON intrinsics.
janwas
25 days ago
I did, and I truly do not understand why some people do this. As shown in the reddit comments on this article [1], the initial intrinsics version was quite suboptimal and clearly worse than portable code [2].
When not busy unnecessarily rewriting everything for each ISA, it is easier to see and have time for vital optimizations such as unrolling :)
[1]: https://www.reddit.com/r/cpp/comments/1gzob1g/understanding_... [2]: https://github.com/google/highway/blob/master/hwy/contrib/do...
saagarjha
25 days ago
This is not really fair or true. Nvidia changes the meaning of PTX when they want to. For example, the handling of warp thread divergence is something they changed in an architecture revision, technically breaking existing code. With SM90 (Hopper) they have even started including unstable features in PTX for which they make even weaker promises. And of course everyone who cares about performance is rewriting their kernels (or using someone else's rewritten kernels) for each new architecture. I honestly do not think it is fair to compare this to the CPU landscape, which has much stronger backwards compatibility guarantees.
variadix
a month ago
How much of this is because CUDA is designed for GPU execution and because the GPU ISA isn’t a stable interface? E.g. new GPU instructions can be utilized by new CUDA compilers for new hardware because the code wasn’t written to a specific ISA? Also, don’t people fine tune GPU kernels per architecture manually (either by hand or via automated optimizers that test combinations in the configuration space)?
dragontamer
a month ago
NVidia PTX is a very stable interface.
And the PTX-to-SASS compiler DOES do a degree of automatic fine-tuning between architectures. Nothing amazing, but it's a minor speed boost that has made PTX an easier 'assembly-like language' to build on top of.
janwas
a month ago
My understanding is that there is a lot of hand-writing (not just fine-tuning) going on. AFAIK CuDNN and TensorRT are written directly as SASS, not CUDA. And the presence of FP8 in H100, but not A100, would likely require a complete rewrite.
dragontamer
a month ago
Cub, thrust and many other libraries that make those kernels possible don't need to be rewritten.
When you write a merge sort in CUDA, you can keep it across all versions. Maybe the new instructions can improve a few corner cases, but it's not like AVX to AVX512 where you need to rewrite everything.
Ex: https://github.com/NVIDIA/cub/blob/main/cub/device/device_me...
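E.g. a Thrust sort call like the following has kept compiling and working unchanged across GPU generations (a trivial made-up example, not from any particular project):

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>

    // Same source across CUDA versions; the library picks the implementation
    // for whatever GPU it is compiled for / run on.
    void sort_keys(thrust::device_vector<int>& keys) {
        thrust::sort(keys.begin(), keys.end());
    }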
janwas
25 days ago
I agree not everything needs to be rewritten. And neither does code using an abstraction such as Highway, so we can stop beating that dead horse.
synack
a month ago
I’ve been playing with a new Lunar Lake laptop and they’ve complicated things even further with the Neural Processing Unit (NPU)
Now if your vectors are INT8/FP8 you’re supposed to shovel them into this accelerator via PCIe, rather than packing into registers for AVX512.
I wish they’d just pick an interface for vector ops and stick with it.
janwas
a month ago
Max performance is a stretch - recompilation would not utilize tensor cores, right?
"too hard for mainstream programmers" seems overly pessimistic. I've run several workshops where devs have written dot-product kernels using Highway after 30 minutes of introduction.
kristianp
a month ago
They said intrinsics. Highway is an abstraction on top of intrinsics.
janwas
25 days ago
OK :) A thin abstraction, though. If comparing with alternative categories such as domain-specific language or autovectorization, I'd still classify Highway as intrinsics, just portable and easier to use.
snihalani
a month ago
> software methodology to interact with SIMD using GPU-like languages should be a priority.
What's your opinion on sycl?
jabl
a month ago
Sometimes I wonder about an alternative-history scenario where CPU ISAs would have chosen a SIMT-style model instead of SIMD. "Just" have something like fork/join instructions to start/stop vector mode, and otherwise use the standard scalar instructions in both scalar and vector mode. It would have avoided a lot of combinatorial explosion in instructions. (Of course you'd have to do something for cross-lane operations, and later tensor instructions, etc.)
janwas
a month ago
Not sure why SIMT would help; it requires more compiler transforms than if the code is written for packets/vectors or whatever we want to call them. As you note, cross-lane is a key part of a good SIMD abstraction. Vulkan calls it "subgroups", but from where I sit it's still SIMD.
ip26
a month ago
Is CUDA not more analogous to using MKL, rather than AVX?