ipsum2
5 days ago
As someone who worked in the ML infra space: Google, Meta, xAI, Oracle, Microsoft, and Amazon have clusters that perform better than the highest-performing cluster on Top500. They don't submit because there's no reason to, and some want to keep the size of their clusters a secret. They're all running Nvidia. (Except Google, who uses TPUs and Nvidia.)
> El Capitan – we don’t yet know how big of a portion yet as we write this – with 43,808 of AMD’s “Antares-A” Instinct MI300A devices
By comparison, xAI announced that they have 100k H100s. MI300A and H100s have roughly similar performance. Meta says they're training on more than 100k H100s for Llama-4, and have the equivalent of 600k H100s worth of compute. (Note that compute and networking can be orthogonal.)
Also, Nvidia B200s are rolling out now. They offer 2-3x the performance of H100s.
maratc
5 days ago
> Nvidia B200s ... offer 2-3x the performance of H100s
For ML, not for HPC. ML and HPC are two completely different, only loosely related fields.
ML tasks do well with low precision: 16-bit and 8-bit precision are fine, and arguably good results can be achieved even with 4-bit precision [0][1]. That won't do for HPC tasks such as predicting global weather or computational biology; one would need 64- to 128-bit precision for those.
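To make the precision point concrete, here is a minimal sketch (not from the thread) of a long running sum in half precision versus double precision. It assumes Python 3.6+ and uses the standard library's `struct` format `"e"` to round to IEEE 754 FP16; the helper names `to_fp16` and `accumulate`, the step size, and the iteration count are all made up for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float (FP64) to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def accumulate(step: float, count: int, use_fp16: bool) -> float:
    """Add `step` to a running total `count` times.

    With use_fp16=True the step and the running total are rounded to FP16
    after every addition, mimicking an accumulator that only has 16 bits.
    """
    if use_fp16:
        step = to_fp16(step)
    total = 0.0
    for _ in range(count):
        total += step
        if use_fp16:
            total = to_fp16(total)
    return total


# Illustrative numbers only: 10,000 additions of 0.01 should total 100.
fp64_sum = accumulate(0.01, 10_000, use_fp16=False)
fp16_sum = accumulate(0.01, 10_000, use_fp16=True)

print(f"FP64 running sum: {fp64_sum:.6f}")
print(f"FP16 running sum: {fp16_sum:.6f}")
```

In FP64 the total lands very close to 100. In FP16 the running sum stalls far short of that, because once the total is large enough the step is smaller than half the spacing between adjacent FP16 values and every further addition rounds away. Long time-stepped simulations accumulate in a similar way, which is one reason the HPC side cares about wide formats.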
Nvidia needs to decide how to divide the billions of transistors on their new silicon. Greatly oversimplifying, they can choose to make one of the following:
* Card A with *n* FP64 cores, or
* Card B with *2n* FP32 cores, or
* Card C with *4n* FP16 cores, or
* Card D with *8n* FP8 cores, or (theoretically)
* Card E with *16n* FP4 cores (not sure if FP4 is a thing).
Card A would give HPC guys n usable cores, and it would give ML guys n usable cores. On the other end, Card E would give ML guys 16n usable cores (and zero usable cores for HPC guys). It's no wonder that the HPC crowd wants Nvidia to produce Card A, while the ML crowd wants Nvidia to produce Card E. Given that all the hype and the money are currently with the ML guys (and $NVDA reflects that), Nvidia will make a combination of different cores that is much, much closer to Card E than it is to Card A. Their new offerings are arguably worse than their older offerings for HPC tasks, and the feeling among the HPC crowd is that "Nvidia and AMD are in the process of abandoning this market".
[0] https://papers.nips.cc/paper/2020/file/13b919438259814cd5be8...
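The Card A to Card E trade-off can be written out as a small sketch. It encodes only the comment's own simplification: a fixed transistor budget buys n FP64 cores, halving the width doubles the core count, and a core narrower than the workload's required precision is useless to that workload. The names `CARD_BITS`, `core_count`, and `usable_cores` are hypothetical, and real GPUs mix several core types on one die.

```python
# Precision (in bits) of the single core type on each hypothetical card.
CARD_BITS = {"A": 64, "B": 32, "C": 16, "D": 8, "E": 4}


def core_count(card_bits: int, n: int = 1) -> int:
    """Cores that fit on the die, as a multiple of n FP64 cores.

    Assumes (per the comment's oversimplification) that area scales
    linearly with bit width, so halving the width doubles the count.
    """
    return n * 64 // card_bits


def usable_cores(card_bits: int, min_bits_needed: int, n: int = 1) -> int:
    """Cores a workload can use if it needs at least `min_bits_needed` bits."""
    if card_bits < min_bits_needed:
        return 0  # cores are too narrow for this workload
    return core_count(card_bits, n)


for name, bits in CARD_BITS.items():
    hpc = usable_cores(bits, min_bits_needed=64)  # HPC assumed to need FP64
    ml = usable_cores(bits, min_bits_needed=4)    # ML assumed to tolerate FP4
    print(f"Card {name}: HPC gets {hpc}n cores, ML gets {ml}n cores")
```

Under these assumptions only Card A is of any use to an FP64 workload, while the ML workload gains cores with every step toward Card E, which is the tension the comment describes.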
dragontamer
5 days ago
Doesn't multiplier area scale as O(n^2 * log(n))? (At least, I'm pretty sure the Wallace Tree Multiplier circuit is somewhere in that order.)
So a 64-bit multiplier is something like 32x more area than a 16-bit multiplier.
But what you say is correct for RAM area or the number of bits you need for register space. So taken holistically, it's difficult to say...
Okay, 64-bit FP is only like 53-bits and 16-bit FP is actually like 11 bits. But you know what I mean. I'm still doing quick napkin math here, nothing formal.
-------
We can ignore adders and subtractor circuits because they are so small. Division is often implemented as reciprocal followed by multiplication circuits for floating point (true division is very expensive).
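The napkin math above can be checked with a few lines. This sketch takes the comment's O(n^2 * log(n)) area estimate at face value (it is the commenter's guess, not a verified circuit model) and compares both the full 64-bit versus 16-bit widths and the 53-bit versus 11-bit significand widths mentioned above. The function names are hypothetical.

```python
import math


def multiplier_area(bits: int) -> float:
    """Relative multiplier area under the assumed n^2 * log2(n) scaling."""
    return bits ** 2 * math.log2(bits)


def area_ratio(wide_bits: int, narrow_bits: int) -> float:
    """How many times larger the wide multiplier is than the narrow one."""
    return multiplier_area(wide_bits) / multiplier_area(narrow_bits)


full_width_ratio = area_ratio(64, 16)    # whole-word widths
significand_ratio = area_ratio(53, 11)   # FP64 vs FP16 significand widths

print(f"64-bit vs 16-bit: {full_width_ratio:.1f}x")
print(f"53-bit vs 11-bit: {significand_ratio:.1f}x")
```

Under that assumed scaling the two ratios bracket the "something like 32x" figure: roughly 24x for the whole-word widths and a bit under 40x for the significands. Either way it is well above the 4x implied by a purely linear cost per bit.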
touisteur
5 days ago
With the B100 somehow announced to have lower scalar FP64 throughput than the H100 (did they remove the DP tensor cores?), one will have to rely on Ozaki schemes (DGEMM with int8 tensor cores). Much of the recent work on mixed-precision linear algebra shows there's a lot of computing power to be harnessed from Tensor Cores. One of the problems of HPC now is a degree of ossification in some codebases (or a shortage of people available to port, code, and optimize). You shouldn't have to rewrite everything every 5 years, but the hardware vendors go where they go, and we still haven't found the right level of abstraction to avoid big porting efforts.
layla5alive
5 days ago
You've heard of SIMD - it's possible to do both, in terms of throughput, with instruction/scheduler/port complexity overhead of course.
ipsum2
5 days ago
Yes, that's a great point that I missed. From anecdotal evidence, it seems more people are using supercomputers for ML use cases that would traditionally have been handled by HPC methods (e.g. training models for weather forecasts).
zekrioca
5 days ago
The Top500 list is useful as a public, standardized baseline that is straightforward and has kept a predictable cadence for more than 30 years. It is trickier to compare cloud infras due to their heterogeneity, fast pace, and, more importantly, the lack of standardized tests, although MLCommons [1] has been very keen on helping with that.
makeitdouble
5 days ago
If I understand your comment correctly, we're taking a stable but not that relevant metric, because the real players of the market are too secretive, fast and far ahead to allow for simple comparisons.
From a distance, it kinda sounds like listening to kids brag about their allowance while the adults don't want to talk about their salary, and try to draw wider conclusions from there.
zekrioca
5 days ago
It seems there was a misunderstanding, as I haven't made any value judgment about LINPACK.
Yes, LINPACK is indeed "old" with a heavy focus on compute power. However, its simplicity serves as a reliable baseline for the types of workflows that supercomputers are designed to handle. Also, at their core, most AI workloads perform essentially the same operations as HPC, albeit with less stability—which, I admit, is a feature, but likely the reason AI-focused systems do not prioritize LINPACK as much.
I am simply saying that any useful metric needs to be not only "stable" but also simple to grasp. Take Green500: it is probably a significant benchmark for understanding how algorithms consume power, but it is "too complex" to explain. Yet many cloud providers with their AI supercomputers avoid competing against HPC supercomputers in this domain.
This avoidance isn’t necessarily due to secrecy but rather inefficiencies inherent to cloud systems. Consider PUE (Power Usage Effectiveness)—a highly misleading metric that cloud providers frequently tout. PUE can easily be manipulated, especially with the use of liquid cooling, which is why optimizing for it has become a major factor contributing to water disruptions in several large cities worldwide.
wbl
5 days ago
Even the DoE posts top 500 results when they commission a supercomputer.
makeitdouble
5 days ago
DoE has absolutely no incentive (nor need, I'd argue) to compare their supercomputers to commercially owned data center operations though.
Comparing their crazy expensive custom built HPC to massive arrays of customer grade hardware doesn't bring them additional funds, nor help them more PR wise than being the owner of the fastest individual clusters.
Being at the top of some heap is visibly one of their goals:
khm
5 days ago
DOE clusters are also massive arrays of customer grade hardware. Private cloud can only keep up in low precision work, and that is why they're still playing with remote memory access over TCP, because it's good enough for web and ML.
High precision HPC exists in the private cloud, but you only hear "we don't want to embarrass others" excuses because otherwise you would be able to calculate the cost.
On prem HPC is still very, very much cheaper than hiring out.
pclmulqdq
5 days ago
B200s have an incremental increase in FP64 and FP32 performance over H100s. Those are the number formats that HPC people care about.
The MI300A can get to 150% the FP64 peak performance that B200 devices can get, although AMD GPUs have historically underperformed their spec more than Nvidia GPUs. It's possible that B200 devices are actually behind for HPC.
cayleyh
5 days ago
Top line comparison numbers for reference: https://www.theregister.com/2024/03/18/nvidia_turns_up_the_a...
It does seem like Nvidia is prioritizing int8 / fp8 performance over FP64, which given the current state of the ML marketplace is a great idea.
nextos
5 days ago
MI300 also has decent performance in FP16 (~108 TFLOPS). Not as good as NVIDIA, but it's getting there. Does anyone have experience using these with JAX? Support is said to be decent, but I have no idea if it's good enough for research-oriented tasks, i.e. stable enough for training and inference.
llm_trw
5 days ago
A cluster is not a super computer.
The whole point of a super computer is that it acts as much like a single machine as possible, while a cluster is a soup of nearly independent machines.
kristjansson
5 days ago
> soup of nearly independent machines
that does a serious disservice to hyperscaler clusters.
llm_trw
5 days ago
Sure but it's closer to the truth than saying they have similar or more raw compute than a super computer.
almostgotcaught
5 days ago
i wish people wouldn't make stuff up just to sound cool.
like do you have actual experience with gov/edu HPC? i doubt it because you couldn't be more wrong - lab HPC clusters are just very very poorly (relative to FAANG) strewn together nodes. there is absolutely no sense in which they are "one single machine" (nothing is "abstracted over" except NFS).
what you're saying is trivially false because no one ever requests all the machines at once (except when they're running linpack to produce top500 numbers). the rest of the time the workflow is exactly like in any industrial cluster: request some machines (through slurm), get those machines, run your job (hopefully you distributed the job across the nodes correctly), release those machines. if i still had my account i could tell you literally how many different jobs are running right now on polaris.
bocklund
5 days ago
Actually, LLNL (the site of El Capitan) has a process for requesting Dedicated Application Time (a DAT) where you use up to a whole machine, usually over a weekend. They occur fairly regularly. Mostly it's lots of individual users and jobs, like you said though.
almostgotcaught
5 days ago
> where you use up to a whole machine
i mean rick stevens et al can grab all of polaris too but even so - it's just a bunch of nodes and you're responsible for distributing your work across those nodes efficiently. there's no sense in which it's a "single computer" in any way, shape or form.
llm_trw
5 days ago
The same way that you're responsible for distributing your single threaded code between cores on your desktop.
davrosthedalek
4 days ago
No. Threads run typically in the same address space. HPC processes on different nodes typically do not.
llm_trw
4 days ago
Define address space.
Cache is not shared between cores.
HPCs just have more levels of cache.
Lest you ignore the fact that infiniband is pretty much on par with top of the line ddr speeds for the matching generation.
moralestapia
4 days ago
>Lest you ignore the fact that infiniband is pretty much on par with top of the line ddr speeds for the matching generation.
You can't go faster than the speed of light (yet) and traveling a few micrometers will always be much faster than traversing a room (plus routing and switching).
Many HPC tasks nowadays are memory-bound rather than CPU-bound, memory-latency-and-throughput-bound to be more precise. An actual supercomputer would be something like the Cerebras chip, a lot of the performance increase you get is due to having everything on-chip at a given time.
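The speed-of-light argument can be put into rough numbers. This is a back-of-the-envelope sketch with made-up distances (10 micrometers on chip, a 30 m cable run across a machine room) and an assumed signal velocity of about two thirds of c in the cable. It gives propagation lower bounds only and ignores switching, routing, and serialization, which add more on the network side.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre


def propagation_delay_ns(distance_m: float, velocity_factor: float = 1.0) -> float:
    """One-way propagation delay in nanoseconds over `distance_m` metres.

    `velocity_factor` is the signal speed as a fraction of c (1.0 is the
    physical limit; cables are slower).
    """
    return distance_m / (SPEED_OF_LIGHT_M_PER_S * velocity_factor) * 1e9


# Assumed, illustrative distances; neither figure comes from the thread.
on_chip_ns = propagation_delay_ns(10e-6)                            # 10 micrometers at c
across_room_ns = propagation_delay_ns(30.0, velocity_factor=0.67)   # 30 m of cable

print(f"On-chip lower bound:     {on_chip_ns:.6f} ns")
print(f"Across-room lower bound: {across_room_ns:.1f} ns")
print(f"Ratio: {across_room_ns / on_chip_ns:,.0f}x")
```

Even before any switch or NIC is involved, the across-room hop costs on the order of a hundred nanoseconds or more, millions of times the on-chip bound. That is the gap the comment points at when it favors keeping everything on one chip.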
formerly_proven
4 days ago
There are four sentences in your comment.
None of them logically relate to another.
One is a question.
And the rest are wrong.
davrosthedalek
4 days ago
Really? How about: "This pointer is valid, has the same numeric value (address) and points to the same data in all threads". The point is not the latency nor bandwidth. The point is the programming/memory model. Infiniband maybe makes multiprocessing across nodes as fast as multiprocessing on a single node. But it's not multithreading.
imtringued
4 days ago
>Cache is not shared between cores.
I feel sorry for you if you believe this. It's not true physically nor is it true on the level of the cache coherence protocol nor is it true from the perspective of the operating system.
almostgotcaught
4 days ago
Tell me you've never run a distributed workload without telling me. You realize if what you were saying were true, HPC would be trivial. In fact it takes a whole lot of PhDs to manage the added complexity because it's not just a "single computer".
llm_trw
4 days ago
If you think parallelizing single threaded code is trivial ... well there's nothing else to say really.
almostgotcaught
4 days ago
Is there like a training program available for learning how to be this obstinate? I would love to attend so that I can win fights with my wife.
davrosthedalek
4 days ago
Maybe llm_trw is your wife?
bravetraveler
5 days ago
Put slurm on it, bam. Supercomputer.
danpalmer
5 days ago
Google is running its own TPU hardware for internal workloads. I believe Nvidia is just resold for cloud customers.
ipsum2
5 days ago
Nvidia GPUs are also used for inference on Google products. It just depends on availability.
deeth_starr_v
5 days ago
Not true. Apple trained some models on their TPU
danpalmer
5 days ago
Apologies, to be clear what I meant was that to my knowledge Google doesn't use GPUs for its own stuff, but does sell both TPUs and GPUs to others on Cloud.
Also, to be clear, I have no internal info about this, I'm going based on external stuff I've seen.
okdood64
5 days ago
zitterbewegung
5 days ago
Generally HPC compute has lower margins, similar to consoles. It makes sense that AMD would fight for that contract more than NVIDIA, similar to how IBM stopped doing this. It's sort of like comparing Apples to Raspberry Pis.
geerlingguy
5 days ago
Hey now, I compare Apples to Raspberry Pis regularly :)
formerly_proven
5 days ago
China has been absent from TOP500 for years as well.
lobochrome
5 days ago
B200 is very much not rolling out because NVIDIA, after the respin, doesn't have the thermals under control (yet).
Your other points may be valid.
deeth_starr_v
5 days ago
Source?
lobochrome
4 days ago
Reuters!
_zoltan_
4 days ago
don't spread FUD please.
almostgotcaught
5 days ago
Ya exactly - no one cares about top500 outside of academia (literally have never heard it come up at work). So this is like the gold star (participation award) of DCGPU competition.