A day in the life of the fastest supercomputer

79 points, posted 6 days ago
by nradclif

59 Comments

kkielhofner

3 days ago

I have a project on Frontier - happy to answer any questions!

Funny story about Bronson Messer (quoted in the article):

On my first trip to Oak Ridge we went on a tour of “The Machine”. Afterwards we were hanging out on the observation deck and got introduced to something like 10 people.

Everyone at Oak Ridge is just Tom, Bob, etc. No titles or any of that stuff - I’m not sure I’ve ever heard anyone refer to themselves or anyone else as “Doctor”.

Anyway, the guy to my right asks me a question about ML frameworks or something (don’t even remember it specifically). Then he says “Sorry, I’m sure that seems like a really basic question, I’m still learning this stuff. I’m a nuclear astrophysicist by training”.

Then someone yells out “AND a three-time Jeopardy champion”! Everyone laughs.

You guessed it, guy was Bronson.

Place is wild.

ai_slurp_bot

3 days ago

Hey, my sister Katie is the reason he wasn't a 4 day champ! Beat him by $1. She also lost her next game

kkielhofner

3 days ago

Hah, that’s amazing!

Now I get to tell him this story next time I see him :).

johnklos

3 days ago

> anyone refer to themselves or anyone else as "Doctor".

Reminds me of the t-shirt I had that said, "Ok, Ok, so you've got a PhD. Just don't touch anything."

dmd

3 days ago

I think the moment I walked back into my defense and they said "congratulations, Doctor Drucker" was the last time anyone ever called me Doctor, except possibly a hotel clerk when I selected 'Dr' as my honorific.

It's just not in the culture, assuming you mostly work among other PhDs.

kkielhofner

3 days ago

Growing up my dad was a very well known PhD in his field (Occupational Therapy).

There were quite a few people who insisted on using Doctor when referring to others, calling themselves Doctor, etc.

He never did but I experienced it quite a bit.

dmd

3 days ago

I'm talking about scientific/research PhDs, not medical. The title is absolutely used in medical.

dgacmu

2 days ago

Do you happen to remember where you got that shirt?

Asking for a friend. The friend is me. I desperately want that shirt (assuming it's well designed).

It will complement my "Rage Against The Machine Learning" shirt that I'm wearing right now.

johnklos

2 days ago

I had it custom made. Sorry that I can't point you to something already done :(

It's much easier and cheaper than in the past to just go to AliExpress and find a shop with good feedback and cotton shirts, upload a graphic (even if it's just text in a specific font and size), and wait a month. I usually pay around $10 each.

kaycebasques

3 days ago

What's the documentation like for supercomputers? I.e. when a researcher gets approved to use a supercomputer, do they get lots of documentation explaining how to set up and run their program? I got the sense from a physicist buddy that a lot of experimental physics stuff is shared informally and never written down. Or maybe each field has a couple popular frameworks for running simulations, and the Frontier people just make sure that Frontier runs each framework well?

physicsguy

2 days ago

Documentation is mixed but it’s usually similar between clusters.

You typically write a bash script with some metadata lines at the top that say how many nodes you want, how many cores on those nodes, and what accelerator hardware, if any, you need.

Then typically it's just setting up the environment to run your software. On most supercomputers you need to use environment modules ('module load gcc@10.4') to load up compilers, parallelism libraries, other software, etc. You can sometimes set this stuff up on the login node to try things out and make sure they work, but generally you'll get an angry email if you run processes for more than 10 minutes, because login nodes are a shared resource.
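To make that concrete, here is a minimal sketch of such a batch script for a Slurm-based cluster. The directive names, module versions, and program name are illustrative and vary by site:

    #!/bin/bash
    #SBATCH --job-name=my_sim          # hypothetical job name
    #SBATCH --nodes=4                  # how many nodes
    #SBATCH --ntasks-per-node=32       # how many cores (MPI ranks) per node
    #SBATCH --gpus-per-node=4          # accelerator hardware, if any
    #SBATCH --time=02:00:00            # wall-clock limit

    # set up the environment with environment modules (names differ per cluster)
    module load gcc/10.4.0
    module load openmpi/4.1.4

    # launch the program across all allocated nodes
    srun ./my_simulation input.dat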

There’s a tension because it’s often difficult to get this right, and people often want to do things like ´pip install <package>’ but you can leave a lot of performance on the table because pre-compiled software usually targets lowest common denominator systems rather than high end ones. But cluster admins can’t install every Python package ever and precompile it. Easybuild and Spack aim to be package managers that make this easier.

Source: worked in HPC in physics and then worked at a University cluster supporting users doing exactly this sort of thing.

ok123456

2 days ago

You run things more or less like you do on your Linux workstation. The only difference is that you run your top-level script or program through a batch processing system on a head node.

You typically develop programs with MPI/OpenMP to exploit multiple nodes and CPUs. In Fortran, this entails a few pragmas and compiler flags.
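For a sense of what that looks like, here is a minimal hybrid MPI + OpenMP sketch, shown in C rather than Fortran (in Fortran the OpenMP part would be '!$omp' directives instead of pragmas). Compile with something like 'mpicc -fopenmp' and launch with 'srun' or 'mpirun':

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* MPI spreads ranks across nodes */

        /* OpenMP fans out across the cores within each node */
        #pragma omp parallel
        {
            printf("rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }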

sega_sai

3 days ago

I know that the DOE's NERSC supercomputing center has a lot of documentation: https://docs.nersc.gov/getting-started/ . Plus they also have weekly events where you can ask any questions about code, optimisation, etc. (I have never attended those, but regularly get emails about them.)

Enginerrrd

18 hours ago

My understanding is that usually there is a subject matter expert that will help you adapt your code to the specific machine to get optimal performance when it's your turn for compute time.

tryauuum

3 days ago

Google openmpi, mpirun, slurm. It's not complex.

It's like Kubernetes, but invented long before Kubernetes.

cubefox

3 days ago

> With its nearly 38,000 GPUs, Frontier occupies a unique public-sector role in the field of AI research, which is otherwise dominated by industry.

Is it really realistic to assume that this is the "fastest supercomputer"? What are estimated sizes for supercomputers used by OpenAI, Microsoft, Google etc?

Strangely enough, the Nature piece only mentions possible secret military supercomputers, but not ones used by AI companies.

patagurbon

2 days ago

There's a pretty big difference between the workloads these supercomputers run and those of big LLM training (to be clear, hyperscalers also often have "supercomputers" for rent that look more like the DoE laboratory machines).

AI models are trained using one of {Data parallelism, tensor parallelism, pipeline parallelism}. These all have fairly regular access patterns, and want bandwidth.

Traditional supercomputer loads {Typically MPI or SHMEM} are often far more variable in access pattern, and synchronization is often incredibly carefully optimized. Bandwidth is still hugely important here, but insane network switches and topologies tend to be the real secret sauce.

More and more, these machines are built using commodity hardware (instead of stuff like Knights Landing from Intel), but the switches and network topology are still often pretty bespoke. This is required for really fine-tuned algorithms like distributed LU factorization, or matrix multiplication algorithms like COSMOS. The hyperscalers often want insane levels of commodity hardware, including network switches, instead.

The AI supercomputers you're citing are getting a lot closer, but they are definitely more disaggregated than DoE lab machines by nature of the software they run.

mnky9800n

2 days ago

Where can you learn more about supercomputing?

elicksaur

2 days ago

Microsoft has a system at the current #3 spot on the Top500 list. It uses 14.4k Nvidia H100s and got about 1/2 the flops of Frontier.

It’s the fastest publicly disclosed. As far as private concerns, I feel like a “prove it” approach is valid.

https://www.top500.org/lists/top500/2024/06/

vaidhy

2 days ago

This is interesting for a different reason too: MS has 1/4 the number of nodes while claiming 1/2 the performance. If it were just a numbers game, the MS supercomputer gets much more performance per processor.

rcxdude

3 days ago

There is a difference between a supercomputer and just a large cluster of compute nodes: mainly this is in the bandwidth between the nodes. I suspect industry uses a larger number of smaller groups of highly-connected GPUs for AI work.

p1esk

3 days ago

Do you mean this supercomputer has slower internode links? What are its links? For example, xAI just brought up a 100k GPU cluster, most likely with 800Gbps internode links, or maybe even double that.

I think the main difference is in the target numerical precision: supercomputers such as this one focus on maximizing FP64 throughput, while GPU clusters used by OpenAI or xAI want to compute in 16 or even 8 bit precision (BF16 or FP8).

jasonwatkinspdx

2 days ago

It's not just about the link speeds, it's about the topologies used.

Google-style infrastructure uses aggregation trees. This works well for fan-out/fan-in communication patterns, but has limited bisection bandwidth at the core/top of the tree. This can be mitigated with Clos networks / fat trees, but in practice no one goes for full bisection bandwidth on these systems, as the cost and complexity aren't justified.

HPC machines typically use torus topology variants. This allows 2d and 3d grid style computations to be directly mapped onto the system with nearly full bisection bandwidth. Each smallest grid element can communicate directly with its neighbors each iteration, without going over intermediate switches.

Reliability is handled quite a bit differently too. Google-style infrastructure does this with elaborations of the MapReduce style: spot the stragglers or failures, and reallocate that work via software. HPC infrastructure puts more emphasis on hardware reliability.

You're right that FP32 and FP64 performance are more important in HPC, while Google apps are mostly integer-only, and ML apps can use lower-precision formats like FP16.

wickberg

2 days ago

Almost no modern systems are running Torus these days - at least not at the node level. The backbone links are still occasionally designed that way, although Dragonfly+ or similar is much more common and maps better onto modern switch silicon.

You're spot on that the bandwidth available in these machines hugely outstrips that in common cloud cluster rack-scale designs. Although full bisection bandwidth hasn't been a design goal for larger systems for a number of years.

p1esk

2 days ago

LambdaLabs GPU cluster provides internode bandwidth of 3.2Tbps: I personally verified it in a cluster of 64 nodes (8xH100 servers) and they claim it holds for up to 5k GPU cluster. What is the internode bandwidth of Frontier? Someone claimed it's 200Gbps, which, if true, would be a huge bottleneck for some ML models.

wickberg

2 days ago

Frontier is 4x 200Gbps links per node into the interconnect. The interconnect is designed for 540TB/s of bisection bandwidth. <https://icl.utk.edu/files/publications/2022/icl-utk-1570-202...>

Bisection bandwidth is the metric these systems will cite, and impacts how the largest simulations will behave. Inter-node bandwidth isn't a direct comparison, and can be higher at modest node counts as long as you're within a single switch. I haven't seen a network diagram for LambdaLabs, but it looks like they're building off 200Gbps Infiniband once you get outside of NVLink. So they'll have higher bandwidth within each NVLink island, but the performance will drop once you need to cross islands.

p1esk

2 days ago

I thought NVLink is only for communication between GPUs within a single node, no? I don't know what the size of their switches are, but I verified that within a 64 node cluster I got the full advertised 3.2Tbps bandwidth. So that's 4x as fast as 4x200Gbps, but 800Gbps is probably not a bottleneck for any real world workload.

AlotOfReading

2 days ago

It's 200 Gbps per port, per direction. That's the same as the Nvidia interconnect lambdalabs uses.

markstock

3 days ago

Each node has 4 GPUs, and each of those has a dedicated network interface card capable of 200 Gbps each way. Data can move right from one GPU's memory to another. But it's not just bandwidth that allows the machine to run so well, it's a very low-latency network as well. Many science codes require very frequent synchronizations, and low latency permits them to scale out to tens of thousands of endpoints.

p1esk

2 days ago

200 Gbps

Oh wow, that’s pretty bad.

wickberg

2 days ago

That's 200Gbps from that card to any other point in the other 9,408 nodes in the system. Including file storage.

Within the node, bandwidth between the GPUs is considerably higher. There's an architecture diagram at <https://docs.olcf.ornl.gov/systems/frontier_user_guide.html> that helps show the topology.

p1esk

2 days ago

I see, OK, I misinterpreted it as per node bandwidth. Yes, this makes more sense, and is probably fast enough for most workloads.

ungreased0675

6 days ago

I was hoping for a list of projects this system has queued up. It’d be interesting to see where the priorities are for something so powerful.

pelagicAustral

6 days ago

You can infer a little from this [0] article:

ORNL and its partners continue to execute the bring-up of Frontier on schedule. Next steps include continued testing and validation of the system, which remains on track for final acceptance and early science access later in 2022 and open for full science at the beginning of 2023.

UT-Battelle manages ORNL for the Department of Energy’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. The Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit energy.gov/science

[0] https://www.ornl.gov/news/frontier-supercomputer-debuts-worl...

iJohnDoe

3 days ago

The analogies used in this article were a bit weird.

Two things I’ve always wondered since I’m not an expert.

1. Obviously, applications must be written to distribute the load effectively across the supercomputer. I wonder how often this prevents useful work from being considered for the supercomputer.

2. It always seems like getting access to run anything on the supercomputer is very competitive or even artificially limited? A shame this isn't open to more people. That much processing power seems like it could be put to use for many more things.

msteffen

3 days ago

My former employer (Pachyderm) was acquired by HPE, who built Frontier (and sells supercomputers in general), and I’ve learned a lot about that area since the acquisition.

One of the main differences between supercomputers and, e.g., a datacenter is that in the former case, application authors do not, as a rule, assume hardware or network issues and engineer around them. A typical supercomputer workload will fail overall if any one of its hundreds or thousands of workers fails. This assumption greatly simplifies the work of writing such software, as error handling is typically one of the biggest, if not the biggest, sources of complexity in a distributed system. It makes engineering the hardware much harder, of course, but that's how HPE makes money.

A second difference is that RDMA (Remote Direct Memory Access: the ability for one computer to access another computer's memory without going through its CPU, since the network card can access memory directly) is standard. This removes much of the complexity of an RPC framework from supercomputer workloads. Also, the link-level protocol used has orders of magnitude lower latency than Ethernet, such that it's often faster to read memory on a remote machine than to do any kind of local caching.

The result is that the frameworks for writing these workloads let you more or less call an arbitrary function, run it on a neighbor, and collect the result in roughly the same amount of time it would’ve taken to run it locally.
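To illustrate the RDMA-style access described above (a generic sketch using standard MPI one-sided communication, not HPE's specific framework; the rank layout and values are made up), one process can read a neighbor's memory directly, with the transfer handled by the network card rather than the remote CPU:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank exposes one double in a "window" others can access directly */
        double local_value = 100.0 + rank;
        MPI_Win win;
        MPI_Win_create(&local_value, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* Read the neighbor's value without involving the neighbor's CPU */
        double remote_value = 0.0;
        int neighbor = (rank + 1) % size;

        MPI_Win_lock(MPI_LOCK_SHARED, neighbor, 0, win);
        MPI_Get(&remote_value, 1, MPI_DOUBLE, neighbor, 0, 1, MPI_DOUBLE, win);
        MPI_Win_unlock(neighbor, win);  /* transfer is complete after unlock */

        printf("rank %d read %.1f from rank %d\n", rank, remote_value, neighbor);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }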

guenthert

2 days ago

> A typical supercomputer workload will fail overall if any one of its hundreds or thousands of workers fail.

HPC applications were a driving force behind software checkpointing. If a job runs for days, it's not all that unlikely that one of hundreds of machines fails. At the same time, re-running a large job is fairly costly on such a system.

Now, while that exists, I don't know how commonly it's actually used. In my own, very limited, experience it wasn't, and job failures due to hardware failure were rare. But then, the cluster(s) I tended to were much smaller, up to some 100 nodes each.

nxobject

2 days ago

I wouldn’t be surprised if the nice guarantees given by scientific supercomputers came from the time when mainframes were the only game in town for scientific computing.

tryauuum

3 days ago

I feel like the name "supercomputer" is overhyped. It's just many normal x86 machines running Linux and connected with a fast network.

Here in Finland I think you can use the LUMI supercomputer for free, with the condition that the results should be publicly available.

NegativeK

3 days ago

I think you've used the "just" trap to trivialize something.

I'm surprised that Frontier is free with the same conditions; I expected researchers to need grant money or whatever to fund their time. Neat.

lokimedes

3 days ago

In the beginning they were just “Beowulf clusters” compared to “real” supercomputers. Isn’t it always like this, the romantic and exceptional is absorbed by the sheer scale of the practical and common once someone discovers a way to drive the economy at scale? Cars, aircraft, long-distance communications, now perhaps AI? Yet the words may still capture the early romance.

markstock

3 days ago

FYI: LUMI uses a nearly identical architecture as Frontier (AMD CPUs and GPUs), and was also made by HPE.

nxobject

2 days ago

I'm curious – how much do classified projects play into the workload of Frontier?

wickberg

2 days ago

Frontier runs unclassified workloads. Other Department of Energy systems, such as the upcoming "El Capitan" at LLNL (a sibling to Frontier, procured under the same contract) are used for classified work.

7373737373

3 days ago

So what is the actual utilization % of this machine?

nradclif

3 days ago

I don’t know the exact utilization, but most large supercomputers that I’m familiar with have very high utilization, like around 90%. The Slurm/PBS queue times can sometimes be measured in days.

wickberg

2 days ago

At the node level, these usually aim for around 90-95% allocated. Note that, compared to most "cloud" applications, achieving that usually involves a number of tricks at the system-scheduling level.

At some point, in order to concurrently allocate a 1000-node job, all 1000 nodes will need to be briefly unoccupied ahead of that, and that can introduce some unavoidable gaps in system usage. Tuning in the "backfill" scheduling part of the workload manager can help reduce that, and a healthy mix of smaller single-node short-duration work alongside bigger multi-day multi-thousand-node jobs helps keep the machine busy.
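For instance (an illustrative Slurm submission; the script name and limits are hypothetical), a small job with an accurate time limit is exactly what the backfill scheduler can slip into a gap while nodes drain for a big multi-node reservation:

    # short, accurately sized request: a good backfill candidate
    sbatch --nodes=1 --time=00:30:00 postprocess.sh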

dauertewigkeit

3 days ago

Don't the industry labs have bigger machines by now? I lost track.

wickberg

2 days ago

"Aurora" at Argonne National Labs is intended to be a bit bigger, but has suffered through a long series of delays. It's expected to surpass Frontier on the TOP500 list this fall once they some issues resolved. El Capitan at LLNL is also expected to be online soon, although I'm not sure if it'll be on the list this fall or next spring.

As others note, these systems are measured by running a specific benchmark - Linpack - and require the machine to be formally submitted. There are systems in China that are on a similar scale, but, for political reasons, have not formally submitted results. There are also always rumors around the scale of classified systems owned by various countries that are also not publicized.

Alongside that, the hyperscale cloud industry has added some wrinkles to how these are tracked and managed. Microsoft occupies the third position with "Eagle", which I believe is one of their newer datacenter deployments briefly repurposed to run Linpack. And they're rolling similar scale systems out on a frequent basis.

physicsguy

2 days ago

Fastest publicly known supercomputer…

langcss

3 days ago

Or worlds smallest cloud provider?

johnklos

2 days ago

The world's smallest cloud provider could be someone running a single Raspberry Pi Zero.

"Cloud" doesn't mean much more than "computer connected to the Internet".

CaliforniaKarl

3 days ago

That's a bit of an apples-and-oranges comparison. Cloud services normally have different design goals.

HPC workloads are often focused on highly parallel jobs, with high-speed and (especially) low-latency communications between nodes. Fun fact: in the NVIDIA DGX SuperPOD Reference Architecture [1], each DGX H100 system (which has eight H100 GPUs) has four InfiniBand NDR OSFP ports dedicated to GPU traffic. IIRC, each OSFP port operates at 200 Gbps (two lanes of 100 Gbps), allowing each GPU to effectively have its own IB port for GPU-to-GPU traffic.

(NVIDIA's not the only group doing that, BTW: Stanford's Sherlock 4.0 HPC environment [2], in their GPU-heavy servers, also uses multiple NDR ports per system.)

Solutions like that are not something you'll find at your typical cloud provider.

Early cloud-based HPC-focused solutions centered on workload locality, not just within a particular zone but within a particular part of a zone, using things like AWS Placement Groups [3]. More modern Ethernet-based providers will give you guides like [4], telling you how to supplement placement groups with directly accessible high-bandwidth network adapters, and in particular support for RDMA [4] or RoCE (RDMA over Converged Ethernet), which aims to provide IB-like functionality over Ethernet.

IMO, the closest analog you'll find in the cloud to environments like Frontier is going to be IB-based cloud environments from Azure HPC ('general' cloud) [5] and specialty-cloud folks like Lambda Labs [6].

[1]: https://docs.nvidia.com/dgx-superpod/reference-architecture-...

[2]: https://news.sherlock.stanford.edu/publications/sherlock-4-0...

[3]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placemen...

[4]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html

[5]: https://azure.microsoft.com/en-us/solutions/high-performance...

[6]: https://lambdalabs.com/nvidia/dgx-systems