Keeping 20k GPUs healthy

130 points, posted 8 days ago
by jxmorris12

62 Comments

bluedino

3 days ago

I help run a fleet of GPU servers, and I might see 1 DIMM or SSD failure for every 50-100 GPU failures.

I realize NVIDIA is just cranking them out as fast as they can, but the quality is terrible. They overheat, disappear after a reboot, fall off the bus, throw memory failures, and then mix in all the software crashes your users generate...

Our current server vendor is actually good at replacing them, unlike our previous vendor, but the failure rates are just insane. If any other component failed this much we'd have the vendor buy the servers back.

thundergolfer

3 days ago

Author here. That 1:50-100 ratio looks roughly right based on my research, but my numbers have GPUs faring even worse.

  Component                      Type       MTBF (yrs)  AFR
  ─────────────────────────────────────────────────────────
  SSD                            Hardware   ~100        ~1%
  RAM uncorrectable error        Hardware   ~75         ~1-4%
  NVIDIA A100 critical error†    Hardware   0.18 (65d)  -
  NVIDIA H100 critical error†    Hardware   0.15 (50d)  -

† “Critical error” refers to an NVIDIA Xid or sXid error which is not recoverable, requiring an application and GPU reset.

Only a minority of GPU 'failures' appear to be permanent hardware problems, such as row remapping errors. A lot seem to be, like another comment says, a consequence of operating too close to the operational limit, tipping over it, and then requiring a power cycle.
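
If you want to separate the permanent failures from the transient ones yourself, a minimal probe looks roughly like this. This is a sketch assuming the nvidia-ml-py (pynvml) bindings and an ECC-enabled datacenter card; what you do with the counters is up to you.

  import pynvml

  pynvml.nvmlInit()
  try:
      for i in range(pynvml.nvmlDeviceGetCount()):
          h = pynvml.nvmlDeviceGetHandleByIndex(i)
          temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)

          # Uncorrectable ECC errors accumulated since the last driver reload.
          ecc_unc = pynvml.nvmlDeviceGetTotalEccErrors(
              h, pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED, pynvml.NVML_VOLATILE_ECC)

          # Row-remapping counters (A100/H100-class HBM); may be unsupported elsewhere.
          try:
              corr, unc, pending, failed = pynvml.nvmlDeviceGetRemappedRows(h)
          except pynvml.NVMLError:
              corr = unc = pending = failed = None

          print(f"GPU {i}: {temp}C, uncorrectable ECC={ecc_unc}, "
                f"remapped rows corr/unc/pending/failed={corr}/{unc}/{pending}/{failed}")
  finally:
      pynvml.nvmlShutdown()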

YetAnotherNick

3 days ago

I ran 16x A100s on GCP for training workloads, and it was hard to keep them running for more than a few days, so that matches my numbers.

However, I think a lot of it is driver or other software issues. I remember switching from the PyTorch Docker image to NVIDIA's NGC images, and reliability increased very noticeably. Do you have data broken down by popular Docker images?

salynchnew

3 days ago

> operating too close to the operational limit, tipping over it, and then requiring a power cycle.

GPUs--they're just like us!

layoric

3 days ago

I'm quite surprised the A100 isn't much better, since the power levels for the Ampere cards are, I believe, a lot lower.

Does this mean that even a model which fits on a single server and trains for a few weeks will absolutely need a recovery process? Interested in people's experiences around this.
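
By "recovery process" I mean something like a checkpoint/resume loop. A minimal sketch of that pattern, assuming PyTorch (the model, path, and interval are placeholders):

  import os
  import torch

  CKPT = "checkpoint.pt"  # hypothetical path
  model = torch.nn.Linear(1024, 1024).cuda()
  opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

  start_step = 0
  if os.path.exists(CKPT):  # resume after a crash, reboot, or GPU swap
      state = torch.load(CKPT, map_location="cuda")
      model.load_state_dict(state["model"])
      opt.load_state_dict(state["opt"])
      start_step = state["step"] + 1

  for step in range(start_step, 100_000):
      x = torch.randn(32, 1024, device="cuda")
      loss = model(x).square().mean()
      opt.zero_grad()
      loss.backward()
      opt.step()

      if step % 1_000 == 0:
          # Write to a temp file, then rename, so a crash mid-save can't corrupt the checkpoint.
          torch.save({"model": model.state_dict(), "opt": opt.state_dict(), "step": step},
                     CKPT + ".tmp")
          os.replace(CKPT + ".tmp", CKPT)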

formerly_proven

3 days ago

GPU servers have always had crap reliability compared to a normal server (sticking eight GPUs on a baseboard complicates things). As I understand it (not my domain), this (the lack of widespread checkpointing and MPI fault-tolerance support) is one of the motivating factors for why ML toolkits eschew MPI (besides accelerator-to-accelerator communication being an afterthought).

shrubble

3 days ago

If you rebooted every server after 35 days, would that get rid of many of the problems?

direwolf20

3 days ago

It's an average time to failure, not a guarantee. Failures occur randomly.
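
With the ~50-day MTBF quoted upthread and the usual memoryless (exponential) failure model, roughly half the GPUs would hit a critical error inside any 35-day window anyway, scheduled reboot or not. A quick check, assuming that model holds:

  import math

  mtbf_days = 50  # H100 critical-error MTBF from the table upthread
  window = 35     # proposed reboot interval, in days

  # Memoryless model: P(failure within t) = 1 - exp(-t / MTBF)
  p_fail = 1 - math.exp(-window / mtbf_days)
  print(f"P(critical error within {window} days) ≈ {p_fail:.0%}")  # ≈ 50%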

jvalencia

3 days ago

I'm curious if running them at slightly lower voltage would fix it or if it's a software thing.

nickysielicki

3 days ago

Totally matches my experience, and it feels bizarre inside-looking-out that nobody else talks about it. Hardware from 2010-2020 was remarkably stable, and CPUs are still as stable as they were, but we've had this large influx of money spent on these chips that fall over if you look at them funny. I think it leads to a lot of people thinking, "we must be doing something wrong", because it's just outside of their mental model that hardware failures can occur at this rate. But that's just the world we live in.

It's a perfect storm: a lot of companies are doing HPC-style distributed computing for the first time, and lack experience in debugging issues that are unique to it. On top of that, the hardware is moving very fast and they're ill-equipped to update their software and drivers at the rate required to have a good experience. On top of that, the stakes are higher because your cluster is only as strong as its weakest node, which means a single hardware failure can turn the entire multi-million dollar cluster into a paperweight, which adds more pressure and stress to get it all fixed. Updating your software means taking that same multi-million dollar cluster offline for several hours, which is seen as a cost rather than a good investment of time. And a lot of the experts in HPC-style distributed computing will sell you "supported" software, which is basically just paying for the privilege of using outdated software that lacks the bug fixes that your cards might desperately need. That model made sense in the 2010s, when Linux (kernel and userspace) was less stable and you genuinely needed to lock your dependencies and let the bugs work themselves out. But that's the exact opposite of what you want to be doing in 2026.

You put all of this together, and it's difficult to be confident whether the hardware is bad, or going bad, or whether it's only manifesting because they're exposed to bugs, or maybe both. Yikes, it's no fun.

dlcarrier

3 days ago

They're also run far closer to the edge of their operational limits than CPUs, so you're far more likely to get one that barely passes manufacturing tests, then degrades just a little tiny bit and stops working.

bigwheels

3 days ago

FWIW, NVIDIA enterprise hardware does come with a good warranty and prompt RMA service.

A deep dive on why these beastly cards fail so frequently compared to all other common current day hardware would be fascinating!

indoordin0saur

3 days ago

I don't know much about the subject, but GPUs were originally meant for gaming: they would run for a few hours a day and then get rest periods, and the power draw would also vary over the time they were actively being used. With constant 24/7 usage at max capacity, is it possible that they are being pushed beyond what they were originally engineered for?

ls65536

3 days ago

My intuition would be that constant usage (not exceeding maximum rated capacity/thermals/etc.) should generally result in less wear compared to the more frequent thermal cycling that you might expect from intermittent use, but maybe there's something else going on here too. I suppose this would depend on what exactly the cause of the failure is.

Either way, these are obviously being intentionally sold for non-gaming workloads, so it wouldn't be a good argument to say that they're just being (ab)used beyond what they were intended for... unless somehow they really are being pushed beyond design limits, but given the cost of these things I can't imagine anyone doing that willingly with a whole fleet of them.

greenavocado

3 days ago

Electromigration may be a factor

zozbot234

3 days ago

Electromigration decays exponentially with inverse temperature. If it's genuinely a factor, you're running that GPU way too hot.

userbinator

3 days ago

That changed a while ago, once cryptocurrencies started getting popular.

nickysielicki

3 days ago

> A deep dive on why these beastly cards fail so frequently compared to all other common current day hardware would be fascinating!

P=CV²f
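
That is, dynamic switching power scales with capacitance, the square of voltage, and clock frequency, so even small voltage/frequency reductions buy a lot of headroom. A rough illustration (the 10% figures are illustrative, not a measured DVFS curve for any real card):

  # Relative dynamic power P = C * V^2 * f for a modest voltage/frequency reduction.
  v_scale, f_scale = 0.90, 0.90
  relative_power = v_scale**2 * f_scale
  print(f"Power at 90% voltage and 90% clock: {relative_power:.0%} of nominal")  # ~73%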

jldugger

3 days ago

It's funny, I've been watching all the NVIDIA GTC keynotes from 2012 to now to better understand the ecosystem, and Jensen pretty clearly states a few times that "it's a miracle it works at all". Clearly he's intending to brag about the defect rate on a 50-billion-transistor chip, but maybe he's more right than he realizes.

salynchnew

3 days ago

It's wild that these are the failure rates for datacenter-grade products. If you were pushing consumer GPU servers all-out, I would expect this kind of variation.

I expect it's not just a problem with Nvidia, though.

jayd16

3 days ago

For comparison, they have way more memory than a single DIMM, plus plenty of other things going on.

ecesena

3 days ago

Has anyone tried to "turn off some cores" (e.g. using the multi-instance GPU feature) and seen whether/how that increases reliability?
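
For reference, partitioning a card that way goes through nvidia-smi's MIG commands. A sketch, assuming root; profile names vary by GPU model, and enabling MIG mode may require a GPU reset to take effect:

  # Sketch: carve GPU 0 into MIG slices via the stock nvidia-smi CLI (requires root).
  import subprocess

  def run(cmd):
      print("+", " ".join(cmd))
      subprocess.run(cmd, check=True)

  run(["nvidia-smi", "-i", "0", "-mig", "1"])          # enable MIG mode on GPU 0
  run(["nvidia-smi", "mig", "-lgip"])                  # list available GPU instance profiles
  run(["nvidia-smi", "mig", "-cgi", "3g.20gb", "-C"])  # create one instance + its compute instance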

userbinator

3 days ago

I wonder if GPUs are so dense that SEUs are even more common than in CPUs or RAM.

stingrae

3 days ago

seems like it would be an issue for building datacenters in space/orbit

selkin

3 days ago

One of the issues. The idea that space is the place (for data centers) is more desperate marketing than serious engineering.

bflesch

3 days ago

In his newsletter, Ed Zitron hammered home the point that GPUs depreciate quickly, but these kinds of reliability issues are shocking to read. GPU failures are so common that they hang out in a 24/7 Slack channel with customers like Meta (who apparently can't set up a cluster themselves...).

Ed Zitron also called out the business model of GPU-as-a-service middleman companies like Modal as deeply unsustainable, and I also don't see how they can make a profit if they are only reselling public clouds. Assuming they are VC-funded, the VCs need returns for their funds.

Unlike the fiber cable laid during the dot-com boom, the GPUs in use today will eventually end up in the trash bin. These GPUs are treated like toilet paper: you use them and throw them away, nothing you will give to the next generation.

Who will be the one who marks down these "assets"? Who is providing money to buy the next batch of GPUs, now that billions are already spent?

Maybe we'll see a wave of retirements soon.

> It’s underappreciated how unreliable GPUs are. NVIDIA’s hardware is a marvel, the FLOPs are absurd. But the reliability is a drag. A memorable illustration of how AI/ML development is hampered by reliability comes from Meta’s paper detailing the training process for the LLaMA 3 models: “GPU issues are the largest category, accounting for 58.7% of all unexpected issues.”

> Imagine the future we’ll enjoy when GPUs are as reliable as CPUs. The Llama3 team’s CPUs were the problem only 0.5% of the time. In my time at Modal we can’t remember finding a single degraded CPU core.

> For our Enterprise customers we use a shared private Slack channel with tight SLAs. Slack is connected to Pylon, tracking issues from creation to resolution. Because Modal is built on top of the cloud giants and designed for dynamic compute autoscaling, we can replace bad GPUs pretty fast!

pixl97

3 days ago

>These GPUs are treated like toilet paper, you use them and throw them away, nothing you will give to the next generation.

I'm guessing this may be highly dependent on what the bathtub curve looks like, and how much the provider wants to spend on cooling.

Of course with Nvidia being a near monopoly here, they might just not give a fuck and will pump out cards/servers with shitty reliability rates simply because people keep buying them and they don't suffer any economic loss or have to sit in front of a judge.

It'd be interesting to see what the error rate per TFLOP (no /s; we're looking at operations, not time) is compared to older-generation cards.
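
One rough way to frame that comparison (the sustained-throughput figure below is a placeholder, not a measured number):

  # Rough "operations between critical errors" for an H100-class card, using the
  # ~50-day MTBF cited upthread and an assumed (not measured) sustained throughput.
  mtbf_seconds = 50 * 86_400
  sustained_tflops = 400  # placeholder sustained TFLOP/s
  ops_between_failures = sustained_tflops * 1e12 * mtbf_seconds
  print(f"~{ops_between_failures:.1e} FLOPs between critical errors")  # ~1.7e21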

topaz0

3 days ago

> Of course with Nvidia being a near monopoly here, they [...] will pump out cards/servers with shitty reliability rates simply because people keep buying them and they don't suffer any economic loss or have to sit in front of a judge.

Presumably this can't last that much longer, because the people that are buying/running these are already taking on loads of debt/venture capital to buy the past/current round of hardware without seeing much revenue from it. It's much harder to ask investors for multiples of your annual revenue just to maintain your current capabilities than it was a couple years ago to ask for many multiples of your revenue to expand your capabilities dramatically.

charles_irl

3 days ago

> Ed Zitron also called out the business model of GPU-as-a-service middleman companies like modal deeply unsustainable, and I also don't see how they can make a profit if they are only reselling public clouds.

You got a link for that? I work on Modal and would be interested in seeing the argument!

We think building a proper software layer for multitenant demand aggregation on top of the public clouds is sufficient value-add to be a sustainable business (cf. Databricks and Snowflake).

pphysch

3 days ago

Snowflake and Databricks provide data storage and pipeline features and therefore have extraordinary lock-in potential, which allows them to have sustainable business models.

GPU compute is essentially fungible. That's quite a stretch to compare those business models. Snowflake and Databricks don't necessarily have the best "value-add" and they don't need to.

bflesch

3 days ago

It was in his latest newsletter, but I can't link it right now.

ares623

3 days ago

I suppose Nvidia could invest in making their GPUs more reliable? But then that'll make everything else even more expensive, lol. If only one of the companies in the chain would take one for the team.

touisteur

3 days ago

And NVIDIA supposedly has the exact know-how for reliability, as their Jetson 'industrial' parts are qualified for 10-15 years at maximal temp. Of course, Jetson is on another point of the FLOPs-and-watts curve.

Just wondering whether reliability increases if you slow down your use of GPUs a bit, like pausing more often and not chasing every bubble and NVLink all-reduce optimization.

dsrtslnd23

3 days ago

Jetson uses LPDDR though. H100 failures seem driven by HBM heat sensitivity and the 700W+ envelope. That is a completely different thermal density I guess.

zozbot234

3 days ago

Reliability also depends strongly on current density and applied voltage, even more perhaps than on thermal density itself. So "slowing down" your average GPU use in a long-term sustainable way ought to improve those reliability figures via multiple mechanisms. Jetsons are great for very small-scale self-contained tasks (including on a performance-per-watt basis) but their limits are just as obvious, especially with the recently announced advances wrt. clustering the big server GPUs on a rack- and perhaps multi-rack level.

touisteur

3 days ago

I don't have first-hand knowledge of HBM GPUs, but on the RTX Blackwell 6000 Pro Server, the perf difference between the free up-to-600W configuration and the same GPU capped at 300W is less than 10% on any workload I could throw at it (including Tensor Core-heavy ones).

That's a very expensive 300W, and I wonder what tradeoff made them go for this, and whether capping is here a way to increase reliability...

Wonder whether there's any writeup on those additional 300 Watts...
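
For anyone wanting to reproduce that experiment, a sketch of applying such a cap programmatically, assuming the nvidia-ml-py (pynvml) bindings and root privileges; the 300 W target just mirrors the numbers above:

  import pynvml

  pynvml.nvmlInit()
  h = pynvml.nvmlDeviceGetHandleByIndex(0)

  lo, hi = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(h)  # limits in milliwatts
  target_mw = max(lo, min(hi, 300_000))  # 300 W, clamped to what the board allows
  pynvml.nvmlDeviceSetPowerManagementLimit(h, target_mw)

  print("enforced limit:", pynvml.nvmlDeviceGetEnforcedPowerLimit(h) / 1000, "W")
  pynvml.nvmlShutdown()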

zozbot234

3 days ago

> whether capping is here a way to increase reliability

Almost certainly so, and you wouldn't even need to halve the wattage; even a smaller drop ought to bring a very clear improvement. The performance profile you mention is something you see all the time on CPUs when pushed to their extremes; it's crazy to see that pro-level GPUs are seemingly being tuned the same way out of the box.

storystarling

3 days ago

It sounds like those workloads are memory bandwidth bound. In my experience with generative models, the compute units end up waiting on VRAM throughput, so throwing more wattage at the cores hits diminishing returns very quickly.
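
A quick way to sanity-check whether a kernel is in that regime is the roofline ratio. A sketch, with placeholder peak numbers rather than any particular card's specs:

  # Roofline sanity check: a kernel is memory-bandwidth bound when its arithmetic
  # intensity (FLOPs per byte moved) is below peak_flops / peak_bandwidth.
  peak_tflops = 1000          # placeholder peak TFLOP/s
  peak_bw_tbs = 3.0           # placeholder peak memory bandwidth, TB/s
  ridge_point = peak_tflops / peak_bw_tbs  # FLOP/byte where the roofline bends (~333 here)

  kernel_flops_per_byte = 50  # e.g. taken from an nsight-compute profile
  bound = "memory-bandwidth" if kernel_flops_per_byte < ridge_point else "compute"
  print(f"ridge point ≈ {ridge_point:.0f} FLOP/byte -> kernel is {bound} bound")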

zozbot234

3 days ago

If they were memory bandwidth bound wouldn't that in itself push the wattage and thermals down comparatively, even on a "pegged to 100%" workload? That's the very clear pattern on CPU at least.

touisteur

3 days ago

That's my experience as well, after monitoring frequency and temp on lots of kernels across the whole spectrum from memory-bound to L2-bound to compute-bound. It's hard to reach 600W with a memory-bound kernel. TensorRT manages it somehow with some small to mid-sized networks, but the perf increase seems capped around 10% too, even with all the magic inside.

touisteur

3 days ago

I thought so, but no: iterative small-matrix-multiplication kernels in tensor cores, or pure (generative) compute with an ultra-late reduction and an ultra-small working set. nsight-compute says everything is in L1 or the small register file, no spilling, and that I am compute bound with good ILP. I can't find a way to get more than 10% out of the 300W difference, hence asking whether anyone did better, how, and how reliable the hardware stays.

pqtyw

3 days ago

Why? Nvidia is already charging as much as they possibly can. Unlike most other components, the price is almost unrelated to manufacturing costs.

nradov

3 days ago

Nope. Nvidia has often sold products at below the market price. This has created shortages where scalpers who are able to get some supply immediately resell above list price. It might seem stupid for Nvidia to leave money on the table that way but they don't want to burn relationships with customers by raising list prices (much).

pqtyw

21 hours ago

> below the market price

Adjusting prices daily or weekly (in both directions) wouldn't have necessarily been that helpful for maximizing their long-term profits, though (as you said).

gessha

3 days ago

Why is the GB10 so expensive then ;(

ares623

3 days ago

Why make same money when more money possible?

zkvx7a

3 days ago

A taxonomy and statistics of GPU failures are described in this paper

Story of Two GPUs: Characterizing the Resilience of Hopper H100 and Ampere A100 GPUs

https://dl.acm.org/doi/10.1145/3712285.3759821

magicalhippo

3 days ago

Very interesting. From the paper:

> H100 shows 3.2× lower per-GPU mean time between errors (MTBE) compared to A100 for uncorrectable ECC memory errors. The per-GB MTBE of the H100’s HBM3 memory is 24% lower (~8.5M hours) than the A100’s HBM2e memory (~11.3M hours). We conjecture that the reduction in memory resilience stems from H100’s higher memory capacity.

> We attribute the decrease in resilience primarily to the higher memory capacity (96 GB vs. 40 GB, a 2.4× increase), which increases the chances of bit flips.

> We additionally hypothesize that H100 memory resilience is worse due to (a) a lower signaling voltage that increases susceptibility to bit flips and (b) an increased number of stacks in the HBM3 memory that make heat dissipation challenging and degrade the resilience of memory modules.

Increasing voltage just makes the heat dissipation problem worse, so probably can't just crank that up.

From what I can gather, a typical A100 or H100 is air cooled. Sounds like liquid cooling them might help, or at least allow you to bump up those voltages without thermal issues.

smsx

3 days ago

Are the numbers in the H100 PCIe vs SXM table swapped for rows 3 onwards? It looks to me like the PCIe is showing higher GiB/s numbers, which is counter to expectations. Or am I misunderstanding those benchmarks?

thundergolfer

3 days ago

You're not misunderstanding, the PCIe card does indeed outperform on the memory bandwidth tests. But it gets dominated on FLOP/s and real-world application benchmarks.

gregjm

3 days ago

I wonder why H100 H2D and D2H unpinned memcpy bandwidth is *faster* on PCIe with vendor B than on SXM with vendor D. Is resizable BAR available on PCIe but not SXM?

Or, could it be a software configuration difference? The driver API flag CU_MEMHOSTREGISTER_IOMEMORY states that host memory being physically contiguous may matter to the driver, in this context for memory-mapped memory. If vendor B has THP enabled or configured differently than vendor D, small allocations up to 2 MiB could be physically contiguous which may result in higher efficiency/more bytes transferred per request.

At a higher level: unpinned memcpy is a performance antipattern. Perhaps vendor D has fewer clients using unpinned memcpy in their workloads than vendor B, or they decided not to dedicate support to it for this reason. TensorFlow will go to great lengths to copy unpinned memory to a pinned staging buffer if you feed unpinned host memory tensors to a graph.
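
For readers following along, the pinned vs. unpinned transfer pattern under discussion, sketched in PyTorch rather than TensorFlow (sizes are illustrative):

  import torch

  x = torch.randn(64, 1024, 1024)   # ~256 MB host tensor, pageable (unpinned)
  x_pinned = x.pin_memory()         # page-locked copy; enables true async H2D DMA

  # Unpinned copy: the driver stages through an internal pinned buffer.
  y = x.to("cuda")

  # Pinned copy: can overlap with compute when non_blocking=True.
  y = x_pinned.to("cuda", non_blocking=True)
  torch.cuda.synchronize()          # wait for the async copy before using y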

checker659

3 days ago

Are both using a PCIe switch? If one is and the other isn't, it could be about PCIe credit based flow control kicking in.

Surac

3 days ago

I recently had to route a PCB for an FPGA using DDR3. It took three designs to get the RAM interface right. Don't get me wrong, I have designed such things before, but there are so many external factors. Now think of DDR of a higher order. I think they are at the edge of what can be done with today's PCB design.

pyuser583

2 days ago

My experience with GPU failures is that they have trouble loading tons of data, suggesting it's also the stress such a highly performant part puts on the system.

eleventyseven

3 days ago

> Today, we’re sharing our GPU reliability system as both a demonstration of our commitment to Modal customers and as a guide for fellow travelers renting hyperscaler or neocloud cards. It’s dangerous to go alone! Take this.

> We’ve chosen not to refer to cloud providers directly, but instead give them anonymized A, B, C, D identifiers. If you want know who’s who, track the clues or buy us a beer sometime.

Come on, either name names or admit it is pure PR.

Edit: or will someone who can decode the clues weigh in?

pests

3 days ago

I was curious and don't know enough of the cloud internals so asked an LLM:

Cloud A: AWS (Amazon Web Services)

Cloud B: Azure (Microsoft Azure)

Cloud C: GCP (Google Cloud Platform)

Cloud D: OCI (Oracle Cloud Infrastructure)

Gemini had some decent evidence for each choice too, but I didn't confirm anything.

tryauuum

3 days ago

wouldn't they run on neoclouds? paying AWS for GPUs is shooting yourself in the foot money-wise

squeefers

3 days ago

I thought this kind of thing was the cloud provider's job? This just seems to me very similar to the classic server rental model.