xAI's 100k-GPU data center in Memphis is up and running

27 points | posted 15 hours ago
by mfiguiere

5 Comments

_davide_

5 hours ago

I'm waiting for the AI bubble to burst to buy some of those at a fraction of the price :)

melodyogonna

10 hours ago

This is interesting; I thought the data center came online weeks ago

blackeyeblitzar

11 hours ago

I assume there are others already running 100k-GPU training clusters, but this article claims xAI’s Colossus is the most powerful computer ever built, and that no other company has been able to connect this many GPUs together due to networking limitations. But haven’t there been other companies who’ve purchased more GPUs? Are they simply running them as smaller, separate clusters?

xemdetia

5 hours ago

Smaller clusters are common because it depends on what you are trying to accomplish. It's the same problem as any distributed system: going wider with a map only works if you can perform a timely reduce. With checkpointing you run into things like network storage being too slow, and so on. There's also the fact that you might want to work on or with many models, and what you want to do for production (inference) is different from what you try to accomplish in development (training).
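To put very rough numbers on the "timely reduce" and checkpointing points, here's a back-of-envelope sketch in Python. Every figure in it (model size, bytes per parameter, interconnect and storage bandwidth) is a made-up assumption for illustration, not anything from the article:

    # Back-of-envelope sketch. All numbers below are illustrative
    # assumptions, not figures from the article.

    GB = 1e9
    TB = 1e12

    def allreduce_time_s(grad_bytes: float, per_gpu_bw: float) -> float:
        # A ring all-reduce moves roughly 2x the gradient volume per GPU
        # per step, nearly independent of GPU count, but only if the
        # interconnect actually delivers that bandwidth at full scale,
        # which is the hard networking problem at 100k GPUs.
        return 2 * grad_bytes / per_gpu_bw

    def checkpoint_time_s(state_bytes: float, storage_bw: float) -> float:
        # Time to flush the full training state (weights plus optimizer
        # moments) to shared network storage.
        return state_bytes / storage_bw

    # Hypothetical model: 400B parameters, fp16 gradients, Adam-style
    # optimizer state (~14 bytes of persistent state per parameter).
    params = 400e9
    grad_bytes = params * 2
    state_bytes = params * 14

    print(f"all-reduce per step: {allreduce_time_s(grad_bytes, 50 * GB):.0f} s "
          f"(assuming 50 GB/s per-GPU interconnect)")
    print(f"full checkpoint:     {checkpoint_time_s(state_bytes, 1 * TB):.0f} s "
          f"(assuming 1 TB/s aggregate storage bandwidth)")

Real systems shard the state and overlap communication with compute, but the scaling pressure is the same: past some cluster size the fabric and the storage path, not the GPUs, set the ceiling, which is why running several smaller clusters is often the pragmatic choice.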

michelb

3 hours ago

I think Meta’s cluster will be slightly larger than 100k next month, scaling to 350k by end of year, and double that sometime next year, according to their info.