How oxide cuts data center power consumption in half

195 points posted 2 months ago
by tosh

184 Comments

KenoFischer

2 months ago

I really love Oxide to an unhealthy amount (it's become a bit of a meme among my colleagues), but sometimes I do wonder whether they went about their go-to-market the right way. They really tried to do everything at once - custom servers, custom router, custom rack, everything. Their accomplishments are technologically impressive, but, as somebody who is in a position to make purchasing decisions, not economically attractive. They're 3x more expensive than our existing hardware, two generations behind (I'm aware they're on track for a refresh) and don't have any GPUs. E.g. what I would have loved to see is just an after-market BMC/NIC/firmware solution using their stack. Plug it into a cheap Gigabyte system (their BMC is pluggable and NIC is OCP) and just have the control plane manage it as a whole box. I'd have easily paid several thousand $ per server just for that. All the rack scale integration, virtualization, migration, network storage, etc stuff is cool, but not everyone needs it. Get your foot in the door at customers, build up some volume for better deals with AMD, and then start building the custom rack stuff ... Of course it's easy to be a critic from the side lines. As I said, I do really love what the Oxide folks are doing, I just really hope it'll become possible for me to buy their gear at some point.

bcantrill

2 months ago

First, thanks for the love -- it's deeply appreciated! Our go-to-market is not an accident: we spent a ton of time (too much time?) looking at how every company had endeavored (and failed) in this space, and then considering a bunch of other options besides. Plugging into a "cheap Gigabyte" system wouldn't actually allow us to build what we've built, and we know this viscerally: before we had our system built, we had to have hardware to build our software on -- which was... a bunch of cheap Gigabyte systems. We had the special pain of relearning all of the reasons why we took the approach we've taken: these systems are a non-starter with respect to foundation.

You may very well not need the system that we have built, but lots of people do -- and the price point versus the alternatives (public cloud or on-prem commodity HW + pretty pricey SW) has proven to be pretty compelling. I don't know if we'll ever have a product that hits your price point (which sounds like... the cost of Gigabyte plus a few thousand bucks?), but at least the software is all open source!

KenoFischer

2 months ago

Please forgive my tergiversation. I fully trust that you know your path and I know how annoying it is to be why-dont-they-just'd. As I said, I'm rooting for you.

m463

2 months ago

> The meaning of TERGIVERSATION is evasion of straightforward action or clear-cut statement : equivocation

KenoFischer

2 months ago

There's two dictionary definitions of tergiversate. One is the one you quoted, the other is one of desertion. Both meanings of the word are pejorative in the sense that the word comes with a connotation of betrayal of a cause. What I wanted to express was an acknowledgement that I understood the feeling that you get when someone who's clearly a fan of your work nevertheless does not provide a clear endorsement. It's easy emotionally to dismiss people who "just don't get it". But when someone does get it but chooses to equivocate, that can feel like an emotional betrayal. So I was looking for a word that covered that with the right connotation. I originally used apostasy, but it didn't feel quite right, because I wasn't really renouncing, more failing to fully endorse, so tergiversation it was. Of course having to write an entire paragraph to explain your word choice kind of defeats the purpose of choosing a single well fitting word over just writing a sentence of simple words that explains what you mean. But hey, I write enough technical writing, documentation, reports, grants, etc. all day where clarity is paramount that I feel like I get to have a little vocabulary treat in my personal writing ;).

PeterCorless

2 months ago

So my question: any Arm-based system or GPU-based system on the horizon?

alberth

2 months ago

You just described why commodity servers won over engineered systems that came before Oxide (like Nutanix, Sun / Oracle Exa*, VCE etc).

So I totally agree with your go-to-market comment, because it’s also a bet against cloud.

I wish them luck though.

panick21_

2 months ago

And yet, none of the hyperscalers use commodity servers. They are buying parts from the OCP but those are hardly 'commodity' servers. So did they win?

chambers

2 months ago

I kinda feel that their focus is more on building a great technology (& culture?) than a great business.

Not necessarily a bad choice; after all, for what shall it profit a man, if he shall gain the whole world, and lose his own soul?

intelVISA

2 months ago

Oxide are doing great work. Hoping they can probe the market a bit more for us out on the sidelines preparing to drop in and compete with some similar tech.

preisschild

2 months ago

I also wish I could get to play around with a cheaper version of their tech, but they probably have enough customers that really want a large-scale solution that is completely customizable.

cdchn

2 months ago

I'm curious what their burn rate is.

unsnap_biceps

2 months ago

> When we started Oxide, the DC bus bar stood as one of the most glaring differences between the rack-scale machines at the hyperscalers and the rack-and-stack servers that the rest of the market was stuck with. That a relatively simple piece of copper was unavailable to commercial buyers

It seems that 0xide was founded in 2019 and Open Compute Project had been specifying dc bus bars for 6 years at that point. People could purchase racks if they wanted, but it seems like, by and large, people didn't care enough to go whole hog in on it.

Wonder if the economics have changed or if it's still just neat but won't move the needle.

walrus01

2 months ago

Things like -48VDC bus bars in the 'telco' world significantly predate the OCP, all the way back to like 1952 in the Bell system.

In general, the telco world concept hasn't changed much. You have AC grid power coming from your local utility into some BIG ASS RECTIFIERS which create -48VDC (and are responsible for charging your BIG ASS BATTERY BANK to float voltage), then various DC fuses/breakers going to distribution of -48VDC bus bars powering the equipment in a CO.

Re: Open Compute, the general concept of what they did was go to a bunch of 1U/2U server power supply manufacturers and get them to make a series of 48VDC-to-12VDC power supplies (which can be 92%+ efficient), and cut out the need for legacy 5VDC feed from power supply into ATX-derived-design x86-64 motherboards.
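
A rough sketch of why the conversion efficiency mentioned above matters: stages in series multiply, so every stage that is removed or improved shows up at the wall. The 92% DC-DC figure is from the comment; the 96% rectifier efficiency and the 500 W load are assumptions picked only for illustration.

```python
from math import prod


def chain_efficiency(stage_efficiencies):
    """Overall efficiency of conversion stages connected in series."""
    return prod(stage_efficiencies)


def wall_power_w(load_w, stage_efficiencies):
    """Input power needed to deliver load_w through the whole chain."""
    return load_w / chain_efficiency(stage_efficiencies)


RECTIFIER_EFF = 0.96  # assumption, not from the thread
DC_DC_EFF = 0.92      # the "92%+" 48VDC-to-12VDC figure from the comment
LOAD_W = 500.0        # hypothetical server load

stages = [RECTIFIER_EFF, DC_DC_EFF]
overall = chain_efficiency(stages)
wall_w = wall_power_w(LOAD_W, stages)

print(f"overall efficiency: {overall:.1%}")
print(f"wall power for a {LOAD_W:.0f} W load: {wall_w:.0f} W")
```

With these assumed numbers roughly 12% of the input is lost before it reaches the board, which is the kind of overhead a shared rectifier shelf and bus bar try to shrink.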

m463

2 months ago

I remember seeing an old telephone switching system from the 20's and I think it was 48vdc. Uncertain though.

ttyprintk

2 months ago

Yeah, would have been 48 vdc for line operations, 60 and up AC for the ring.

indrora

2 months ago

Part of the issue is that you simply can't buy OCP hardware, not new anyway. What you're going to find is "OCP Inspired" hardware that has some overlap with the full OCP specification but is almost always meant to run on 240VAC on 19in racks because nobody wants to invest the money in something that can't be bought from CDW.

p_l

2 months ago

I remember the one time I had OCP hardware in data center, and how it was essentially rumoured it's better to not ask too much how it got there - not the level of "fell off a truck", but some possibility it was ex-(big tech) equipment acquired through favours, or some really insistent negotiating with Quanta till "to be sold to (big tech)" racks ended up with us

zamalek

2 months ago

It's normally incredibly difficult for employees to disrupt at massive companies that would be the type which runs a data center. Disruption usually enters the corp in a sales deck, much like the one Oxide would have.

It's stupid, but that's why we all have jobs.

hnthrowaway0328

2 months ago

I think engineers should be more forceful to lead their own visions instead of being led by accountants and lawyers.

After engineers have the power of implementation and de-implementation. They need to step into dirty politics and bend other people's views.

It's either theirs or ours. Win-win is a fallacy.

andrewjf

2 months ago

Being able to navigate this is what differentiates a very senior IC (principal, distinguished, etc) and random employees.

orochimaaru

2 months ago

Yes. I think as an engineer at this level you need to also have the patience to deal with the bean counters.

But as I’ve grown in my career I’ve actually found that line of thinking refreshing. Can you quantify benefit? If it requires too many assumptions it’s probably not worth it.

But then again there’s always the VP or the SVP who wants to “showcase his towers’ innovative spirit” and then there goes money that could be used for better things. The innovative spirit of the day is random LLM apps.

philipov

2 months ago

Let me know how that works out for you!

hinkley

2 months ago

Once the accountants are convinced the entire company is about them, there’s not much the engineers can do. They just starve you out by refusing to buy anything. It’s a big reason why open source is as successful as it is. It’s free so they can’t stop you with the checkbook.

bigfatkitten

2 months ago

OCP hardware is only really accessible to hyperscalers. You can't go out and just buy a rack or two, the Taiwanese OEMs don't do direct deals that small. Even if they did, no integration is done for you. You would have to integrate the compute hardware from one company, the network fabric from another company, and then the OS and everything else from yet another. That's a lot of risk, a lot of engineering resources, a lot of procurement overhead, and a lot of different vendors pointing fingers at each other when something doesn't work.

If you're Amazon or Google, you can do this stuff yourself. If you're a normal company, you probably won't have the inhouse expertise.

On the other hand, Oxide sells a turnkey IaaS platform that you can just roll off the pallet, plug in and start using immediately. You only need to pay one company, and you have one company to yell at if something goes wrong.

You can buy a rack of 1-2U machines from Dell, HPE or Cisco with VMware or some other HCI platform, but you don't get that power efficiency or the really nice control plane Oxide have on their platform.

leoc

2 months ago

But isn’t it a little surprising (I’m not an expert) that Dell or Supermicro or some firm like that hadn’t already started offering approachable access to either OCP gear or a proprietary knockoff of it? Presumably that may still happen if Oxide is seen to have proven the market.

kjellsbells

2 months ago

Azure tried this, not with their hyperscaler stuff, but with Azure Operator Nexus.

Basically an "opinionated" combination of Dell, Arista, and Pure storage with a special Azure AKS running on top and a metric ton of management and orchestration smarts. The target customer base was telcos who needed local capabilities in their data centers and who might otherwise have gone to OCP.

As far as I can surmise, it's dead, but not EOLed. Microsoft nuked the operator business unit earlier in the year, and judging by recent job postings from contract shops, AT&T might be the only customer.

panick21_

2 months ago

These companies are locked into their way of doing things. Also, they would be competing with themselves. It would also require more work on their side than they do now.

I think the whole 'existing company is not doing something, therefore it's a bad idea' is a really dangerous take.

Oxide is also not exactly OCP. They share some aspects, but Oxide racks are optimized for the typical DC of large organizations. Maybe there is a balance there that matters.

Sylamore

2 months ago

HP BladeSystem p-series chassis were all DC bus bar powered back in the mid 2000s. You had a power enclosure which provided DC output to one or more chassis in a rack over the bus bar. We were glad to be rid of those blades but it wasn't because of their power configuration.

TZubiri

2 months ago

One is the specs and the other is an actual implementation, what am I missing?

walrus01

2 months ago

They do have a good point here. If you do the total power budget on a typical 1U (discrete chassis, not blade) server which is packed full of a wall of 40mm fans pushing air, the highest speed screaming 40mm 12VDC fans can be 20W electrical load each. It's easy to "spend" at least 120W at maximum heat from the CPUs, in a dual socket system, just on the fans to pull air from the front/cold side of the server through to the rear heat exhaust.

Just going up to 60mm or 80mm standard size DC fans can be a huge efficiency increase in watt-hours spent per cubic meters of air moved per hour.

I am extremely skeptical of the "12x" but using larger fans is more efficient.

from the URL linked:

> Bigger fans = bigger efficiency gains Oxide server sleds are designed to a custom form factor to accommodate larger fans than legacy servers typically use. These fans can move more air more efficiently, cooling the systems using 12x less energy than legacy servers, which each contain as many as 7 fans, which must work much harder to move air over system components.
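
A back-of-the-envelope version of the fan budget described above. The 20 W per fan and the roughly 120 W total come from the comment; the fan count of six (which is what makes those two figures agree) and the 800 W total server draw are assumptions for illustration. Dividing by 12 only shows what the quoted claim would imply for this hypothetical server; it is not a measurement.

```python
FAN_MAX_W = 20.0   # per-fan load at full speed, from the comment
FAN_COUNT = 6      # assumption: 6 x 20 W matches the ~120 W in the comment
SERVER_W = 800.0   # hypothetical total draw of a dual-socket 1U server

fan_total_w = FAN_MAX_W * FAN_COUNT
fan_share = fan_total_w / SERVER_W
claimed_fan_w = fan_total_w / 12  # what "12x less energy" would imply here

print(f"fans at full speed: {fan_total_w:.0f} W ({fan_share:.0%} of the server)")
print(f"same airflow budget at 12x less: {claimed_fan_w:.0f} W")
```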

eaasen

2 months ago

FWIW, we had to have the idle speed of our fans lowered because the usual idle of around 5k RPM was WAY too much cooling. We generally run our fans at around 2.5kRPM (barely above idle). This is due to not only the larger fans, but also the fact that we optimized and prioritized as little restriction on airflow as possible. If you’ve taken apart a current gen 1U/2U server and then compare that to how little our airflow is restricted and how little our fans have to work, the 12X reduction becomes a bit clearer.
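
One way to see why halving fan speed matters so much is the idealized fan affinity laws: for the same fan in the same system, airflow scales with speed, pressure with speed squared, and shaft power with speed cubed. The sketch below applies them to the 5k and 2.5k RPM figures from the comment; real fans and motors deviate from the ideal, so treat the result as an estimate only.

```python
def affinity_scale(rpm_new, rpm_old):
    """Idealized fan affinity laws for one fan in an unchanged system."""
    ratio = rpm_new / rpm_old
    return {
        "airflow": ratio,        # flow is proportional to speed
        "pressure": ratio ** 2,  # static pressure goes with speed squared
        "power": ratio ** 3,     # shaft power goes with speed cubed
    }


scale = affinity_scale(rpm_new=2500, rpm_old=5000)
print(f"airflow:  {scale['airflow']:.3f}x")
print(f"pressure: {scale['pressure']:.3f}x")
print(f"power:    {scale['power']:.3f}x")  # 0.125x, i.e. one eighth
```

Under these idealized assumptions, running at half speed moves half the air for about one eighth of the fan power, which is why a low-restriction airflow path that permits slow fans pays off disproportionately.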

znpy

2 months ago

> the usual idle of around 5k RPM was WAY too much cooling.

What does this mean? Can one actually get too much cooling? Do you get like condensation and stuff, that kind of "too much cooling" ?

I'm not being snarky, i actually don't know.

kardos

2 months ago

It must mean cooling significantly below the target temperature, and thus wasting power to do it

znpy

2 months ago

I see, thank you!

LeoPanthera

2 months ago

I really wish Oxide had homelab/prosumer grade stuff. I'd be sending them so much money.

hinkley

2 months ago

I kinda feel we need minicomputers back in this age of computing. Instead of making one giant rack that doesn’t fit through doorways, they should make a 4 ft tall unit that stacks. At least once they’re established enough that they can manage doing small installs instead of full data centers. I’ve looked around and there are tiny forklifts they could use to install 2 at once.

Just the power demands for their full rack exceed capacity for most office spaces.

That and someone needs to make a rack that has a port to plug a glycol line directly into. Doesn’t have to be Oxide, but someone should.

VTimofeenko

2 months ago

A ~20U rack working off residential 15/20A would have been so cool.

Though given how it's designed for the datacenters, I'd expect the thing to be pretty darn loud.
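
For a sense of the power budget a residential 15/20A circuit implies, here is a minimal sketch. It assumes a 120 V North American branch circuit and the common practice of loading a breaker to no more than 80% for continuous loads; neither figure is from the thread, and other regions use different voltages.

```python
VOLTS = 120.0            # assumption: North American residential circuit
CONTINUOUS_DERATE = 0.8  # assumption: 80% rule of thumb for continuous loads


def continuous_budget_w(breaker_amps, volts=VOLTS, derate=CONTINUOUS_DERATE):
    """Approximate continuous power available from one branch circuit."""
    return breaker_amps * volts * derate


budgets = {amps: continuous_budget_w(amps) for amps in (15, 20)}
for amps, watts in budgets.items():
    print(f"{amps} A circuit: about {watts:.0f} W continuous")
```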

steveklabnik

2 months ago

> Though given how it's designed for the datacenters, I'd expect the thing to be pretty darn loud.

It's actually very much the opposite: the rack is very, very quiet. You can hear for yourself: https://www.youtube.com/watch?v=bYcgPRIWf6I

VTimofeenko

2 months ago

That is quiet, indeed! Have you done any decibel measurements by any chance? I wonder how loud it would be when compared to just ambient residential noise level.

steveklabnik

2 months ago

I don't remember off the top of my head.

It's quiet enough that one customer is putting one just straight-up on their office floor, rather than in a colo somewhere. I've stood next to one in our office (which is a big garage, no soundproofing, so sound otherwise bounces around a lot) and had conversations easily.

VTimofeenko

2 months ago

Thanks for the info, it would definitely pass the WAF gate from that perspective :)

TabTwo

2 months ago

Isn't most of their stuff open source?

steveklabnik

2 months ago

It is, but if you're running on different hardware than us, you'd have to do a bunch of porting. Buying a solution would be a lot simpler, as we'd have already done the porting.

Gormo

2 months ago

Have you thought of building an affordable small-scale product for home labs and maybe SMBs? Even if that line didn't turn a profit, it could function as a loss leader in getting engineers and consultants familiar with Oxide, and an opportunity to experiment with (and ultimately evangelize) your tech stack without needing to already have an enterprise-scale use case.

steveklabnik

2 months ago

In general, we love the love we get from homelab folks, but the issue is that the current thesis of our designs is "take advantage of the scale of building at the full-rack level."

We really can't afford to do loss leaders before we have more of a business. It's already difficult enough to build a company like this, and that's with making money off of sales. I fully agree that in general, this idea completely makes sense, but you can only really employ it once you have a business to be able to absorb those losses. Right now, building and selling the current product takes up 110% of our time.

Gormo

2 months ago

I respect that, and I hope you get to that point! As a tech leader in an organization that currently falls short of the scale we'd need to justify Oxide products, I'm hoping that day comes soon.

We're getting to the point where people are building large clusters of Raspberry Pis and the like for hobbyist projects, so I hope that within a few years, the concept of "full-rack level" can encompass hardware with hundreds of nodes small and cheap enough to be packed into a "rack" that still fits under a desk and sells for a couple grand.

In the meantime, I guess I'll have to settle for exploring your code and listening to your podcast!

renewiltord

2 months ago

What I don't get is why tie to such an ancient platform. AMD Milan is my home lab. The new 9004 Epycs are so much better on power efficiency. I'm sure they've done their market research and the gains must be so significant. We used to have a few petabytes and tens of thousands of cores almost ten years ago and it's crazy how much higher data and compute density you can get with modern 30 TiB disks and Epyc 9654s. 100 such nodes and you have 10k cores and really fast data. I can't see myself running a 7003-series datacenter anymore unless the Oxide gains are that big.

farawayea

2 months ago

They've built this a while ago. A hardware refresh takes time. The good news is that they may be able to upgrade the existing equipment with newer sleds.

jclulow

2 months ago

Yes we're definitely building the next generation of equipment to fit into the existing racks!

znpy

2 months ago

my understanding is that they had to build not only the entire hardware platform from scratch, but also the software.

in one of his talks Bryan Cantrill talks about how AMD cpus were meant to be booted off a uefi microcode, and AMD themselves told them such... Until they kinda reverse engineered the AGESA thingy and made the cpu boot without bios/uefi.

I guess that's the kind of thing that takes a lot of time... the first time. In the future they'll likely be iterating faster.

EDIT: i wrote the comment above to the best of my knowledge, somebody from Oxide might chime in and maybe add some more details :)

zcw100

2 months ago

I believe the telcos did dc power for years so I don’t think this is anything new. Any old hands out there want to school us on how it was done in the old days?

iamthepieman

2 months ago

Every old telco technician had a story about dropping a wrench on a busbar or other bare piece of high powered transmission equipment and having to shut that center down, get out the heavy equipment, and cut it off because the wrench had been welded to the bus bars.

jclulow

2 months ago

Note that the rack doesn't accept DC input, like lots of (e.g., NEBS certified) telco equipment. There's a bus bar, but it's enclosed within the rack itself. The rack takes single- or three-phase AC inputs to power the rectifiers, which are then attached to the internal bus bar.

walrus01

2 months ago

big ass rectifiers

big ass solid copper busbars

huge gauge copper cables going around a central office (google "telcoflex IV")

big DC breaker/fuse panels

specialized dc fuse panels for power distribution at the top of racks, using little tiny fuses

100% overhead steel ladder rack type cable trays, since your typical telco CO was never a raised floor type environment (UNLIKE legacy 1960s/1970s mainframe computer rooms), so all the power was kept accessible by a team of people working on stepladders.

The same general thing continues today in serious telco/ISP operations, with tech features to bring it into the modern era. The rectifiers are modular now, and there's also rectiverters. Monitoring is much better. People are moving rapidly away from wet cell 2V lead acid battery banks and AGM sealed lead acid stuff to LiFePO4 battery systems.

DC fuse panels can come with network-based monitoring, ability to turn on/off devices remotely.

equipment is a whole lot less power hungry now, a telco CO that has decommed a 5ESS will find itself with a ton of empty thermal and power budget.

when I say serious telco stuff is a lot less power hungry, it's by huge margins. randomly chosen example of radio transport equipment. For instance back in the day a powerful, very expensive point to point microwave radio system might be a full 42U rack, 800W in load, with waveguide going out to antennas on a roof. It would carry one, two or three DS3 equivalent of capacity (45 Mbps each).

now, that same telco might have a radio on its CO roof in the same microwave bands that is 1.3 Gbps FDD capacity, pure ethernet with a SFP+ fiber interface built into it, and the whole radio is a 40W electrical load. The radio is mounted directly on the antenna with some UV/IR resistant weatherproof 16 gauge DC power cable running down into the CO and plugged into a fuse panel.
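
To put the "huge margins" into numbers, the sketch below compares watts per Mbps for the two radios using only figures from the comment: 800 W for up to three DS3s at 45 Mbps each, versus 40 W for 1.3 Gbps. Taking the best case of three DS3s for the legacy system is a choice made here, so the real gap could be larger.

```python
def watts_per_mbps(load_w, capacity_mbps):
    """Electrical load per unit of carried capacity."""
    return load_w / capacity_mbps


# Legacy rack-mounted radio: 800 W, best case of three DS3s (45 Mbps each).
legacy = watts_per_mbps(load_w=800.0, capacity_mbps=3 * 45.0)

# Modern all-outdoor radio: 40 W, 1.3 Gbps.
modern = watts_per_mbps(load_w=40.0, capacity_mbps=1300.0)

improvement = legacy / modern
print(f"legacy: {legacy:.2f} W/Mbps")
print(f"modern: {modern:.3f} W/Mbps")
print(f"improvement: roughly {improvement:.0f}x")
```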

applied_heat

2 months ago

Can you give me a link to this 1.3 gbps radio product? I have some Alcatel radios with waveguides on a licensed band that only do 50 megabit that I would upgrade if there was something that could get more bits out of the same bandwidth and towers.

walrus01

2 months ago

Ceragon is one brand name. If you need to keep an entirely indoor unit radio in a rack with the existing waveguide it'll cost a little more, since that's a more rare configuration for new 4096QAM modulation radios.

The 1.3 Gbps full duplex capacity assumes dual linear H&V polarization simultaneously, and assumes an 80 MHz wide FDD channel split such as in the 11 GHz high/low band plan. If you're in FCC part 101 regulatory band territory, then depending on what frequency your existing radios use and on the existing path, you might not have that capacity. You could have an existing 40 MHz wide channel, which will be half the capacity.

If you have a 50 Mbps radio product it's also very likely you're in a single polarity so you would need to recoordinate the path (around $1500) entirely to get the same MHz in the opposite polarity.
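
A rough way to estimate what a given license could carry is to scale linearly from the reference point in the comment (1.3 Gbps on an 80 MHz channel with both polarizations). Linear scaling with channel width and polarization count is a simplifying assumption; actual throughput depends on the modulation the path can sustain, so treat these as upper-bound estimates.

```python
REF_CAPACITY_MBPS = 1300.0  # from the comment
REF_CHANNEL_MHZ = 80.0      # from the comment
REF_POLARIZATIONS = 2       # dual H&V, from the comment


def estimated_capacity_mbps(channel_mhz, polarizations):
    """Scale the reference capacity linearly (a simplifying assumption)."""
    per_mhz_per_pol = REF_CAPACITY_MBPS / (REF_CHANNEL_MHZ * REF_POLARIZATIONS)
    return per_mhz_per_pol * channel_mhz * polarizations


for mhz, pols in [(80, 2), (40, 2), (40, 1)]:
    cap = estimated_capacity_mbps(mhz, pols)
    print(f"{mhz} MHz, {pols} polarization(s): about {cap:.0f} Mbps")
```

Even the single-polarity 40 MHz case lands well above the 50 Mbps radios mentioned earlier, which is why recoordinating the path can be worth the fee.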

EvanAnderson

2 months ago

I don't have a link handy (on my phone), but I was involved in installs of licensed Cambium 18Ghz radios last year that were pushing >1Gbps. PTP-800 was the model number, if memory serves.

hinkley

2 months ago

The first large scale app I did we got offices in a building that used to have telco equipment in it. There wasn’t enough power or cooling to run about a rack worth of equipment split across several. It basically had a mini-split for AC. We had to bring in new wiring and run a glycol line to a condenser on the roof, and the smallest unit we were willing to pay for was too big so we had to knock out a wall to tack a reasonable sized office onto the end to get the volume large enough. So much wasted space for the amount of equipment in there.

farawayea

2 months ago

Their tech may be more than adequate today. Bigger businesses may not buy from a small startup company. They expect a lot more. Illumos is a less popular OS. It wouldn't be the first choice for the OS I'd rely on. Who writes the security mitigations for speculative execution bugs? Who patches CVEs in the shipped software which doesn't use Rust?

AlotOfReading

2 months ago

The answer to "who does X" is Oxide. That's the point. You're not going to Dell who's integrating multiple vendors in the same box in a way that "should" work. You're getting a rack where everything is designed to work together from top to bottom.

The goal is that you can email Oxide and they'll be able to fix it regardless of where it is in the stack, even down to the processor ROM.

toomuchtodo

2 months ago

This. If you want on prem cloud infra without having to roll it yourself, Oxide is the solution.

(no affiliation, just a fan)

carlhjerpe

2 months ago

If you want on prem infra in exactly the shape and form Oxide delivers*

I've read and understood from Joyent and SmartOS that they believe fault tolerant block devices / filesystems is the wrong abstraction, your software should handle losing storage.

eaasen

2 months ago

We do not put the onus on customers to tolerate data loss. Our storage is redundant and spread through the rack so that if you lose drives or even an entire computer, your data is still safe. https://oxide.computer/product/storage

panick21_

2 months ago

They have partly changed their position on that. You can listen to their podcast on their distributed block storage solution.

yencabulator

2 months ago

And a big enough customer will evaluate Oxide's resources and consider for themselves whether they think Oxide can provide a quick enough turnaround for everything. That's what GP is talking about.

throw0101d

2 months ago

> Bigger businesses may not buy from a small startup company.

What would you classify Shopify as?

> One existing Oxide user is e-commerce giant Shopify, which indicates the growth potential for the systems available.

* https://blocksandfiles.com/2024/07/04/oxide-ships-first-clou...

Their CEO has tweeted about it:

* https://twitter.com/tobi/status/1793798092212367669

> Who writes the security mitigations for speculative execution bugs? Who patches CVEs in the shipped software which doesn't use Rust?

Oxide.

This is all a pre-canned solution: just use the API like you would an off-prem cloud. Do you worry about AWS patching stuff? And how many people purchasing 'traditional' servers from Dell/HPe/Lenovo worry about patching things like the LOM?

Further, all of Oxide's stuff is on Github, so you're in better shape for old stuff, whereas if the traditional server vendors EO(S)L something firmware-wise you have no recourse.

cdchn

2 months ago

How much did Shopify buy? Sounds like from what the CEO is saying they bought 1 unit.

>We learned that Oxide has so far shipped “under 20 racks,” which illustrates the selective markets its powerful systems are aimed at.

>B&F understands most of those systems were deployed as single units at customer sites. Therefore, Oxide hopes these and new customers will scale up their operations in response to positive outcomes.

Yikes. If they sold 20 racks in July, how many are they up to now?

packetlost

2 months ago

Illumos is the OS for the hypervisor and core services, they don't expect their customers to run their code directly on that OS, but inside VMs.

steveklabnik

2 months ago

> Bigger businesses may not buy from a small startup company.

Our early customers include government, finance, and places like Shopify.

You’re not wrong that some places may prefer older companies but that doesn’t mean they all do.

Illumos is not really directly relevant to the customer, it’s a non user facing implementation detail.

We provide security updates.

mycoliza

2 months ago

We write the security mitigations. We patch the CVEs. Oxide employs many, perhaps most, of the currently active illumos maintainers --- although I don't work on the illumos kernel personally, I talk to those folks every day.

A big part of what we're offering our customers is the promise that there's one vendor who's responsible for everything in the rack. We want to be the responsible party for all the software we ship, whether it's firmware, the host operating system, the hypervisor, and everything else. Arguably, the promise that there's one vendor you can yell at for everything is a more important differentiator for us than any particular technical aspect of our hardware or software.

sunshowers

2 months ago

The illumos bare-metal OS is not directly visible to customers.

arpinum

2 months ago

How long before a VPS pops up running Oxide racks? Or, why wouldn't a VPS build on top of Oxide if they offer better efficiency and server management?

steveklabnik

2 months ago

Someone could if they wanted to! We’ll see if anyone does.

INTPenis

2 months ago

Because they use such esoteric software that you'll forever be reliant on Oxide.

I'd rather they use more standardized open source software like Linux, Talos, k8s, Ceph, KubeVirt. Instead of rolling it all themselves on an OS that has a very small niche ecosystem.

AceJohnny2

2 months ago

Oxide is providing an x86 platform to run VMs/containers on. That's a commoditized market.

The value they're offering is that the rack-level consumption and management is improved over the competition, but you should be able to run whatever you want on the actual compute, k8s or whatnot.

This also means you'd not be forever reliant on Oxide.

louwrentius

2 months ago

I’m rooting for solutions like this as an alternative to the public cloud. I do see that an org would rely on one company that theoretically can do a ‘Broadcom VMware’ on them but I don’t get this vibe from 0x1d3 at all.

But they target large orgs, I wish a solution like this would be accessible for smaller companies.

I wish I could throw their stack on my second hand cots hardware, rent a few U’s in two colos for geo redundancy and cry of happiness each month realizing how much money we save on public cloud cost, yet having cloud capabilities/benefits

huijzer

2 months ago

> Here’s a sobering thought: today, data centers already consume 1-2% of the world’s power, and that percentage will likely rise to 3-4% by the end of the decade.

I don't get this marketing angle. I've made arguments here before that the cost of compute from an energy perspective is often negligible. If Google Maps, for example, can save you 1 mile due to better routing, then that is several orders of magnitude more efficient [1].

If it uses less resources, it uses less resources. Everybody (businesses and individuals) loves that.

[1]: https://news.ycombinator.com/threads?id=huijzer&next=4206549...

adgjlsfhk1

2 months ago

both are true. using computers to reduce emissions is good, and reducing computer emissions is good.

grecy

2 months ago

I'm amazed Apple don't have a rack mount version of their M series chips yet.

Even for their own internal use in their data centers they'd have to save an absolute boat load on power and cooling given their performance per watt compared to legacy stuff.

bayindirh

2 months ago

Oxide is not touching DLC systems in their post even with a 100ft barge pole.

Lenovo's DLC systems use 45 degrees C water to directly cool the power supplies and the servers themselves (water goes through them) for > 97% heat transfer to water. In cooler climates, you can just pump this to your drycoolers, and in winter you can freecool them with just air convection.

Yes, the TDP doesn't go down, but cooling costs drop and efficiency shoots up considerably, reducing PUE to 1.03 levels. You can put a tremendous amount of compute or GPU power in one rack, and cool it efficiently.

Every chassis handles its own power, but IIRC, all the chassis electricity is DC, and the PSUs are extremely efficient.
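For readers unfamiliar with the metric: the 1.03 figure reads as a PUE (power usage effectiveness) value, which is total facility power divided by IT equipment power. A minimal sketch of what that ratio implies, using made-up load numbers purely for illustration:

```python
def pue(it_kw: float, overhead_kw: float) -> float:
    """PUE = total facility power / IT equipment power."""
    return (it_kw + overhead_kw) / it_kw


def overhead_fraction(pue_value: float) -> float:
    """Cooling and other overhead, as a fraction of the IT load."""
    return pue_value - 1.0


# Hypothetical numbers: 1000 kW of IT load with 30 kW of cooling/overhead.
example = pue(1000.0, 30.0)
print(f"PUE: {example:.2f}, overhead: {overhead_fraction(example):.0%} of IT load")
```

So a PUE of 1.03 means roughly 3% extra power on top of what the servers themselves draw.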

hinkley

2 months ago

The in case PSUs I’ve seen them gesturing to in videos don’t even seem to have cooling fins on them.

walrus01

2 months ago

Companies buying massive cloud scale server hardware want to be able to choose from a dozen different Taiwanese motherboard manufacturers. Apple is in no way motivated to release or sell the M3/M4 CPUs as a product that major east asia motherboard manufacturers can design their own platform for. Apple is highly invested in tightly integrated ecosystems where everything is soldered down together in one package as a consumer product (take a look at a macbook air or pro motherboard for instance).

vineyardmike

2 months ago

…Apple has made rack-mounted computers in recent history. They don’t sell chips, they sell complete boxes with rack mount hardware, motherboard and all.

https://www.apple.com/shop/product/G1720LL/A/Refurbished-Mac...

walrus01

2 months ago

An extremely niche product for things like video editing studios, not something you can deploy at scale in colocation/datacenter environments. Literally never seen rackmounted apple hardware in a serious datacenter since the apple xserve 20 to 22 years ago.

rincebrain

2 months ago

I don't think they'd admit much about it even if they had one internally, both because Apple isn't known for their openness about many things, and because they already exited the dedicated server hardware business years ago, so I think they're likely averse to re-entering it without very strong evidence that it would be beneficial for more than a brief period.

In particular, while I'd enjoy such a device, Apple's whole thing is their whole-system integration and charging a premium because of it, and I'm not sure the markets that want to sell people access to Apple CPUs will pay a premium for a 1U over shoving multiple Mac Minis in the same 1U footprint, especially if they've already been doing that for years at this point...

...I might also speculate that if they did this, they'd have a serious problem, because if they're buying exclusive access to all TSMC's newest fab for extended intervals to meet demand on their existing products, they'd have issues finding sources to meet a potentially substantial demand in people wanting their machines for dense compute. (They could always opt to lag the server platforms behind on a previous fab that's not as competed with, of course, but that feels like self-sabotage if they're already competing with people shoving Mac Minis in a rack, and now the Mac Minis get to be a generation ahead, too?)

AceJohnny2

2 months ago

I will add that consumer macOS is a piss-poor server OS.

At one point, for many years, it would just sometimes fail to `exec()` a process. This would manifest as a random failure on our build farm about once/twice a month. (This would manifest as "/bin/sh: fail to exec binary file" because the error type from the kernel would have the libc fall back to trying to run the binary as a script, as normal for a Unix, but it isn't a script)

This is likely stemming from their exiting the server business years ago, and focusing on consumer appeal more than robustness (see various terrible releases, security- and stability-wise).

(I'll grant that macOS has many features that would make it a great server OS, but it's just not polished enough in that direction)

AceJohnny2

2 months ago

> as normal for a Unix

veering offtopic, did you know macOS is a certified Unix?

https://www.opengroup.org/openbrand/register/brand3581.htm

As I recall, Apple advertised macOS as a Unix without such certification, got sued, and then scrambled to implement the required features to get certification as a result. Here's the story as told by the lead engineer of the project:

https://www.quora.com/What-goes-into-making-an-OS-to-be-Unix...

jorams

2 months ago

This comes up rather often, and on the last significant post about it I saw on HN someone pointed out that the certification is kind of meaningless[1]. macOS poll(2) is not Unix-compliant, hasn't been since forever, yet every new version of macOS gets certified regardless.

[1]: https://news.ycombinator.com/item?id=41823078

znpy

2 months ago

lovely, i favorited that comment!

autoexecbat

2 months ago

and Windows used to be certified for posix, but none of that matters these days if it's not bug-compatible with Linux

rincebrain

2 months ago

Did that ever get fixed? That...seems like a pretty critical problem.

AceJohnny2

2 months ago

Yes, it quietly stopped happening a few years ago, sometime since 2020.

outworlder

2 months ago

> I will add that consumer macOS is a piss-poor server OS.

Windows is also abysmal but it hasn't stopped people from using it.

But yes, it is too much of a desktop OS.

toast0

2 months ago

I wouldn't run a Windows server, but at least it can manage a SYN flood, whereas MacOS doesn't have syncookies or similar (their version of pf has the syncookie keyword, but it seems like it only works for traffic that transits the host, not for traffic that is terminated by the host). Windows also has some pretty nice stuff for servers like receive side scaling (afaik, Microsoft brought that to market, or at least was very early).

thatfrenchguy

2 months ago

There is a rack mount version of the Mac Pro you can buy

bigfatkitten

2 months ago

That's designed for the broadcast market, where they rack mount everything in the studio environment. It's not really a server, it has no out of band management, redundant power etc.

There are third party rack mounts available for the Mac Mini and Mac Studio also.

wpm

2 months ago

Rack mount models have LOM over MDM.

jauntywundrkind

2 months ago

For who? How would this help their core mission?

Maybe it becomes a big enough profit center to matter. Maybe. At the risk of taking focus away, splitting attention from the mission they're on today: building end user systems.

Maybe they build them for themselves. For what upside? Maybe somewhat better compute efficiency maybe, but I think if you have big workloads the huge massive AMD Turin super-chips are going to be incredibly hard to beat.

It's hard to emphasize just how efficient AMD is, with 192 very high performance cores on a 350-500W chip.

favorited

2 months ago

> Maybe they build them for themselves. For what upside?

They do build it for themselves. From their security blog:

"The root of trust for Private Cloud Compute is our compute node: custom-built server hardware that brings the power and security of Apple silicon to the data center, with the same hardware security technologies used in iPhone, including the Secure Enclave and Secure Boot. We paired this hardware with a new operating system: a hardened subset of the foundations of iOS and macOS tailored to support Large Language Model (LLM) inference workloads while presenting an extremely narrow attack surface. This allows us to take advantage of iOS security technologies such as Code Signing and sandboxing."

<https://security.apple.com/blog/private-cloud-compute/>

jauntywundrkind

2 months ago

This is such a narrow narrow tiny corner of computing needs. That has such serious need for ownership, no matter the cost. And has extremely fantastically chill as shit overall computing needs, is un-performance-sensitive as it gets.

I could not be less convinced by this information that this is a useful indicator for the other 99.999999999% of computing needs.

favorited

2 months ago

Good, because you can’t have one.

shivak

2 months ago

> The power shelf distributes DC power up and down the rack via a bus bar. This eliminates the 70 total AC power supplies found in an equivalent legacy server rack within 32 servers, two top-of-rack switches, and one out-of-band switch, each with two AC power supplies

This creates a single point of failure, trading robustness for efficiency. There's nothing wrong with that, but software/ops might have to accommodate by making the opposite tradeoff. In general, the cost savings advertised by cloud infrastructure should be more holistic.

dralley

2 months ago

>This creates a single point of failure, trading robustness for efficiency. There's nothing wrong with that, but software/ops might have to accommodate by making the opposite tradeoff.

I'll happily take a single high quality power supply (which may have internal redundancy FWIW) over 70 much more cheaply made power supplies that stress other parts of my datacenter via sheer inefficiency, and also cost more in aggregate. Nobody drives down the highway with 10 spare tires for their SUV.

shivak

2 months ago

A DC busbar can propagate a short circuit across the rack, and DC circuit protection is harder than AC. So of course each server now needs its own current limiter, or a cheap fuse.

But I’m not debating the merits of this engineering tradeoff - which seems fine, and pretty widely adopted - just its advertisement. The healthcare industry understands the importance of assessing clinical endpoints (like mortality) rather than surrogate measures (like lab results). Whenever we replace “legacy” with “cloud”, it’d be nice to estimate the change in TCO.

malfist

2 months ago

DC circuit protection is absolutely not harder than AC. DC has the advantage in current flowing in only one direction, not two

paddy_m

2 months ago

Which makes it much harder to break the circuit vs AC

wbl

2 months ago

At 48 volts arcing shorts aren't the concern.

fracus

2 months ago

No one drives down the highway with one tire either.

AcerbicZero

2 months ago

Careful, unicyclists are an unforgiving bunch.

hn-throw

2 months ago

Let's say your high quality supply's yearly failure rate is 100 times less than the cheap ones

The probability of at least a single failure is 1-(1-r)^70.

This is quite high even w/out considering the higher quality of the one supply.

The probability of all 70 going down is

r^70 which is absurdly low.

Let's say r = 0.05 or one failed supply every 20 in a year.

1-(1-r)^70 ≈ 97%, while r^70 < 1E-91.

The high quality supply has r = 0.0005, in between no failure and all failing. If your code can handle node failure, very many cheaper supplies appear to be more robust.

(Assuming uncorrelated events. YMMV)
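The arithmetic above can be checked with a few lines of Python. This is only a sketch of the comment's own model: the failure rates (r = 0.05 for a cheap supply, 100 times lower for the good one) are the comment's assumptions, and failures are treated as independent.

```python
def p_at_least_one_failure(r: float, n: int) -> float:
    """Probability that at least one of n independent supplies fails in a year."""
    return 1.0 - (1.0 - r) ** n


def p_all_fail(r: float, n: int) -> float:
    """Probability that all n independent supplies fail in the same year."""
    return r ** n


n = 70                  # supplies in the legacy rack
cheap_r = 0.05          # assumed yearly failure rate of a cheap supply
good_r = cheap_r / 100  # assumed rate for the single high quality supply

print(f"at least one of {n} cheap supplies fails: {p_at_least_one_failure(cheap_r, n):.1%}")
print(f"all {n} cheap supplies fail: {p_all_fail(cheap_r, n):.1e}")
print(f"the single high quality supply fails: {good_r:.2%}")
```

Note the model says nothing about what a single failure costs: with 70 supplies some node almost certainly goes down each year, while with one supply a failure is rare but takes the whole rack with it.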

carlhjerpe

2 months ago

Yeah but the failure rate of an analog piece of copper is pretty low, it'll keep being copper unless you do stupid things. You'll have multiple power supplies provide power on the same piece of copper

hn-throw

2 months ago

TL/DR, isn't there a single, shared, DC supply that supplies said piece of copper? Presumably connected to mains?

Or are they running on SOFCs?

mycoliza

2 months ago

The big piece of copper is fed by redundant rectifiers. Each power shelf has six independent rectifiers which are 5+1 redundant if the rack is fully loaded with compute sleds, or 3+3 redundant if the rack is half-populated. Customers who want more redundancy can also have a second power shelf with six more rectifiers.
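To put rough numbers on what 5+1 versus 3+3 buys, here is a small binomial sketch. It is not Oxide's reliability model: `p` is a hypothetical probability that any one rectifier is out of service at a given moment, failures are assumed independent, and hot-swap repair is ignored.

```python
from math import comb


def p_shelf_down(n: int, needed: int, p: float) -> float:
    """Probability that fewer than `needed` of `n` rectifiers are working.

    Each rectifier is assumed to be failed with independent probability p.
    """
    max_tolerated = n - needed  # failures the shelf can absorb
    p_ok = sum(
        comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(max_tolerated + 1)
    )
    return 1.0 - p_ok


p = 0.05  # hypothetical per-rectifier unavailability, for illustration only
print(f"6 rectifiers, no spare (6+0): {p_shelf_down(6, 6, p):.4f}")
print(f"fully loaded rack (5+1):      {p_shelf_down(6, 5, p):.4f}")
print(f"half-populated rack (3+3):    {p_shelf_down(6, 3, p):.6f}")
```

Even with a deliberately pessimistic `p`, each extra spare cuts the chance of the shelf being unable to carry its load by a large factor.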

hn-throw

2 months ago

I'm going to assume this is on 3 phase power, but how is the ripple filtered?

sunshowers

2 months ago

Look very carefully at the picture of the rack at https://oxide.computer/ :) there are two power shelves in the middle, not one.

We're absolutely aware of the tradeoffs here and have made quite considered decisions!

jsolson

2 months ago

The bus bar itself is an SPoF, but it's also just dumb copper. That doesn't mean that nothing can go wrong, but it's pretty far into the tail of the failure distribution.

The power shelf that keeps the busbar fed will have multiple rectifiers, often with at least N+1 redundancy so that you can have a rectifier fail and swap it without the rack itself failing. Similar things apply to the battery shelves.

immibis

2 months ago

It's also plausible to have multiple power supplies feeding the same bus bar in parallel (if they're designed to support this) e.g. one at each end of a row.

eaasen

2 months ago

This is how our rack works (Oxide employee). In each power shelf, there are 6 power supplies and only 5 need to be functional to run at full load. If you want even more redundancy, you can use both power shelves with independent power feeds to each so even if you lose a feed, the rack still has 5+1 redundant power supplies.

walrus01

2 months ago

The whole thing with eliminating 70 discrete 1U server size AC-to-DC power supplies is nothing new. It's the same general concept as the power distribution unit in the center of an open compute platform rack design from 10+ years ago.

Everyone who's doing serious datacenter stuff at scale knows that one of the absolute least efficient, labor intensive and cabling intensive/annoying ways of powering stuff is to have something like a 42U cabinet with 36 servers in it, each of them with dual power supplies, with power leads going to a pair of 208V 30A vertical PDUs in the rear of the cabinet. It gets ugly fast in terms of efficiency.

The single point of failure isn't really a problem as long as the software is architected to be tolerant of the disappearance of an entire node (mapping to a single motherboard that is a single or dual cpu socket config with a ton of DDR4 on it).

formerly_proven

2 months ago

That’s one reason why 2U4N systems are kinda popular. 1/4 the cabling in legacy infrastructure.

jeffbee

2 months ago

PDUs are also very failure-prone and not worth the trouble.

sidewndr46

2 months ago

This isn't even remotely close. Unless all 32 servers have redundant AC power feeds present, you've traded one single point of failure for another single point of failure.

In the event that all 32 servers had redundant AC power feeds, you could just install a pair of redundant DC power feeds.

gruez

2 months ago

>Unless all 32 servers have redundant AC power feeds present, you've traded one single point of failure for another single point of failure.

Is this not standard? I vaguely remember that rack servers typically have two PSUs for this reason.

glitchcrab

2 months ago

It's highly dependent on the individual server model and quite often how you spec it too. Most 1U Dell machines I worked with in the past only had a single slot for a PSU, whereas the beefier 2U (and above) machines generally came with 2 PSUs.

thfuran

2 months ago

But 2 PSUs plugged into the same AC supply still have a single point of failure.

glitchcrab

2 months ago

Which is why you have two separate PDUs in the rack which are fed by different power feeds and you connect the server's 2 PSUs to opposing PDUs.

growse

2 months ago

This works brilliantly, right up to the point where your A side fails, and every single server suddenly doubles their demand on B.

Better have good capacity management so you don't go over 100% on B when that happens! (I've seen it happen and take a DC out).
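The capacity rule behind this is simple enough to write down: with A/B feeds, each side has to be able to carry the combined load on its own. A minimal sketch, where the 30 A rating, the example loads, and the 80% continuous-load derating are all illustrative assumptions rather than figures from this thread:

```python
def survives_feed_loss(load_a_amps: float, load_b_amps: float,
                       feed_rating_amps: float, derate: float = 0.8) -> bool:
    """True if one feed alone could carry the load of both sides.

    `derate` is an assumed continuous-load limit as a fraction of the rating.
    """
    usable_amps = feed_rating_amps * derate
    return (load_a_amps + load_b_amps) <= usable_amps


# Each side looks comfortable at 14 A on a 30 A feed, but 28 A lands on the
# survivor when the other side fails, which exceeds the 24 A usable limit.
print(survives_feed_loss(14.0, 14.0, 30.0))  # False
print(survives_feed_loss(11.0, 11.0, 30.0))  # True
```

In other words, each side has to be planned to run below half of its usable capacity in normal operation.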

jeffbee

2 months ago

Rack servers have two PSUs because enterprise buyers are gullible and will buy anything. Generally what happens in case of a single PSU failure is the other PSU also fails or it asserts PROCHOT which means instead of a clean hard down server you have a slow server derping along at 400MHz which is worse in every possible way.

sidewndr46

2 months ago

you could have 15 PSUs in a server. It doesn't mean they have redundant power feeds

MisterTea

2 months ago

> This creates a single point of failure,

Who told you there is only one PSU in the power shelf?

ZeroCool2u

2 months ago

If any Oxide staff are here, I'm just curious, is BlueSky a customer? Seems like it would fit well with their on-prem setup.

mkeeter

2 months ago

Nope, but many of us (Oxide staff) are big fans of what Bluesky is doing!

One of the Bluesky team members posted about their requirements earlier this month, and why Oxide isn't a great fit for them at the moment:

https://bsky.app/profile/jaz.bsky.social/post/3laha2upw3k2z

ZeroCool2u

2 months ago

Appreciate the reply! Been following Oxide for a few years now and really enjoy the technical blogs :)

AceJohnny2

2 months ago

> Also prices don't make sense for us.

Oof.

tptacek

2 months ago

Why is that "oof"? They're using commodity servers today. Oxide does not offer commodity servers.

AceJohnny2

2 months ago

Just that it highlights the challenge that Oxide faces, that they're effectively offering a "luxury" product in a deeply commoditized space.

tptacek

2 months ago

That's true if you think the market is SaaS upstarts like Bluesky and maybe less true if you think of the market in terms of who buys hardware. I remember early on at Matasano working for a house account, a major US corp that isn't a household name, and being shocked 2 years in when I finally had to do something in their data center (a FCIP appliance assessment) and seeing how much they'd spent on it. Look at everyone who runs (and wishes they weren't) z/OS today, or Oracle. There's more of them than I think a lot of HN people think.

cplwankery

2 months ago

Good on Oxide to bring back the era of expensive, proprietary hardware that everybody loved so much.

danpalmer

2 months ago

Not Oxide or Bluesky, but firstly I'd suggest that asking the company about their customers is unlikely to get a response, most companies don't disclose their customers. Secondly, Bluesky have been growing quickly, I can only assume their hardware is too, and that means long lead time products like an Oxide rack aren't going to work, especially when you can have an off the shelf machine from Dell delivered in a few days.

steveklabnik

2 months ago

Oxide is very open, we are happy to talk about customers that allow us to talk about them. Some don’t want to, others are very happy to be mentioned, just like any other company.

danpalmer

2 months ago

> we are happy to talk about customers that allow us to talk about them

This is what I meant by "don't disclose", I didn't mean that Oxide was in any way secretive, but that usually this stuff doesn't get agreed, and that it would make more sense to ask the customer rather than the company selling as Oxide won't want to disclose unless there's already an agreement in place (formal or otherwise).

steveklabnik

2 months ago

Gotcha. That totally makes sense, I wouldn’t have thought about it that way.

ramon156

2 months ago

> most companies don't disclose their customers

In my head I'm imagining an average landing page. They slap their customers on there like stickers. I doubt bluesky would stay secretive about using oxide if they did

slyall

2 months ago

Those customers listed on the front page of companies are there as part of an agreement. Usually something like a discount. Certainly they are not listed without permission. 10x that if it is a case study.

danpalmer

2 months ago

I think they often are listed without permission unfortunately, and often literally based on the email addresses of people signing up for a trial. I see my company's logo on the landing page of many products that we don't use or may even have a policy preventing our use of.

tptacek

2 months ago

events.bsky appears to be hosted on OVH. Single-product SAAS companies less than a few years old are unlikely to be a major customer cohort for Oxide.

ccorcos

2 months ago

From the title, I was expecting to read about how oxidation (aka rust) reduces power throughput capacity

rajnathani

2 months ago

Is this just the main reason?

> Replacing low-efficiency AC power supplies with a high-efficiency DC Bus Bar

The part after it about better cooling fans, meh, there are more efficient liquid-cooling methods including immersion-cooling which are already there in implementation albeit niche.

kev009

2 months ago

Where is the GPU?

steveklabnik

2 months ago

We don’t currently have GPUs in the product. The closed-ness of the GPU space is a bit of a cultural difference, but we’ll surely have something eventually. As a small company, we have to focus on our strengths, and there’s plenty of folks who don’t need GPUs right now.

kev009

2 months ago

That's fine, just awkward because the GS report shows the TAM or problem depending on your perspective being accelerated computing.

steveklabnik

2 months ago

For sure. It’s not just GPUs; given that we have one product with three SKUs, there’s a variety of workloads we won’t be appropriate for just yet. Just takes time to diversify the offering.

kev507

2 months ago

maybe the real GPU was the friends we made along the way

PreInternet01

2 months ago

"If only they used DC from the wall socket, all those H100s would be green" is not, I think, the hill you want to die on.

But, yeah, my three 18MW/y racks agree that more power efficiency would be nice, it's just that Rewrite It In (Safe) Rust is unlikely to help with that...

yjftsjthsd-h

2 months ago

> it's just that Rewrite It In (Safe) Rust is unlikely to help with that...

I didn't see any mention of Rust in the article?

PreInternet01

2 months ago

[flagged]

bigfatkitten

2 months ago

They wrote their own BMC and various other bits and pieces in Rust. That's an extremely tiny part of the whole picture.

steveklabnik

2 months ago

It’s significantly more than that, but it’s also true that we include stuff in other languages where appropriate. CockroachDB is in Go, and illumos is in C, as two examples. But almost all new code we write is in Rust. That is the stuff you’re talking about, but also like, our control plane.

Oh and we write a lot of Typescript too.

rcxdude

2 months ago

I think it's hard to call it a reason. It is a tool which fits in with the philosophy of the company in terms of how to achieve its goals, but I think it would still exist if rust didn't. I would describe the goal as making a hyperscaling system that can be sold as a product, the philosophy of how to make this is an aggressive focus on integration, openness, and quality, and that rust is a language that works well with the last two of those goals.

sam_bristow

2 months ago

It's also not really a case of "rewriting in Rust" anyway, it's more just "writing it in Rust" since most of the stuff the Oxide team has built is greenfield work.

mycoliza

2 months ago

We also sell computers... :)

transpute

2 months ago

OSS Rust in Rack trenchcoat.

sophacles

2 months ago

That's an interesting take. What's your reasoning? What's your evidence?

0x457

2 months ago

Pretty much everything Oxide publishes on github is either in rust or it's an sdk to a service in rust. Well, and the web panel isn't in rust, so negative points for that, true evangelists would have used WASM.

But Oxide's reason to exist is to keep the memory of cool racks from Sun running Solaris alive forever.

murderfs

2 months ago

The raison d'être of Oxide isn't Rust, it's continuing to pretend that the bloated corpse of Solaris still has some signs of life.

shrubble

2 months ago

18MW/year is not a real unit of measurement; did you mean MWh?
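The distinction is between power (MW, a rate) and energy (MWh, power multiplied by time). A quick sketch of the conversion, with the 1 MW draw chosen purely as an example and not as a guess at what the parent meant:

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def energy_mwh(power_mw: float, hours: float) -> float:
    """Energy in MWh consumed by a constant draw of power_mw over `hours`."""
    return power_mw * hours


def average_power_mw(energy_mwh_total: float, hours: float) -> float:
    """Average power in MW implied by an energy total over `hours`."""
    return energy_mwh_total / hours


# A constant 1 MW draw for a year is 8760 MWh of energy.
print(energy_mwh(1.0, HOURS_PER_YEAR))
```

Going the other way, an annual figure in MWh divided by 8760 gives the average draw in MW.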

einpoklum

2 months ago

> How can organizations reduce power consumption and corresponding carbon emissions?

Stop running so much useless stuff.

Also maybe ARM over x86_64 and similar power-efficiency-oriented hardware.

Rack-level system design, or at least power & cooling design, is certainly also a reasonable thing to do. But standardization is probably important here, rather than some bespoke solution which only one provider/supplier offers.

> How can organizations keep pace with AI innovation as existing data centers run out of available power?

Waste less energy on LLM chatbots?

zamadatix

2 months ago

Current ARM servers actually generally offer "on par" (varies by workload) perf/Watt for generally worse absolute performance (varies by workload) i.e. require more other overhead to achieve the same total perf despite "on par" perf/Watt.

Need either Apple to get into the general market server business or someone to start designing CPUs as well as Apple (based on the comparison between different ARM cores I'm not sure it really matters if they do so using a specific architecture or not).

p_l

2 months ago

It's more a case of selection of optimization parameters and corresponding economy. It's not so much that apple towers over others in design (though they are absolutely no slouches and have wins there) but their design team is in position to coordinate with product directly and as such isn't as limited by "but will it sell in high enough numbers for the excel sheet at investor's desk?"

The real show stopper for years is that ARM servers are just not prepared to be a proper platform. uBoot with grudgingly included FDT (after getting kicked out of Linux kernel) does not make a proper platform, and often there's also no BMC, unique approaches to various parts making the server that one annoying weirdo in the data center, etc.

Cloud providers can spend the effort to backfill necessary features with custom parts, but doing so on your own on-prem is hard

zamadatix

2 months ago

Not sure what you mean wrt Apple's uniqueness. AMD/Mediatek/Intel/Qualcomm/Samsung only make margin on how well they invest on their designs vs their competitors and they'd all love to be outshipping each other and Apple in any market. All, including Apple, also rely on the same manufacturer for their top products and the ones (Intel/Samsung) with alternatives have not been able to use that as an advantage for top performing products. Sure, Apple can work directly with their own product... but at the end of the day the goal and available customer pool to fight over is the same and they still ship fewer units than the others.

I'm not hands-on familiar with other serious ARM server market players but for several years now Ampere ARM server CPUs at least are nothing like you describe. Phoronix says it best in https://www.phoronix.com/review/linux-os-ampereone

> All the Linux distributions I attempted worked out effortlessly on this Supermicro AmpereOne server. Like with Ampere Altra and Ampere eMAG before that, it's a seamless AArch64 Linux experience. Thanks to supporting open standards like UEFI, Arm SBSA/SBBR and ACPI and not having to rely on DeviceTrees or other nuisances, installing an AArch64 Linux distribution on Ampere hardware is as easy as in the x86_64 space.

p_l

2 months ago

Ampere is a bright spot in all of this, indeed. Just considerably late. I remember being bombarded by "ARM servers are going to eat the world" in 2013, but ARM couldn't deliver SBSA in shape that would make it possible and to this day I am left with serious doubts if any ARM board will work out right (there are bright spots though).

As for Apple "uniqueness", I met a lot of people who think that Apple "just" has so much better design team, when it's similar to what you say and the unique part is them being able to properly narrow their design space instead of chasing cost-conscious manufacturers.