Serving AI from the Basement – 192GB of VRAM Setup

318 points, posted 11 days ago
by XMasterrrr

115 Comments

XMasterrrr

11 days ago

Hey guys, this is something I have been intending to share here for a while. This setup took me some time to plan and put together, and then some more time to explore the software part of things and the possibilities that came with it.

A big part of the reason I built this was data privacy: I do not want to hand over my private data to any company to further train their closed-weight models. And given the recent drop in output quality on different platforms (ChatGPT, Claude, etc.), I don't regret spending the money on this setup.

I was also able to do a lot of cool things with this server by leveraging tensor parallelism and batch inference, generating synthetic data, and experimenting with finetuning models on my private data. I am currently building a model from scratch, mainly as a learning project, but I am also finding some cool things while doing so, and if I can get around to ironing out the kinks, I might release it and write a tutorial from my notes.
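
For readers curious what the tensor-parallel batch inference side can look like in practice, here is a minimal sketch using vLLM; the library, the model name, and the parameters are my own illustrative assumptions, not necessarily what the author runs:

    # Hypothetical sketch: tensor parallelism + batch inference across 8 GPUs.
    # Assumes vLLM is installed and the model fits in the pooled VRAM.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # illustrative model choice
        tensor_parallel_size=8,                          # shard the layers across all 8 GPUs
    )
    params = SamplingParams(temperature=0.7, max_tokens=256)
    prompts = [f"Write a one-line summary of topic {i}." for i in range(64)]
    outputs = llm.generate(prompts, params)              # one batched call for all prompts
    for out in outputs:
        print(out.outputs[0].text)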

So I finally had the time this weekend to get my blog up and running, and I am planning on following up this blog post with a series of posts on my learnings and findings. I am also open to topics and ideas to experiment with on this server and write about, so feel free to shoot your shot if you have ideas you want to experiment with but don't have the hardware; I am more than willing to do that on your behalf and share the findings.

Please let me know if you have any questions, my PMs are open, and you can also reach me on any of the socials I have posted on my website.

mattnewton

11 days ago

The main thing stopping me from going beyond 2x 4090’s in my home lab is power. Anything around ~2k watts on a single circuit breaker is likely to flip it, and that’s before you get to the cost of drawing that much power for multiple days of a training run. How did you navigate that in a (presumably) residential setting?

tcdent

11 days ago

I can't believe a group of engineers are so afraid of residential power.

It is not expensive, nor is it highly technical. It's not like we're factoring in latency and crosstalk...

Read a quick howto, cruise into Home Depot and grab some legos off the shelf. Far easier to figure out than executing "hello world" without domain expertise.

pupdogg

11 days ago

You can run a setup of 8x 4090 GPUs using 4x 1200W 240V power supplies (preferably HP HSTNS-PD30 Platinum Series), with a collective draw of just around 20 amps, meaning it can easily run on a single 240V 20-amp breaker. This should be easily doable in a home, where you typically have a 100 to 200A main power panel. Running 4x 1200W power supplies 24 hours a day will consume 115.2 kWh per day. At an electricity rate of $0.12 per kWh, this will cost approximately $13.82 per day, or around $414.72 per month.
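
For anyone who wants to rerun those numbers with their own electricity rate, a quick back-of-envelope (a sketch; it assumes the PSUs are loaded flat out, which real workloads rarely are):

    psus, watts_each, volts, rate = 4, 1200, 240, 0.12   # rate in $/kWh
    total_w = psus * watts_each                          # 4800 W total
    print(total_w / volts)                               # 20.0 A on a 240V circuit
    kwh_per_day = total_w / 1000 * 24                    # 115.2 kWh per day
    print(kwh_per_day * rate, kwh_per_day * rate * 30)   # ~$13.82/day, ~$414.72/month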

FYI, I can handle electrical system design and sheet metal enclosure design/fabrication for these rigs, but my software knowledge is limited when it comes to ML. If anyone's interested, I'd love to collaborate on a joint venture to produce these rigs commercially.

orbital-decay

11 days ago

>Anything around ~2k watts on a single circuit breaker is likely to flip it

I'm curious, how do you use e.g. a washing machine or an electric kettle, if 2kW is enough to flip your breaker? You should simply know your wiring limits. Breaker/wiring at my home won't even notice this.

throwthrowuknow

11 days ago

Not speaking from direct experience building a rig like this, but the blog post mentions having 3 power supplies, so the most direct solution would be to put each one on its own dedicated circuit. As long as you have space in your electrical box this is straightforward to do, though I would recommend having an electrician do the wiring if you aren’t experienced with that type of home electrical work.

bluedino

11 days ago

Take your typical 'GPU node', which would be a Dell/HP/SuperMicro with 4-8 NVIDIA H100s and a single top-end AMD/Intel CPU. You would need 2-4 240V outlets (30A).

In the real world you would plug them into a PDU such as: https://www.apc.com/us/en/product/AP9571A/rack-pdu-basic-1u-...

Each GPU will take around 700W and then you have the rest of the system to power, so depending on CPU/RAM/storage...
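
Rough math behind that outlet count (my own assumptions: ~700 W per GPU, ~1 kW for the rest of the box, and 30A 240V circuits loaded to the usual 80% continuous limit):

    import math

    gpus, gpu_w, rest_w = 8, 700, 1000
    total_w = gpus * gpu_w + rest_w            # 6600 W for a fully loaded 8-GPU node
    per_circuit_w = 30 * 240 * 0.8             # 5760 W continuous per 30A 240V circuit
    print(math.ceil(total_w / per_circuit_w))  # -> 2 circuits minimum, more for headroom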

And then you need to cool it!

fennecbutt

10 days ago

I suppose this is an American view. In most places with 240V you can run anything up to 3kW per socket most of the time. But you can also get a sparky in and go for a cheap high-current socket install on 240V, or even pay a bit more to get 3-phase installed, if you have a valid enough use case.

Hell, most kettles use 3kW. Though for a big server I'd get it wired dedicated, the same way power showers are done (~7-12 kW).

abound

11 days ago

Not OP, but my current home had a dedicated 50A/240V circuit because the previous owner did glass work and had a massive electric kiln. I can't imagine it was cheap to install, but I've used it for beefy, energy hungry servers in the past.

Which is all to say it's possible in a residential setting, just probably expensive.

slavik81

11 days ago

Not the OP, but I hired an electrician to put in a 30A 240V circuit with a NEMA L6-30R receptacle next to my electrical panel. It was 600 CAD. You can probably get it done cheaper; he had to disconnect another circuit and make a trip to the hardware store because I told him to bring the wrong breaker.

GaggiX

11 days ago

I use a hair dryer that draws a little bit more than 2 kW, but I guess because of the 120V it would be a problem in the US.

16 amps x 120v = 1920W, it would probably trip after several minutes.

16 amps x 230v = 3680W, it wouldn't trip.

sandos

11 days ago

This is funny as a European, since we have many, many circuit groups where we regularly run 2 kW, and on some, a lot more. Really no issue, but I guess lower voltage makes it a problem.

teaearlgraycold

11 days ago

I’ve run 3x L40S on a 1650W PSU on a normal 120V 20A circuit.

littlestymaar

11 days ago

Then just add a 32A circuit breaker to your electrical installation; it's not a big deal, really.

XMasterrrr

11 days ago

Oh yeah, my original setup was an RTX 4090 + an RTX 3090, and I swear one night I had the circuit breaker trip more than 15 times before I gave up. I have a UPS, so I would run to the box before my system shut down. Most houses are equipped with 15-amp 120V breakers; these should never exceed 1500W continuous, and their max is 1800W, but then you're really risking it.

So, as mentioned in the article, I actually installed (2) 30-amp 240V breakers dedicated entirely to this setup (and the next one, in case I decide to expand to 16x GPUs over 2 nodes lol). Each breaker is supposed to power up to 6000w with ease. I also installed a specific kind of power outlet that can handle that kind of current, and I have done some extensive research into PDUs. I plan on posting about all of that in this series (part 3 according to my current tentative plans), so stay tuned and maybe bookmark the website/add the RSS feed to your digest/or follow me on any of the socials if this is something that you wanna nail down without spending a month on research like me :'D
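
For reference, those headroom figures roughly track the usual 80% continuous-load rule of thumb for breakers; a quick sketch (the derating factor and wattages are my assumptions, not quoted from the article):

    def continuous_watts(amps, volts, derate=0.8):
        """Approximate continuous load a breaker should carry (80% rule of thumb)."""
        return amps * volts * derate

    print(continuous_watts(15, 120))   # ~1440 W: why two big GPUs can trip a 15A circuit
    print(continuous_watts(30, 240))   # ~5760 W: comfortable for an 8x 3090 node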

nullindividual

11 days ago

Do you run this 24/7?

What is your cost of electricity per kilowatt hour and what is the cost of this setup per month?

michaelt

11 days ago

I have a much smaller setup than the author - a quarter the GPUs and RAM - and I was surprised to find it draws 300W at idle.

trollbridge

11 days ago

This is a setup that might make more sense to run at full power during winter months.

nrp

11 days ago

How are you finding 2-bit/3-bit quantized Llama 405B? Is it behaving better than 8-bit or 16-bit Llama 70B?

pupdogg

11 days ago

Amazing setup. I have the capability to design, fabricate, and powder coat sheet metal. I would love to collaborate on designing and fabricating a cool enclosure for this setup. Let me know if you're interested.

koyote

11 days ago

This is undoubtedly cool and I am a bit jealous!

Maybe a bit of a stupid question, but what do you actually do with the models you run/build, apart from tinkering? I'd assume most tinkering can also be done on smaller systems? Is it in order to build a model that is actually 'useful'/competitive?

faangguyindia

11 days ago

I tried self-hosting an LLM for a command-line instant completion and guidance utility: https://github.com/zerocorebeta/Option-K

But the problem is that even 7B models are too slow on my PC.

Hosted models are lightning fast. I considered the possibility of buying hardware but decided against it.

bravura

11 days ago

How loud is it? Was special electrical needed?

lossolo

11 days ago

Cool, it looks similar to my crypto mining rigs (8xGPU per node) from around 7 years ago, but I used PCI-E risers and a dual power supply.

TaylorAlexander

11 days ago

[flagged]

sva_

11 days ago

A single 3090 will deliver more TFLOPS than the M2 Ultra.

bongodongobob

11 days ago

He's got 8x 3090s, are you fucking kidding? Like, is this some kind of AI reply?

"Wow great post! I enjoy your valuable contributions. Can you tell me more about graphics cards and how they compare to other different types of computers? I am interested and eager to learn! :)"

wkat4242

11 days ago

> And who knows, maybe someone will look back on my work and be like “haha, remember when we thought 192GB of VRAM was a lot?”

I wonder if this will happen. It's already really hard to buy big HDDs for my NAS because nobody buys external drives anymore. So the pricing has gone up a lot for the prosumer.

I expect something similar to happen to AI. The big cloud players are all leaders on LLMs, and their goal is to keep us beholden to their cloud services. Cheap home hardware with serious capability is not something they're interested in. They want to keep it out of our reach so we can pay them rent and they can mine our data.

Eisenstein

11 days ago

It isn't that cloud providers want to shut us out, it is that nVidia wants to relegate AI-capable cards to the high-end enterprise tier. So far in 2024 they have made $10.44b in revenue from the gaming market, and over $47.5b in the datacenter market, and I would bet that there is much less profit in gaming. In order to keep the market segmented they stopped putting NVLink on gaming cards and have capped VRAM at 24GB for the highest-end GPUs (3090 and 4090), and it doesn't look much better for the upcoming 5090. I don't blame them, they are a profit-maximizing corporation after all, but if anything is to be done about making large AI models practical for hobbyists, start with nVidia.

That said, I really don't think that the way forward for hobbyists is maxing out VRAM. Small models are becoming much more capable, accelerators are a possibility, and there may not be a need for a person to run a 70-billion-parameter model in memory at all when there are MoEs like Mixtral and small capable models like Phi.

Saris

11 days ago

>It's already really hard to buy big HDDs for my NAS because nobody buys external drives anymore. So the pricing has gone up a lot for the prosumer.

I buy refurb/used enterprise drives for that reason, generally around $12 per TB for the recent larger drives, and around $6 per TB for smaller drives. You just need a SAS interface, but that's not difficult or expensive.

E.g., 25TB for $320, or 12TB for $80.

thelastparadise

11 days ago

> It's already really hard to buy big HDDs for my NAS

IME 20tb drives are easy to find.

I don't think the clouds have access to bigger drives or anything.

Similarly, we can buy 8x A100s, they're just fundamentally expensive whether you're a business or not.

There doesn't seem to be any "wall" up like there used to be with proprietary hardware.

wkat4242

11 days ago

They are easy to find but extremely expensive. I used to pay below 200€ for a 14TB Seagate 8 years ago. That's now above 300€. And the bigger ones are even more expensive.

For me these prices are prohibitive. Just like the A100s are (though those are even more so, of course).

The problem is the common consumer relying on the cloud, so these kinds of products become niches and lose volume. Also, the cloud providers don't pay what we do for a GPU or HDD. They buy them by the tens of thousands and get deep discounts. That's why the RRPs, which we do pay, are highly inflated.

gizmo686

11 days ago

The cloud companies do not make the hardware, they buy it like the rest of us. They are just going to be almost the entirety of the market, so naturally the products will be built and priced with that market in mind.

wkat4242

11 days ago

Yes and they get deep discounts which we don't. Can be 40% or more!

Of course the vendor can't make a profit with such discounts so they inflate the RRP. But we do end up paying that.

walterbell

11 days ago

An adjacent project for 8 GPUs could convert used 4K monitors into a borderless mini-wall of pixels, for local video composition with rendered and/or AI-generated backgrounds, https://theasc.com/articles/the-mandalorian

> the heir to rear projection — a dynamic, real-time, photo-real background played back on a massive LED video wall and ceiling, which not only provided the pixel-accurate representation of exotic background content, but was also rendered with correct camera positional data.. “We take objects that the art department have created and we employ photogrammetry on each item to get them into the game engine”

freeqaz

11 days ago

How much do the NVLinks help in this case?

Do you have a rough estimate of how much this cost? I'm curious since I just built my own 2x 3090 rig and I wondered about going EPYC for the potential to have more cards (stuck with AM5 for cheapness though).

All in all I spent about $3500 for everything. I'm guessing this is closer to $12-15k? CPU is around $800 on eBay.

lvl155

11 days ago

My reason for going Epyc was the PCIe lanes and cheaper enterprise SSDs via U.3/U.2. With AM5, you tap out the lanes with dual GPUs. Threadripper is preferable, but Epyc is about half the price, or even better if you go last gen.

Eisenstein

11 days ago

Why do you need such high cross card bandwidth for inference? Are you hosting for a lot of users at once?

darknoon

10 days ago

I tried this w/ AM5, but realized that despite there theoretically being enough lanes for dual x16 PCI-e 4.0 GPUs, I couldn't find any motherboards that are actually configured this way, since dual-GPU is dead in consumer for gaming.

Tepix

11 days ago

I built this in early 2023 out of used parts and ended up with a cost of 2300€ for AM4 / 128GB / 2x 3090 @ PCIe 4.0 x8 + NVLink.

RockRobotRock

11 days ago

I haven't been able to find a good answer on what difference NVLink makes or which applications support it.

fragmede

11 days ago

NVLink is what makes multi-GPU setups work well. It lets the GPUs talk to each other across a high-bandwidth (up to 600 GB/s on A100-class hardware), low-latency link instead of going over PCIe. TensorFlow and PyTorch both support it (via NCCL), among other things. It's not this weird thing that's a side note; the interconnect between nodes is what makes a supercomputer super. You don't hear about it much because you don't hear about a lot of details of supercomputer stuff in mainstream media.
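
As a concrete illustration of what that interconnect is doing, here is a hedged sketch of an all-reduce bandwidth check with PyTorch's NCCL backend (NCCL uses NVLink automatically when a bridge is present, otherwise it falls back to PCIe); the script name and tensor sizes are made up for the example:

    # Launch with e.g.: torchrun --nproc_per_node=2 allreduce_bench.py  (hypothetical filename)
    import time
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    x = torch.randn(64 * 1024 * 1024, device="cuda")   # ~256 MB of fp32 per GPU
    for _ in range(5):                                  # warm-up iterations
        dist.all_reduce(x)
    torch.cuda.synchronize()
    start, iters = time.time(), 20
    for _ in range(iters):
        dist.all_reduce(x)                              # sums x across all GPUs
    torch.cuda.synchronize()
    if rank == 0:
        moved_gb = x.numel() * 4 * iters / 1e9
        print(f"~{moved_gb / (time.time() - start):.1f} GB/s effective all-reduce throughput")
    dist.destroy_process_group()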

modeless

11 days ago

I wonder how the cost compares to a Tinybox. $25k for 6x 4090 or $15k for 6x 7900XTX. Of course that's the full package with power supplies, CPU, storage, cooling, assembly, shipping, etc. And a tested, known good hardware/software configuration which is crucial with this kind of thing.

Tepix

11 days ago

If you merely want CUDA and lots of VRAM, there's no reason to pick expensive 4090s over used 3090s.

halJordan

10 days ago

Well, there is, and it's called performance. You don't have to push your version of what an appropriate price/performance ratio is.

angoragoats

11 days ago

You can build a setup like in the OP for somewhere around $10k, depending on several factors, the most important of which are the price you source your GPUs at ($700 per 3090 is a reasonable going rate) and what CPU you choose (high core count, high frequency Epyc CPUs will cost more).

itomato

11 days ago

With a rental option coming, it’s hard for me to imagine a more profitable way to use a node like that.

choilive

11 days ago

I have a similar setup in my basement! Although it's multiple nodes, with a total of 16x 3090s. I also needed to install a 30A 240V circuit.

lvl155

11 days ago

That last part is often overlooked. This is also why sometimes it's just not worth going local, especially if you don't need all that compute power beyond a few days.

buildbot

11 days ago

100% agree, anything beyond 4x GPUs is getting into very-annoying-to-power territory and makes the cloud very attractive. I can already trip a 15A circuit on 115V power with just 3x 4090s and an SPR-X CPU.

It also costs a lot to power. In the summer, 2x more than you expect, because unless it's outside, you need to cool 1000+ watts of extra heat with your AC. All that together and RunPod starts to look very tempting!

flixf

11 days ago

Very interesting! How are the 8 GPUs connected to the motherboard? Based on the article and the pictures, he doesn't appear to be using PCIe risers.

I have a setup with 3 RTX 3090 GPUs and the PCIe risers are a huge source of pain and system crashes.

lbotos

11 days ago

I had the same question. I was curious what retimers he was using.

I've had my eye on these for a bit https://c-payne.com/

system2

11 days ago

Typical crypto miner setup. I had two 6-GPU setups with 1200W PSUs and 6 PCIe slots with PCIe extender cables. Its value dropped harder than a Cybertruck's after a few months.

The worst thing is dust. It would accumulate so much that every week I had to blow it off with an air compressor.

Electricity cost was around $4 a day (~24 kWh x $0.20). If online GPU renting is more expensive, maybe the initial cost could be justifiable.

Havoc

11 days ago

> Typical crypto miner setup.

Except not doing the sketchy x1 PCIe lanes. That's the part that makes nice LLM setups hard.

system2

11 days ago

Can you tell me what's sketchy about it? I have not had an issue with any one of the 12 extenders, and bandwidth held up well without any issues. Please explain, if possible, whether LLMs require a different type of extender.

killingtime74

11 days ago

Did everyone just miss the fact that the post says the intention is to run Llama 3 405B, but it has less than 1/4 of the VRAM required to do so? Did you just change your goals mid-build? It's commonly known how much RAM is required for a given parameter count.

nathanasmith

7 days ago

The system has 512 GB of RAM, so while it'll be slower at inference, he really has about 704 GB at his disposal to run the model, assuming he distributes the weights across the VRAM and system RAM.
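
A sketch of how that kind of VRAM-plus-system-RAM split is commonly done with llama.cpp-style layer offloading (llama-cpp-python, the model path, and the layer count here are my illustrative assumptions, not the OP's actual stack):

    from llama_cpp import Llama

    llm = Llama(
        model_path="models/llama-3.1-405b-instruct-q4_k_m.gguf",  # hypothetical path
        n_gpu_layers=80,   # offload as many layers as fit in VRAM; the rest stay in system RAM
        n_ctx=8192,        # context window
    )
    out = llm("Q: Why is partial offloading slower than pure VRAM inference? A:", max_tokens=128)
    print(out["choices"][0]["text"])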

schaefer

11 days ago

Amazing writeup. And what a heavy hitter of an inaugural blog entry...

This might be the right time to ask: so, on the one hand, this is what it takes to pack 192GB of Nvidia-flavored VRAM into a home server.

I'm curious, is there any hope of doing any interesting work on a MacBook Pro, which currently can be max-specced at 128 GB of unified memory (for the low, low price of $4.7k)?

I know there's no hope of running CUDA on the MacBook, and I'm clearly out of my depth here. But the possibly naive daydream of tossing a massive LLM into a backpack is alluring...

Eisenstein

11 days ago

Download koboldcpp and give it a try. It is a single executable and uses Metal acceleration on Apple Silicon.

sireat

10 days ago

I was under the mistaken impression that you could not go beyond 2x3090 for reasonable inference speed.

My assumption was that going beyond 2 cards incurs a significant bandwidth penalty, since you go from NVLink between the 2x 3090s to PCIe for communicating with the other 3090s.

What kind of T/s speeds are you getting with this type of 8x3090 setup?

Presumably then even crazier 16x4090 would be an option for someone with enough PCIe slots/risers/extenders.

SmellTheGlove

11 days ago

I thought I was balling with my dual 3090s with NVLink. I haven't quite figured out what to do with 48GB of VRAM yet.

I hope this guy posts updates.

lxe

11 days ago

Run 70B LLMs, of course.

3eb7988a1663

11 days ago

What is the power draw under load/idle? Does it noticeably increase the room temperature? Given the surroundings (aka the huge pile of boxes behind the setup), curious if you could get away with just a couple of box fans instead of the array of case fans.

Are you intending to use the capacity all for yourself or rent it out to others?

NavinF

11 days ago

Box fans are surprisingly power hungry. You'd be better off using large 200mm PC fans. They're also a lot quieter

michaelt

11 days ago

If you care about noise, I also recommend not getting 8 GPUs with 3 fans each :)

illiac786

11 days ago

I dream of a future where the „home server with heat recuperation“ appliance will be common enough that I can get someone to install it physically for me - I have few electrical skills and zero plumbing skills. And I also hope that by then power consumption will have gone down.

maaaaattttt

11 days ago

Looking forward to reading this series.

As a side note, I’d love to find a chart/data on the cost/performance ratio of open-source models, and possibly then a $/Elo value (where $ is the cost to build and operate the machine, and Elo is a kind of proxy for the average performance of the model).

renewiltord

11 days ago

I have a similar one with 4090s. Very cool. Yours is nicer than mine where I've let the 4090s rattle around a bit.

I haven't had enough time to find a way to split inference, which is what I'm most interested in. Yours is also much better with the 1600 W supply; I have a hodgepodge.

deoxykev

10 days ago

Are you able to run 405B? The 4-bit quant VRAM requirements are just shy of 192GB.
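
Back-of-envelope for that figure (a sketch that counts weights only; the KV cache and per-GPU overhead are what make it so tight in practice):

    params = 405e9
    weight_bytes = params * 4 / 8          # pure 4-bit weights
    print(weight_bytes / 2**30)            # ~188.6 GiB, just under 8 x 24 GiB of VRAM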

Tepix

11 days ago

So, how do you connect the 8th card if you have 7 PCIe 4.0 x16 slots available?

manav

11 days ago

PCIe bifurcation - so splitting one of the x16 slots into two x8 or similar.

metadat

11 days ago

Worth mentioning - this also cuts the available bandwidth to each card by 50%.

tshadley

10 days ago

"Why PCIe Risers suck and the importance of using SAS Device Adapters, Redrivers, and Retimers for error-free PCIe connections."

I'm a believer! Can't wait to hear more about this.

elorant

11 days ago

The motherboard has 7 PCIe slots and there are 8 GPUs. So where does the spare one connect to? Is he using two GPUs in the same slot, limiting the bandwidth?

ganoushoreilly

11 days ago

He may be using an NVMe-to-PCIe adapter, common in the crypto mining world.

buildbot

11 days ago

It’s an Epyc server board; it probably has actual U.2/MCIO PCIe ports on the board that can be merged back into an x16 slot in the BIOS. I had/have several boards like that.

lowbloodsugar

11 days ago

Sometimes I think about dropping $10k to $20k on a rig like this and then I remember I can rent 8xH100s and 8xA100s with 640GB VRAM for $20/hr.

InsomniacL

11 days ago

When you moved into your house, did you think you would finish a PC build with 192GB of VRAM before you would finish the plasterboarding?

killingtime74

11 days ago

Maybe they removed it for better ventilation

LetsGetTechnicl

11 days ago

Just an eye-watering amount of compute, electricity, and money just to run LLMs... this is insane. Very cool though!

bogwog

11 days ago

Awesome! I've always wondered what something like this would look like for a home lab.

I'm excited to see your benchmarks :)

Havoc

11 days ago

Very cool. But also a bit pricey unless you can actually utilize it 24/7 in some productive fashion.

throwpoaster

11 days ago

Did you write this with the LLM running on the rig?

emptiestplace

11 days ago

Does this post actually seem LLM generated to you?

throwpoaster

11 days ago

It reads like an LLM draft with a human edit, yes.

cranberryturkey

11 days ago

This is why we need an actual AI blockchain, so we can donate GPU time and earn rewards for the P2P API calls using the distributed model.

walterbell

11 days ago

> donate GPU .. earn rewards

Is a blockchain needed to sell unused GPU capacity?

bschmidt1

11 days ago

That's actually interesting. While crypto GPU mining is "purposeless" or arbitrary, it would be way cooler if GPU mining meant chunking through computing tasks in a free/open queue (blockchain).

Eventually there could be some tipping point where networks are fast enough and there are enough hosting participants that it could be like a worldwide free computing platform - not just for AI, but for anything.

yunohn

11 days ago

This idea has been brought up tons of times by grifters aiming to pivot from Crypto to AI. The reason that GPUs are used for blockchains is to compute large numbers or proofs - which are truly useless but still verifiable so they can be distributed and rewarded. The free GPU compute idea misses this crucial point, so the blockchain part is (still) useless unless your aim is to waste GPU compute instead.

IRL, all you need is a simple platform to pay for and schedule jobs on others’ GPUs.

kcb

11 days ago

The problem is, once you have to scale to multiple GPUs, the interconnect becomes the primary bottleneck.

rvnx

11 days ago

You could just buy a Mac Studio for 6500 USD, have 192 GB of unified RAM and have way less power consumption.

lvl155

11 days ago

This is something people often say without even attempting a major AI task. If the Mac Studio were that great, they'd be sold out completely. It's not even cost-efficient for inference.

vunderba

11 days ago

I'm seeing this misunderstanding a lot recently. There are TWO components to putting together a viable machine learning rig:

- Fitting models in memory

- Inference / Training speed

8 x RTX 3090s will absolutely CRUSH a single Mac Studio in raw performance.

angoragoats

11 days ago

You could for sure, but the nVidia setup described in this article would be many times faster at inference. So it’s a tradeoff between power consumption and performance.

Also, modern GPUs are surprisingly good at throttling their power usage when not actively in use, just like CPUs. So while you need 3kW+ worth of PSU for an 8x3090 setup, it’s not going to be using anywhere near 3kW of power on average, unless you’re literally using the LLM 24x7.

exyi

11 days ago

Even if you are running it constantly, the per-token power consumption is likely going to be in a similar range, not to mention you'd need 10+ Macs for the throughput.

robotnikman

11 days ago

I have a 3090 power-capped at 65%, and I only notice a minimal difference in performance.
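
For anyone wanting to try the same thing, a hedged sketch of setting such a cap programmatically via NVML (pynvml), which is what nvidia-smi's power-limit option does under the hood; the 65% figure mirrors the comment above, and it needs root:

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    default_mw = pynvml.nvmlDeviceGetPowerManagementDefaultLimit(handle)   # milliwatts
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, int(default_mw * 0.65))
    print(f"capped GPU 0 at {default_mw * 0.65 / 1000:.0f} W")
    pynvml.nvmlShutdown()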

steve_adams_86

11 days ago

I know it's a fraction of the size, but my 32GB studio gets wrecked by these types of tasks. My experience is that they're awesome computers in general, but not as good for AI as people expect.

Running Llama 3.1 70B is brutal on this thing. Responses take minutes. Someone running the same model on 32GB of GPU memory seems to have far better results, from what I've read.

irusensei

11 days ago

You are probably swapping. On an M3 Max with similar memory bandwidth, the output is around 4 t/s, which is normally on par with most people's reading speed. Try different quants.

flemhans

11 days ago

Are people running llama 3.1 405B on them?

rspoerri

11 days ago

I'm running 70B models (usually in q4 .. q5_k_m, but q6 is possible) on my 96GB MacBook Pro with an M2 Max (12 CPU cores, 38 GPU cores). This also leaves me with plenty of RAM for other purposes.

I'm currently using reflection:70b_q4, which does a very good job in my opinion. It generates the response at 5.5 tokens/s, which is just about my reading speed.

edit: I usually don't run larger quants (q6) because of the speed. I'd guess a 405B model would just be awfully slow.

kcb

11 days ago

and have way less power