vunderba
42 minutes ago
From the article:
> Claude declared victory and pointed me to the output/result.mmd file, which contained only whitespace. So OCR had worked but the result had failed to be written correctly to disk.
Given the importance of TDD in this style of continual agentic loop, I was a bit surprised that the author only seems to have provided an input but no actual expected output.
Granted, this is more difficult with OCR since you don't really know how well DeepSeek-OCR will perform, but a simple Jaccard sanity test between a very legible input image and its expected output text would have made the loop a little more hands-off.
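(To make the sanity test concrete: a minimal token-level Jaccard check might look like the sketch below. Tokenizing on whitespace and lowercasing is my own simplifying assumption; a real harness might normalize punctuation too.)

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased word tokens: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not (sa | sb):          # both empty -> treat as identical
        return 1.0
    return len(sa & sb) / len(sa | sb)

# e.g. OCR output vs. known ground truth for a legible test image
print(jaccard("the quick brown fox", "the quick red fox"))  # → 0.6
```

An agent could then assert something like `jaccard(ocr_output, expected) > 0.9` rather than just checking that the output file is non-empty, which would have caught the whitespace-only result.mmd immediately.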
EDIT: After re-reading the article, I guess this was more of a test to see whether DeepSeek-OCR would run at all. But I bet you could set up a pretty interesting TDD harness using the aforementioned similarity metric, with an LLM in a REPL trying to optimize Tesseract parameters against specific document types, which was ALWAYS such a pain in the past.
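(A rough skeleton of what that harness could look like, under my own assumptions: the OCR call is a stub standing in for something like `pytesseract.image_to_string(img, config=f"--psm {psm} --oem {oem}")`, and the parameter grid and scoring function are illustrative, not the author's setup.)

```python
import itertools

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity, used as the fitness score."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def run_ocr(image_path: str, psm: int, oem: int) -> str:
    # Placeholder: a real harness would shell out to Tesseract here, e.g.
    # pytesseract.image_to_string(image_path, config=f"--psm {psm} --oem {oem}")
    raise NotImplementedError

def best_params(image_path: str, expected: str, ocr=run_ocr):
    """Grid-search Tesseract page-segmentation/engine modes against a
    ground-truth transcript, keeping the highest-Jaccard combination."""
    best = (-1.0, None)
    for psm, oem in itertools.product([3, 4, 6, 11], [1, 3]):
        score = jaccard(ocr(image_path, psm, oem), expected)
        best = max(best, (score, (psm, oem)))
    return best  # (best_score, (psm, oem))
```

An LLM in a REPL could drive this loop per document type (invoices, scans, tables), proposing new grids when the best score plateaus, instead of a human hand-tuning `--psm` values by trial and error.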