krackers
11 hours ago
The paper is more interesting than just another VLM for OCR, they start talking about compression and stuff. E.g. there is this quote
>Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required to decode text tokens. The preliminary results are encouraging: DeepSeek-OCR achieves near-lossless OCR compression at approximately 10× ratios, while 20× compression still retains 60% accuracy.
(I guess you could say a picture token is worth 10 textual tokens...)
Could someone explain to a noob what the information-theoretic intuition is here? Why does this work: is it that text tokens are still too "granular"/repetitive and don't come close to ideal entropy coding? Or is switching to vision tokens escaping the limitation of working "one word-ish at a time", allowing you to get closer to entropy (similar to the way that arithmetic coding does compared to Huffman codes)?
And then they start talking about handling long-context by literally(?) downscaling images, forming a correspondence between information loss in the textual domain and the image domain.
miki123211
6 hours ago
Text tokens are quantized and represent subword units; vision tokens only exist in the embedding space.
The way text tokenization works in LLMs is that you have a "lookup table" of (small) token ids to (large) vector embeddings. To pass text to the LLM, you split it at token boundaries, convert strings to token ids, and then construct the "context", a matrix where each row is a vector taken from that lookup table.
Transmitting text token sequences can be relatively efficient: you just transmit the token IDs themselves[1]. They're small integers (~100k possible token ids is typical for large models). Transmitting the actual embeddings matrix would be far less efficient, as embeddings often consist of thousands of floating point numbers.
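As a toy sketch of that text-side pipeline (the vocabulary, ids, and embedding width below are all made up for illustration):

```python
import numpy as np

# Toy "lookup table": one embedding row per token id.
vocab = {"Deep": 0, "Seek": 1, "-": 2, "OCR": 3}   # hypothetical subword vocabulary
d_model = 8                                        # real models use thousands of dims
embedding_table = np.random.randn(len(vocab), d_model)

# "Tokenize": split text at token boundaries and map strings -> small integer ids.
token_ids = [vocab[t] for t in ["Deep", "Seek", "-", "OCR"]]   # [0, 1, 2, 3]

# Build the context: one embedding row per token id, looked up from the table.
context = embedding_table[token_ids]               # shape (4, d_model)

# Transmitting the sequence only needs the small ids (a few bytes each),
# not the full (4 x d_model) float matrix.
print(token_ids, context.shape)
```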
Images are encoded differently. After some basic preprocessing, image data is passed straight to a neural-network-based image encoder. That encoder encodes the image into vectors, which are then appended to the context. There are no token ids, there's no lookup table, we go straight from image data to token embeddings.
This means transmitting image tokens cannot be done as efficiently, as you'd have to transmit the embeddings themselves. Even though an image is encoded in fewer tokens, the most efficient representation of those tokens takes more bytes.
You can think of a text token as an integer between 0 and n, which we know how to map to a vector. This means you have `n` possible choices of tokens. In contrast, an image token is an array of m floating point numbers (the vector itself), each of which can take on many possible values. This means the "token space" of vision tokens is actually much larger.
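Back-of-the-envelope numbers for that gap (vocabulary size, embedding width, and float precision here are all illustrative):

```python
import math

vocab_size = 100_000          # typical order of magnitude for large models
d_model = 4096                # illustrative embedding width
bytes_per_float = 2           # e.g. float16

# A text token is one choice out of vocab_size -> ~17 bits of information.
bits_per_text_token = math.log2(vocab_size)            # ~16.6

# A vision token is the d_model-dim vector itself -> far more raw capacity.
bits_per_vision_token = d_model * bytes_per_float * 8  # 65,536 bits at float16

print(bits_per_text_token, bits_per_vision_token)
```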
There's also the issue of patterns. Text tokens correspond directly to a contiguous span of UTF-8 bytes, and most tokenizers won't create tokens that span word boundaries. This means they can't encode global patterns efficiently. You can't have a "Hamlet's monologue" or "the text that follows is in Spanish" token.
lubesGordi
42 minutes ago
So in terms of OCR, does the neural network 'map' the words into an embedding directly, or is it getting a bunch of words like "Hamlet's monologue" and mapping that to an embedding? Basically what I'm asking is if the neural network image encoder is essentially doing OCR 'internally' when it is coming up with the embedding (if that makes any sense).
isaacfung
an hour ago
Some models use vector-quantized variational autoencoders (VQ-VAEs) to discretize images into sequences of discrete symbols from a fixed codebook.
https://grok.com/share/bGVnYWN5LWNvcHk%3D_572b4955-6265-4210...
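A minimal sketch of that quantization step (codebook size and dimensions are arbitrary, and in a real VQ-VAE the codebook is learned jointly with the encoder): each continuous encoder output gets snapped to its nearest codebook entry, so only a small integer index per patch needs to be stored.

```python
import numpy as np

codebook_size, d = 512, 64                       # illustrative codebook
codebook = np.random.randn(codebook_size, d)     # learned during training in practice

def quantize(z):
    """Map continuous encoder outputs (n, d) to discrete codebook indices (n,)."""
    # Squared Euclidean distance from every vector to every codebook entry.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

z = np.random.randn(10, d)         # stand-in for encoder outputs over 10 image patches
indices = quantize(z)              # 10 small integers -- a discrete "token" per patch
reconstructed = codebook[indices]  # the decoder only ever sees the quantized vectors
print(indices)
```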
rco8786
6 hours ago
Great explanation, thanks. I was surprised to hear that models still only work with vocabularies of ~100k tokens, but after giving it some thought it makes sense. There's only so many words/subword units that get used in any given language. The entropy comes from all the billions of different ways those subwords can be ordered.
jerf
2 hours ago
Textual language is really, really amazing if you sit down and think about what it does versus the resources it consumes to do it.
It's a common pastime for programmers to claim that our textual programming languages are just terrible and need to be replaced somehow with something visual, but I think this very often comes from a place of not understanding just how amazing textual languages are. Not that they couldn't possibly be improved on in at least some domains, and there are after all some successful niches for visual languages, but I think if you set out to wholesale replace textual languages without an understanding of and appreciation for the impressive nature of the competition they offer, you're setting yourself up to fail.
freeqaz
4 hours ago
There is also a tradeoff between different vocabulary sizes (how many entries exist in the token -> embedding lookup table) that inform the current shape of tokenizers and LLMs. (Below is my semi-armchair stance, but you can read more in depth here[0][1].)
If you tokenized at the character level ('a' -> embedding) then your vocabulary size would be small, but you'd need more tokens to represent most content. (And attention compute scales quadratically with context length, roughly n^2.) This would also be a bit more 'fuzzy' in terms of teaching the LLM to understand what a specific token should 'mean'. The letter 'a' appears in a _lot_ of different words, and it's more ambiguous for the LLM.
On the flip side: What if you had one entry in the tokenizer's vocabulary for each word that existed? Well, it'd be far more than the ~100k entries used by popular LLMs, and that has computational tradeoffs: when you calculate the probability of the 'next' token via softmax, you have to produce a logit for every entry in the vocabulary, and certain layers within the LLM (the embedding table and output projection) grow with vocabulary size (more memory + compute, basically).
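A rough sketch of that tradeoff (every number below is illustrative): smaller vocabularies mean longer token sequences, while larger vocabularies blow up the embedding table and the output softmax.

```python
d_model = 4096             # illustrative embedding width
text_chars = 1_000_000     # amount of raw text to encode

for name, vocab_size, chars_per_token in [
    ("character-level", 256, 1),
    ("subword / BPE", 100_000, 4),        # ~4 chars per token is a common rule of thumb
    ("one token per word", 1_000_000, 6),
]:
    seq_len = text_chars // chars_per_token
    # The embedding table and the output (softmax) projection both scale with vocab_size.
    table_params = vocab_size * d_model
    print(f"{name:18s} sequence length = {seq_len:>9,}  table params = {table_params:>13,}")
```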
Additionally, you run into a new problem: 'Rare Tokens'. Basically, if you have infinite tokens, you'll run into specific tokens that only appear a handful of times in the training data and the model is never able to fully imbue the tokens with enough meaning for them to _help_ the model during inference. (A specific example being somebody's username on the internet.)
Fun fact: These rare tokens, often called 'Glitch Tokens'[2], have been used for all sorts of shenanigans[3] as humans learn to break these models. (This is my interest in this as somebody who works in AI security)
As LLMs have improved, models have pushed towards the largest vocabulary they can get away with without hurting performance. This is about where my knowledge on the subject ends, but there have been many analyses done to try to compute the optimal vocabulary size. (See the links below)
One area that I have been spending a lot of time thinking about is what tokenization looks like if we start trying to represent 'higher order' concepts without using human vocabulary for them. One example being: tokenizing on LLVM bitcode (to represent code more 'densely' than UTF-8) or directly against the final layers of state in a small LLM (trying to use a small LLM to 'grok' the meaning and hoist it into a more dense, almost compressed latent space that the large LLM can understand).
It would be cool if Claude Code, when it's talking to the big, non-local model, was able to make an MCP call to a model running on your laptop to say 'hey, go through all of the code and give me the general vibe of each file, then append those tokens to the conversation'. It'd be a lot fewer tokens than just directly uploading all of the code, and it _feels_ like it would be better than uploading chunks of code based on regex like it does today...
This immediately makes the model's inner state (even more) opaque to outside analysis though. e.g., like why using gRPC as the protocol for your JavaScript front-end sucks: Humans can't debug it anymore without other tooling. JSON is verbose as hell, but it's simple and I can debug my REST API with just network inspector. I don't need access to the underlying Protobuf files to understand what each byte means in my gRPC messages. That's a nice property to have when reviewing my ChatGPT logs too :P
Exciting times!
0: https://www.rohan-paul.com/p/tutorial-balancing-vocabulary-s...
1: https://arxiv.org/html/2407.13623v1
2: https://en.wikipedia.org/wiki/Glitch_token
3: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...
rco8786
3 hours ago
Again, super interesting thanks!
> One area that I have been spending a lot of time thinking about is what tokenization looks like if we start trying to represent 'higher order' concepts without using human vocabulary for them. One example being: tokenizing on LLVM bitcode (to represent code more 'densely' than UTF-8)
I've had similar ideas in the past. High level languages that humans write are designed for humans. What does an "LLM native" programming language look like? And, to your point about protobufs vs JSON, how does a human debug it when the LLM gets stuck?
> It would be cool if Claude Code, when it's talking to the big, non-local model, was able to make an MCP call to a model running on your laptop to say 'hey, go through all of the code and give me the general vibe of each file, then append those tokens to the conversation'. It'd be a lot fewer tokens than just directly uploading all of the code, and it _feels_ like it would be better than uploading chunks of code based on regex like it does today...
That's basically the strategy for Claude's new "Skills" feature, just in a more dynamic/AI driven way. Claude will do semantic search through YAML frontmatter to determine what skill might be useful in a given context, then load that entire skill file into context to execute it. Your idea here is similar, use a small local model to summarize each file (basically dynamically generate that YAML front matter), feed those into the larger model's context, and then it can choose which file(s) it cares about based on that.
ttul
3 hours ago
This is a great summary. If you think about it a bit, text is an expanded representation of concepts meant for display on a two-dimensional surface that can then be read back by human eyes; our brains convert the two-dimensional information into concepts again.
So to me it’s not a surprise that you can transform the two-dimensional representation of the same information into concepts again without losing much.
The paper talks about using this approach to generate large amounts of LLM training data rapidly. That’s intriguing. It suggests that one of the best ways of training models on a wide variety of input data with very long context is to provide it with an image representation instead of text tokens.
miki123211
2 hours ago
Text is actually one-dimensional; writing is two-dimensional.
To a pure LLM, characters 15 and 16 at line 1 are considered adjacent, but there's no relationship between character 15 of line 1 and character 15 of line 2.
To a vision model (which sees text as squiggles, not UTF-8 codepoints), such a relationship does exist.
jph00
3 hours ago
Actually there are VAEs which use a codebook approach to creating discrete tokens instead of float vectors. There has been some success in that direction in diffusion models for instance.
ssivark
2 hours ago
Surely the appropriate ratio depends on the resolution of each character relative to the size of the vision token patch? That is the only way the number of text tokens needed to describe the OCR output can be independent of the resolution of the image (as it should be).
HarHarVeryFunny
4 hours ago
I don't know if there is any common practice among multi-modal input "LLM"s as to how they encode image inputs into "vision tokens", but it's basically going to come down to splitting the image into a grid of regions and encoding those.
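A ViT-style sketch of that "grid of regions" idea (patch size, projection, and dimensions are illustrative; real encoders, including the one in the paper, add more machinery on top):

```python
import numpy as np

H, W, C, P = 224, 224, 3, 16                 # image size and patch size (illustrative)
image = np.random.rand(H, W, C)

# Cut the image into non-overlapping P x P patches and flatten each one.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)     # (196, 768): one row per patch

# A learned linear projection turns each flattened patch into a "vision token".
d_model = 1024
W_proj = np.random.randn(P * P * C, d_model) * 0.02
vision_tokens = patches @ W_proj             # (196, d_model), appended to the context
print(vision_tokens.shape)
```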
I'm not sure there's any information-theoretic intuition to be had with DeepSeek's experiments - it seems to be more about what's the lowest image resolution/grid you can get away with and still capture enough detail to be able to accurately perform OCR on it.
It'd be cool if Karpathy would extend his NanoChat to be multi-modal to spread the knowledge of how this is typically done.
runeblaze
9 hours ago
Each text token is often a subword unit, but in VLMs the visual tokens are in semantic space. Semantic space obviously compresses much more than subword slices.
disclaimer: not an expert, off the top of my head
looobay
11 hours ago
LLMs are compute heavy, with compute scaling quadratically in the number of tokens. They are trying to compress text tokens into visual tokens with their VLM.
Maybe they would render text to an image before tokenizing to reduce the compute cost.
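A toy illustration of the compute angle (the ~10x figure is from the paper; the token count is made up): with self-attention scaling quadratically in sequence length, a 10x shorter sequence cuts the attention cost by roughly 100x.

```python
text_tokens = 10_000                # made-up document length
vision_tokens = text_tokens // 10   # the ~10x compression ratio reported in the paper

# Self-attention cost grows with the square of the sequence length.
print((text_tokens ** 2) / (vision_tokens ** 2))   # -> 100.0
```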
krackers
10 hours ago
But naively wouldn't you expect the representation of a piece of text in terms of vision tokens to be roughly the same number of bits as (or more than) its representation as text tokens? You're changing representation, sure, but that by itself doesn't give you any compute advantages unless there is some sparsity/compressibility you can take advantage of in the domain you transform to, right?
So I guess my question is: where is the juice being squeezed from? Why does the vision token representation end up being more efficient than text tokens?
HarHarVeryFunny
an hour ago
A text token generally represents a portion of a single word, while a vision token represents a portion of the entire page, which may include multiple words. This is where the "compression factor" comes from.
The number of bits to represent a text or vision token is the same, since they are both represented as embeddings of a fixed number of dimensions defined by the Transformer (maybe a few thousand for a large SOTA model).
Whether a vision token actually contains enough information to accurately extract (OCR) all the text data from that portion of the image is going to depend on how many pixels that vision token represents and how many words were present in that area of the image. It's just like considering images of the same page of text at different resolutions - a 1024x1024 image vs a 64x64 one, etc. As the resolution decreases, so does OCR accuracy. At some point the resolution is insufficient, the words become a blurry mess, and OCR accuracy suffers.
This is what DeepSeek are reporting - OCR accuracy if you try to use a single vision token to represent, say, 10 text tokens, vs 20 text tokens. The vision token may have enough resolution to represent 10 tokens well, but not enough for 20.
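Rough arithmetic for a single page (the text-token count is an assumption, not the paper's): the fewer vision tokens you spend on the page, the more text each one has to carry, and the paper reports near-lossless OCR around ~10 text tokens per vision token with accuracy degrading around ~20.

```python
text_tokens_on_page = 1000        # assumed: a dense page of text

# Different vision-token budgets for the same page (i.e. different effective resolutions).
for n_vision_tokens in (1000, 200, 100, 50):
    ratio = text_tokens_on_page / n_vision_tokens
    print(f"{n_vision_tokens:>5} vision tokens -> ~{ratio:.0f} text tokens per vision token")
```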
f33d5173
10 hours ago
Vision is how humans see text. So text must have built-in adaptations to protect against visual noise. For example, two words that look similar must never appear in similar contexts, or else they would be conflated. Hence we can safely reduce such words to the same token. Or something like that.
fxtentacle
6 hours ago
That also works purely on text and it's the trick I used in my German speech recognition engine ( https://arxiv.org/abs/2206.12693 ).
"I'm studying at Oxford Univ" has basically no loss in meaning even though "University" was truncated to less than half its characters.
UltraSane
2 hours ago
This is like how many CLIs accept the shortest unique version of commands.
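A small sketch of that idea, computing the shortest unambiguous prefix for each command in a toy list:

```python
def shortest_unique_prefixes(words):
    """For each word, find the shortest prefix shared with no other word."""
    result = {}
    for w in words:
        others = [o for o in words if o != w]
        for i in range(1, len(w) + 1):
            prefix = w[:i]
            if not any(o.startswith(prefix) for o in others):
                result[w] = prefix
                break
        else:
            result[w] = w          # word is a prefix of another word; keep it whole
    return result

print(shortest_unique_prefixes(["commit", "config", "checkout", "clone"]))
# {'commit': 'com', 'config': 'con', 'checkout': 'ch', 'clone': 'cl'}
```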
ffsm8
7 hours ago
Is that really factual/true?
Lots of words have multiple meanings and can mean different things even if used in the same sentence/context, just from the interpretation of the person reading it.
Heck, I'd argue that most (not all) dayjob conflicts are down to such differences in interpretation/miscommunication.
psb217
9 hours ago
The trick is that the vision tokens are continuous valued vectors, while the text tokens are elements from a small discrete set (which are converted into continuous valued vectors by a lookup table). So, vision tokens can convey significantly more bits per token than text tokens. This allows them to pack the content of multiple text tokens into a single vision token.
imjonse
10 hours ago
I wonder if text written using Chinese characters is more compatible with such vision-centric compression than Latin text.
looobay
10 hours ago
Vision tokens are a good compression medium because with one vision token you have one vector of N elements, whereas with textual tokens you have M vectors of N elements - one vision token represents multiple pixels (and possibly multiple words). This is why it's a good compression medium for compute.
It will never be as precise as textual tokens, but it can be really good, as they show in the paper.
krackers
10 hours ago
>with one vision token you have one vector of N elements, but with textual tokens you have M vectors of N elements
Each vision token represents a 16x16 patch, but to fully cover a word you might need multiple vision tokens. So assuming that the embedding size of the vision token and text token is the same `d` (which I think has to be the case for multimodal models), then wouldn't the fair comparison be `x * d` elements for a sentence in terms of vision tokens, and `y * d` for the same sentence in terms of text tokens? I don't see how you could see a priori that x << y (especially by a factor of 10 as quoted in the paper).
That said, if I do experimentally try this by shrinking this very comment down to the smallest font size I can read it at, then seeing how many 16x16 tokens it takes, you can fit more text than I expected in each "vision token". So I can maybe buy that x is at least not greater than y. But it can't be as simple as "each vision token can cover more text", since that only enables better compression if the encoder can actually uncover some sort of redundancy within each token. (And presumably the type of redundancy it uncovers probably isn't something that "classical" compression techniques can exploit, otherwise it seems like it would have been tried by now?).
looobay
10 hours ago
You should read page 6 of the paper (and page 5 for the architecture breakdown); they show that they compress the vision tokens with convolution to keep strong semantic understanding while keeping the number of tokens small.
But I think it's still experimental.
numpad0
9 hours ago
just a hunch, but maybe something to do with Unicode?