wohoef
7 months ago
In my experience LLMs have a hard time working with text grids like this. They seem to find columns harder to “detect” than rows, probably because the input is presented to the model as one giant row, if that makes sense.
It has the same problem playing chess. But I’m not sure there is a data format it could work with for this kind of game. Currently it seems more like LLMs can’t really work on spatial problems. That said, this should actually be fixable (pretty sure I saw an article about it on HN recently).
fi-le
7 months ago
Good point. The architectural solution that would come to mind is 2D text embeddings, i.e. we add 2 sines and cosines to each token embedding instead of 1. Apparently people have done it before: https://arxiv.org/abs/2409.19700v2
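The idea can be sketched concretely. This is a minimal illustration (my own layout, not taken from the linked paper) of giving each token a 2D sinusoidal position embedding: half the channels encode the row index and half the column index, instead of one set of sines/cosines over a 1D position.

```python
import numpy as np

def pos_embedding_2d(rows, cols, dim):
    """Hypothetical 2D sinusoidal position embedding: the first half of the
    channels encodes the row, the second half the column."""
    assert dim % 4 == 0
    half = dim // 2

    def axis_embedding(n, d):
        # standard 1D sinusoidal embedding along one axis
        pos = np.arange(n)[:, None]                                 # (n, 1)
        freq = np.exp(-np.log(10000.0) * np.arange(0, d, 2) / d)    # (d/2,)
        ang = pos * freq                                            # (n, d/2)
        return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)  # (n, d)

    row_emb = axis_embedding(rows, half)
    col_emb = axis_embedding(cols, half)
    # each grid cell (r, c) gets [row part | column part]
    grid = np.zeros((rows, cols, dim))
    grid[:, :, :half] = row_emb[:, None, :]
    grid[:, :, half:] = col_emb[None, :, :]
    return grid.reshape(rows * cols, dim)  # flattened back to token order

emb = pos_embedding_2d(8, 8, 64)
print(emb.shape)  # (64, 64): 64 tokens, 64-dim embedding each
```

Two tokens in the same column then share half their position embedding, so "same column" becomes directly visible to attention rather than something to infer from index arithmetic.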
ninjha
7 months ago
I think I remember one of the original ViT papers saying something about 2D embeddings on image patches not actually increasing performance on image recognition or segmentation, so it’s kind of interesting that it helps with text!
E: I found the paper: https://arxiv.org/pdf/2010.11929
> We use standard learnable 1D position embeddings, since we have not observed significant performance gains from using more advanced 2D-aware position embeddings (Appendix D.4).
Although it looks like that was just ImageNet so maybe this isn't that surprising.
yorwba
7 months ago
They seem to have used a fixed input resolution for each model, so the learnable 1D position embeddings are equivalent to learnable 2D position embeddings where every grid position gets its own embedding. It's when different images may have a different number of tokens per row that the correspondence between 1D index and 2D position gets broken and a 2D-aware position embedding can be expected to produce different results.
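The equivalence is easy to see with a toy index calculation (my own illustration): at a fixed grid width, the 1D token index determines the 2D position uniquely, so a learnable 1D table is just a relabeled 2D table; once the width varies, the same 1D index lands at different 2D positions.

```python
# With a fixed number of tokens per row, idx <-> (row, col) is a bijection,
# so a learnable embedding per 1D index is equivalent to one per grid cell.
def to_2d(idx, cols):
    return (idx // cols, idx % cols)

# fixed 14x14 patch grid (e.g. a 224px image with 16px patches)
print(to_2d(20, cols=14))  # (1, 6)

# the same 1D index in a wider image is a different 2D position,
# so one 1D embedding would have to stand for two different cells
print(to_2d(20, cols=20))  # (1, 0)
```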
froobius
7 months ago
Transformers can easily be trained or designed to handle grids; it's just that off-the-shelf LLMs haven't been trained for it in particular (although they will have seen some grid data).
nine_k
7 months ago
Are there some well-known examples of success in it?
thethimble
7 months ago
Vision transformers effectively encode a grid of pixel patches. It’s ultimately a matter of ensuring the position encoding incorporates both the X and Y position.

For LLMs we only have one axis of position and, more importantly, the vast majority of training data is oriented only in this way.
stavros
7 months ago
If this were a limitation in the architecture, they wouldn't be able to work with images, no?
hnlmorg
7 months ago
LLMs don’t work with images.
stavros
7 months ago
They do, though.
hnlmorg
7 months ago
Do they? I thought it was completely different models that did image generation.
LLMs might be used to translate requests into keywords, but I didn’t think LLMs themselves did any of the image generation.
Am I wrong here?
stavros
7 months ago
Yes, that's why ChatGPT can look at an image and change the style, or edit things in the image. The image itself is converted to tokens and passed to the LLM.
hnlmorg
7 months ago
LLMs can be used as an agent to do all sorts of clever things, but it doesn’t mean the LLM is actually handling the original data format.
I’ve created MCP servers that can scrape websites but that doesn’t mean the LLM itself can make HTTP calls.
The reason I make this distinction is that someone claimed LLMs can read images. But they don’t: they act as an agent for another model that reads images and produces metadata from them. The LLM then turns that metadata into natural language.

The LLM itself doesn’t see any pixels. It sees textual information that another model has provided.
Edit: reading more about this online, it seems LLMs can work with pixel level data. I had no idea that was possible.
My apologies.
stavros
7 months ago
No problem. Again, if it happened the way you described (which it did, until GPT-4o recently), the LLM wouldn't have been able to edit images. You can't get a textual description of an image and reconstruct it perfectly just from that, with one part edited.
tomalbrc
7 months ago
We have been able to edit images since Stable Diffusion.