Text or pixels? On the token efficiency of visual text inputs in multimodal LLMs

2 points, posted 3 months ago
by hhs