Text or pixels? On the token efficiency of visual text inputs in multimodal LLMs

2 points, posted 19 hours ago
by hhs
