Omnivision-968M: Vision Language Model with 9x Tokens Reduction for Edge Devices

69 points, posted a year ago
by BUFU

14 Comments

jsjohnst

a year ago

Need to try this directly before passing judgement, but if the quality lives up to the examples with this low a resource requirement, it could unlock a few project ideas I have.

gizajob

a year ago

Its description of the art piece is so awful.

alanzhuly

a year ago

Hi! I am from Nexa AI. We just improved Omnivision-968M based on your feedback! Here is a preview in our Hugging Face Space: https://huggingface.co/spaces/NexaAIDev/omnivlm-dpo-demo

The updated GGUF and safetensors will be released after final alignment tweaks. Please feel free to let us know if there's any other feedback!

gizajob

a year ago

Why don’t you just hand-write the descriptions and then your AI won’t have to.

ImageXav

a year ago

I thought the same, but the description of the cat picture is pretty spot on. I wonder if this is a dataset issue. Cat pictures are far more prevalent than abstract art on the internet so might well be overrepresented. Can Vision LLMs deal with a long tail of underrepresented objects when small? Or can they only do so at scale?

throwaway314155

a year ago

Can GitHub please acquire all these model-hub companies like fal, replicate, ollama, hf, and checks notes "nexa.ai"? That way we can get past the inevitable fragmentation and the ultimate breaking of everyone's workflow w.r.t. ML-oriented dev ops.

gessha

a year ago

When faced with a diversity of implementations, why is the go-to "let's have a corporate entity acquire them all" instead of "let's come up with a good runtime standard"? The company is going to do the same thing anyway, except with the additional risk of messing up the API and throwing away the hard work of so many people.

croes

a year ago

You want everything under the control of Microsoft?