Omnivision-968M: Vision Language Model with 9x Tokens Reduction for Edge Devices

69 points, posted a year ago

by BUFU

(nexa.ai)

14 Comments

nighthawk454

a year ago

Easy to try here: https://huggingface.co/spaces/NexaAIDev/omnivlm-dpo-demo

https://i.imgur.com/44XYyXU.png

TacticalCoder

a year ago

I saw a turntable at a shop recently and my inner classifier went: "Oh a DSTOM turntable, that's sweet!"

https://www.project-audio.com/en/product/the-dark-side-of-th...

I was kinda expecting the model in your picture to make the link with the album cover.

jsjohnst

a year ago

I need to try this directly before passing judgement, but it could unlock a few project ideas I have if the quality lives up to the examples with resource requirements this low.

gizajob

a year ago

Its description of the art piece is so awful.

alanzhuly

a year ago

Hi! I am from Nexa AI. We just improved Omnivision-968M based on your feedback! Here is a preview in our Hugging Face Space: https://huggingface.co/spaces/NexaAIDev/omnivlm-dpo-demo

The updated GGUF and safetensors will be released after final alignment tweaks. Please feel free to let us know if there's any other feedback!

gizajob

a year ago

Why don’t you just hand-write the descriptions? Then your AI won’t have to.

I thought the same, but the description of the cat picture is pretty spot on. I wonder if this is a dataset issue. Cat pictures are far more prevalent than abstract art on the internet so might well be overrepresented. Can Vision LLMs deal with a long tail of underrepresented objects when small? Or can they only do so at scale?

throwaway314155

a year ago

Can GitHub please acquire all these model-hub companies like fal, replicate, ollama, hf, and (checks notes) "nexa.ai"? That way we can get past the inevitable fragmentation and the eventual breaking of everyone's workflow w.r.t. ML-oriented dev ops.

gessha

a year ago

When faced with a diversity of implementations, why is the go-to “let’s have a corporate entity acquire them all” instead of “let’s come up with a good runtime standard”? The company is going to do the same thing anyway, except with the additional risk of messing up the API and throwing away the hard work of so many people.
