simonw
a day ago
I'm absolutely amazed at how capable the new 1B model is, considering it's just a 1.3GB download (for the Ollama GGUF version).
I tried running a full codebase through it (since it can handle 128,000 tokens) and asking it to summarize the code - it did a surprisingly decent job, incomplete but still unbelievable for a model that tiny: https://gist.github.com/simonw/64c5f5b111fe473999144932bef42...
More of my notes here: https://simonwillison.net/2024/Sep/25/llama-32/
I've been trying out the larger image models using the versions hosted on https://lmarena.ai/ - navigate to "Direct Chat", select them from the dropdown, and upload images to run prompts.
faangguyindia
14 hours ago
If you are in the US, you get 1 billion tokens a DAY with Gemini (Google), completely free of cost.
Gemini Flash is fast, with up to a 4 million token context.
Gemini Flash 002 improved in math and logical abilities, surpassing Claude and GPT-4o.
You can simply use Gemini Flash for code completion, a git review tool, and much more.
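For anyone curious what that looks like in practice, here's a minimal sketch of calling Gemini Flash over REST with only the standard library. It assumes the v1beta `generateContent` endpoint and the `gemini-1.5-flash` model name; check the current docs before relying on either.

```python
import json
import urllib.request

API_URL = ("https://generativelanguage.googleapis.com/v1beta/models/"
           "gemini-1.5-flash:generateContent")

def build_payload(prompt: str) -> dict:
    """Build the generateContent request body for a single-turn prompt."""
    return {"contents": [{"parts": [{"text": prompt}]}]}

def generate(prompt: str, api_key: str) -> str:
    """POST the prompt to the Gemini API and return the first candidate's text."""
    req = urllib.request.Request(
        f"{API_URL}?key={api_key}",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["candidates"][0]["content"]["parts"][0]["text"]
```

With a free-tier API key from AI Studio, `generate("Review this diff: ...", key)` is all a basic git review hook needs.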
a2128
11 hours ago
Is this sustainable, though, or are they just trying really hard to attract users? If I build all of my tooling on it, will they start charging me thousands of dollars next year once the subsidies dry up? With a local model running on open-source software, at least I know that as long as my computer can still compute, the model will still run just as well and just as fast as it did on day 1, and cost the same amount of electricity.
QuinnyPig
6 hours ago
Facts. Google did the same thing you describe with Maps a few years ago.
moffkalast
5 hours ago
It's not just Google; literally every new service does this. Prices will always go up once they have enough customers and the bean counters start pointing at spreadsheets. Ergo, local is the only option if you don't want to be held for ransom afterwards. As goes for web servers, scraper bots, and whatever, so goes for LLMs.
phillipcarter
8 hours ago
I think there are a few things to consider:
They make a ton of money on large enterprise package deals through Google Cloud. That includes API access but also support and professional services. Most orgs that pay for this stuff don't really need it, but they buy it anyways, as is consistent with most enterprise sales. That can give Google a significant margin to make up the cost elsewhere.
Gemini Flash is probably super cheap to run compared to other models. The cost of inference for many tasks has gone down tremendously over the past 1.5 years, and it's still going down. Every economic incentive aligns with running these models more efficiently.
4ndrewl
9 hours ago
It's Google. You know the answer ;)
rl3
9 hours ago
I mean, there’s no need to dry up subsidies when the underlying product can just be deprecated without warning.
rcpt
7 hours ago
Aren't API calls essentially swappable between vendors now?
If you wanted to switch from Gemini to ChatGPT, you could copy/paste your code into ChatGPT and ask it to switch to their API.
Disclaimer I work at Google but not on Gemini
snek_case
6 hours ago
Different APIs and models are going to come with different capabilities and restrictions.
zitterbewegung
7 hours ago
Not the tokens allowed per user. Google has the largest token windows.
zitterbewegung
7 hours ago
Run test queries on all platforms using something like litellm [1] and langsmith [2].
You may not be able to match large queries, but testing will help you transition to other services.
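A rough sketch of that kind of cross-provider test harness, assuming litellm's `completion(model=..., messages=...)` interface and the OpenAI-style response shape it normalizes to (model names below are just examples):

```python
def run_suite(models, prompt, completion_fn):
    """Run one test prompt across several providers; return {model: answer}.

    `completion_fn` is litellm.completion in real use; it's injected here so
    the harness itself can be exercised without any API keys.
    """
    results = {}
    for model in models:
        resp = completion_fn(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        # litellm normalizes every provider to the OpenAI response shape
        results[model] = resp["choices"][0]["message"]["content"]
    return results

# Real usage (requires keys configured for each provider):
#   from litellm import completion  # pip install litellm
#   run_suite(["gemini/gemini-1.5-flash", "gpt-4o-mini"],
#             "Summarize: def add(a, b): return a + b", completion)
```

Running the same prompt suite on every candidate provider makes the eventual migration a config change instead of a rewrite.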
faangguyindia
8 hours ago
Google has deep pockets and SOTA hardware for training and inference.
stavros
10 hours ago
Are you asking whether giving away $5/day/user (what OpenAI charges) in compute is sustainable?
nycdatasci
12 hours ago
This is great for experimentation, but as others have pointed out recently, there are persistent issues with Gemini that prevent use in actual products. The recitation/self-censoring issue results in random failures.
faangguyindia
9 hours ago
I had this problem too, but 002 solves it, I think (not tested exhaustively). I've not run into any problems since 002, and Vertex with "block all" on all safety settings is now working fine; earlier, "block all" in the safety settings made the API throw errors.
I am using it in https://github.com/zerocorebeta/Option-K (currently it doesn't have the lowest safety settings because the API wouldn't allow it, but now I am going to push a new update with safety disabled).
Why do I think it's fixed? Another application of mine has been working since yesterday's 002 launch. With safety settings set to none it used to refuse certain questions, but since yesterday it answers everything.
o11c
8 hours ago
And yet - if Gemini actually bothers to tell you when it detects verbatim copying of copyrighted content, how often must that occur on other AIs without notice?
Deathmax
8 hours ago
The free tier API isn't US-only, Google has removed the free tier restriction for UK/EEA countries for a while now, with the added bonus of not training on your data if making a request from the UK/CH/EEA.
airspresso
6 hours ago
Free of cost != free open model. Free of cost means all your requests are logged for Google to use as training data and whatnot.
Llama 3.2, on the other hand, runs locally, and no data is ever sent to a third party. I can freely use it to summarize all my notes, regardless of whether one of them is from my most recent therapy session and another is my thoughts on how to solve a delicate problem involving politics at work. I don't need to pre-classify all the input to make sure it's safe to share. Same with images: I can use Llama 3.2 11B locally to interpret any photo I've taken without worrying about getting consent from the people in the photo to share it with a third party, or whether the photo is of my passport for some application I had to file, or a receipt of something I bought that I don't want Google to train their next vision-model OCR on.
TL;DR - Google free of cost models are irrelevant when talking about local models.
hobofan
10 hours ago
Not locked to the US, you get 1 billion tokens per month per model with Mistral since their recent announcement: https://mistral.ai/news/september-24-release/ (1 request per second is quite a harsh rate limit, but hey, free is free)
I'm pretty excited about what all these services adopting free tiers will do to the landscape, as that should allow for a lot more experimentation, and a lot more hobby projects transitioning into full-time projects, which previously felt a lot more risky/unpredictable with pricing.
jackbravo
a day ago
I saw that you mention https://github.com/simonw/llm/. Hadn't seen this before. What is its purpose? And why not use ollama instead?
dannyobrien
a day ago
llm is Simon's command line front-end to a lot of the llm apis, local and cloud-based. Along with aider-chat, it's my main interface to any LLM work -- it works well with a chat model, one-off queries, and piping text or output into a llm chain. For people who live on the command line, or are just put-off by web interfaces, it's a godsend.
About the only thing I need to look further abroad for is when I'm working multi-modally -- I know Simon and the community are mainly noodling over the best command line UX for that: https://github.com/simonw/llm/issues/331
n8henrie
a day ago
I've only used ollama over cli. As per the parent poster -- do you know if there are advantages over ollama for CLI use? Have you used both?
simonw
21 hours ago
Ollama can’t talk to OpenAI / Anthropic / etc. LLM gives you a single interface that can talk to both hosted and local models.
It also logs everything you do to a SQLite database, which is great for further analysis.
I use LLM and Ollama together quite a bit, because Ollama are really good at getting new models working and their server keeps those models in memory between requests.
wrsh07
20 hours ago
You can run llamafile as a server, too, right? Still need to download gguf files if you don't use one of their premade binaries, but if you haven't set up llm to hit the running llamafile server I'm sure that's easy to do
dannyobrien
a day ago
I haven't used Ollama, but from what I've seen, it seems to operate at a different level of abstraction compared to `llm`. I use `llm` to access both remote and local models through its plugin ecosystem[1]. One of the plugins allows you to use Ollama-served local models. This means you can use the same CLI interface with Ollama[2], as well as with OpenAI, Gemini, Anthropic, llamafile, llamacpp, mlc, and others. I select different models for different purposes. Recently, I've switched my default from OpenAI to Anthropic quite seamlessly.
[1] - https://llm.datasette.io/en/stable/plugins/directory.html#pl... [2] - https://github.com/taketwo/llm-ollama
awwaiid
21 hours ago
The llm CLI is much more unixy, letting you pipe data in and out easily. It can use hosted and local models, including ollama.
SOLAR_FIELDS
19 hours ago
I use a fair amount of aider - what does Simon's solution offer that aider doesn't? I am usually using a mix of aider and the ChatGPT window. I use ChatGPT for one off queries that aren't super context heavy for my codebase, since pricing can still add up for the API and a lot of the times the questions that I ask don't really need deep context about what I'm doing in the terminal. But when I'm in flow state and I need deep integration with the files I'm changing I switch over to aider with Sonnet - my subjective experience is that Anthropic's models are significantly better for that use case. Curious if Simon's solution is more geared toward the first use case or the second.
skybrian
18 hours ago
The llm command is a general-purpose tool for writing shell scripts that use an llm somehow. For example, generating some llm output and sending it though a Unix pipeline. You can also use it interactively if you like working on the command line.
It’s not specifically about chatting or helping you write code, though you could use it for that if you like.
flakiness
5 hours ago
There is a recent podcast episode with the tool's author https://newsletter.pragmaticengineer.com/p/ai-tools-for-soft...
It's worth listening to for context on how that tool is used.
jerieljan
a day ago
It looks like a multi-purpose utility in the terminal for bridging together the terminal, your scripts or programs to both local and remote LLM providers.
And it looks very handy! I'll use this myself, because I do want to invoke OpenAI and other cloud providers just like I do with ollama, piping things around, and this accomplishes that and more.
https://llm.datasette.io/en/stable/
I guess you can also accomplish similar results, if you're just looking for `/chat/completions` and such, by configuring something like LiteLLM and connecting it to ollama or any other service.
lowyek
20 hours ago
Hi Simon, is there a way to run the vision model easily on my Mac locally?
simonw
20 hours ago
Not that I’ve seen so far, but Ollama are promising a solution for that “soon”.
v3ss0n
19 hours ago
I doubt the ollama team can do much about it. Ollama is just a wrapper on top of the heavy lifter (llama.cpp).
Patrick_Devine
18 hours ago
The draft PRs are already up in the repo.
theaniketmaurya
12 hours ago
You can run it with LitServe (MPS GPU), here is the code - https://lightning.ai/lightning-ai/studios/deploy-llama-3-2-v...
GaggiX
a day ago
Llama 3.2 vision models don't seem that great if they have to be compared to Claude 3 Haiku or GPT-4o-mini. For an open alternative I would use the Qwen-2-72B model: it's smaller than the 90B and seems to perform quite a bit better. The same goes for Qwen2-VL-7B as an alternative to Llama-3.2-11B: smaller, better on visual benchmarks, and Apache 2.0.
The Molmo models (https://huggingface.co/collections/allenai/molmo-66f379e6fe3...) also seem to perform better than the Llama 3.2 models while being smaller and Apache 2.0.
dannyobrien
a day ago
What interface do you use for a locally-run Qwen2-VL-7B? Inspired by Simon Willison's research[1], I have tried it out on Hugging Face[2]. Its handwriting recognition seems fantastic, but I haven't figured out how to run it locally yet.
[1] https://simonwillison.net/2024/Sep/4/qwen2-vl/ [2] https://huggingface.co/spaces/GanymedeNil/Qwen2-VL-7B
Eisenstein
21 hours ago
MiniCPM-V 2.6 is based on Qwen 2 and is also great at handwriting. It works locally with KoboldCPP. Here are the results I got with a test I just did.
Image:
Output:
* https://pastebin.com/RKvYQasi
OCR script used:
* https://github.com/jabberjabberjabber/LLMOCR/blob/main/llmoc...
Model weights: MiniCPM-V-2_6-Q6_K_L.gguf, mmproj-MiniCPM-V-2_6-f16.gguf
Inference:
* https://github.com/LostRuins/koboldcpp/releases/tag/v1.75.2
jona-f
13 hours ago
Should the line "p.o. 5rd w/ new W5 533" say "p.o. 3rd w/ new WW 5W .533R"?
What does "p.o." stand for? I can't make out the first letter. It looks more like an f, but the notch on the upper left only fits a p. All the other p's look very different, though.
Eisenstein
8 hours ago
'Replaced R436, R430 emitter resistors on right-channel power output board with new wire-wound 5watt .33ohm 5% with ceramic lead insulators'
jona-f
3 hours ago
Thx :). I thought the 3 looked like a b, but didn't think "brd" would make any sense. My reasoning led me astray.
Eisenstein
2 hours ago
Yeah. If you realize that a large part of the LLM's 'OCR' is guessing from context (token prediction) rather than actually recognizing the characters exactly, you can see that it is indeed pretty impressive, because the log it's reading uses pretty unique terminology that it couldn't know from training.
hansoolo
13 hours ago
Thanks for the hint. Will try them out!
f38zf5vdt
a day ago
1. Ignore the benchmarks. I've been A/B-ing 11B today with Molmo 72B [1], which itself has an Elo neck-and-neck with GPT-4o, and it's even. Because everyone in open source tends to train on validation benchmarks, you really cannot trust them.
2. The tokenization/adapter method is novel and uses many fewer tokens than comparable CLIP/SigLIP-adapter models, making it _much_ faster. Attention is O(n^2) in memory/compute with sequence length.
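A back-of-envelope illustration of that quadratic scaling, ignoring heads, layers, and constant factors:

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs self-attention scores for one head/layer.

    Every token attends to every token, so cost grows as seq_len squared.
    """
    return seq_len * seq_len

# Halving the token count quarters the attention work, which is why an
# image adapter that emits fewer tokens speeds things up more than linearly.
full = attention_pairs(2048)
half = attention_pairs(1024)
assert full == 4 * half
```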
espadrine
11 hours ago
> I've been A/Bing 11B today with Molmo 72B
How are you testing Molmo 72B? If you are interacting with https://molmo.allenai.org/, they are using Molmo-7B-D.
sumedh
13 hours ago
I tried some OCR use cases; Claude Sonnet just blows Molmo away.
benreesman
19 hours ago
It’s not just open source that trains on the validation set. The big labs have already forgotten more about gaming MMLU down to the decimal than the open source community ever knew. Every once in a while they get sloppy and Claude does a faux pas with a BIGBENCH canary string or some other embarrassing little admission of dishonesty like that.
A big lab gets exactly the score on any public eval that they want to. They have their own holdouts for actual ML work, and they’re some of the most closely guarded IP artifacts, far more valuable than a snapshot of weights.
GaggiX
a day ago
How does its performance compare to Qwen-2-72B, though?
f38zf5vdt
20 hours ago
Refer to the blog post I linked. Molmo is ahead of Qwen2-72B.
forgingahead
a day ago
What are people using to check the token length of codebases? I'd like to point certain app folders to a local LLM, but I have no idea how that stuff is calculated. Seems like some strategic prompting (e.g.: this is a Rails app, here is the folder structure with file names, and by the way here are the actual files to parse) would be more efficient than just giving it the full app folder? No point giving it stuff from /lib and /vendor for the most part, I reckon.
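For a ballpark answer without any model-specific tokenizer, the old ~4 characters per token rule of thumb gets you in the right order of magnitude (the real ratio varies by tokenizer and by language). A rough sketch that also skips vendor-style directories:

```python
from pathlib import Path

SKIP_DIRS = {"vendor", "lib", "node_modules", ".git"}  # adjust per project
CHARS_PER_TOKEN = 4  # rough heuristic; real tokenizers vary

def estimate_tokens(root: str, exts=(".rb", ".py", ".js")) -> int:
    """Walk a source tree and estimate total tokens from character counts."""
    total_chars = 0
    for path in Path(root).rglob("*"):
        if any(part in SKIP_DIRS for part in path.parts):
            continue  # don't count vendored/third-party code
        if path.is_file() and path.suffix in exts:
            total_chars += len(path.read_text(errors="ignore"))
    return total_chars // CHARS_PER_TOKEN
```

If the estimate lands near the model's context limit, it's worth switching to the model's actual tokenizer for an exact count.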
xyc
5 hours ago
You can use llama.cpp server's tokenize endpoint to tokenize and count the tokens: https://github.com/ggerganov/llama.cpp/blob/master/examples/...
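A minimal sketch of hitting that endpoint from Python, assuming a llama.cpp server running on localhost:8080 and the `{"content": ...}` request / `{"tokens": [...]}` response shape from the linked example:

```python
import json
import urllib.request

def parse_token_count(body: bytes) -> int:
    """Extract the token count from a llama.cpp /tokenize response body."""
    return len(json.loads(body)["tokens"])

def count_tokens(text: str, server: str = "http://localhost:8080") -> int:
    """Ask a running llama.cpp server to tokenize `text`; return the count."""
    req = urllib.request.Request(
        f"{server}/tokenize",
        data=json.dumps({"content": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return parse_token_count(resp.read())
```

This counts with the exact tokenizer of whatever model the server has loaded, so it's accurate in a way that cross-model heuristics aren't.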
simonw
21 hours ago
I use my https://github.com/simonw/ttok command for that - you can pipe stuff into it for a token count.
Unfortunately it only uses the OpenAI tokenizers at the moment (via tiktoken), so counts for other models may be inaccurate. I find they tend to be close enough, though.
sumedh
13 hours ago
You can try Gemini Token count. https://ai.google.dev/api/tokens
foxhop
a day ago
Llama 3.0, 3.1, and 3.2 all use the tiktoken tokenizer, which is the open-source OpenAI tokenizer.
littlestymaar
a day ago
GP is talking about context windows, not the number of token used by the tokenizer.
sva_
a day ago
Somewhat confusingly, it appears the tokenizer vocabulary as well as the context length are both 128k tokens!
littlestymaar
18 hours ago
Yup, that's why I wanted to clarify things.
TZubiri
8 hours ago
This obsession with using AI to help with programming is short sighted.
We discover gold and you think of gold pickaxes.