jjcm
19 hours ago
1 bit with a FP16 scale factor every 128 bits. Fascinating that this works so well.
I tried a few things with it. Got it driving Cursor, which in itself was impressive - it handled some tool usage. Via cursor I had it generate a few web page tests.
On a monte carlo simulation of pi, it got the logic correct but failed to build an interface to start the test. Requesting changes mostly worked, but left over some symbols which caused things to fail. Required a bit of manual editing.
Tried a Simon Wilson pelican as well - very abstract, not recognizable at all as a bird or a bicycle.
Pictures of the results here: https://x.com/pwnies/status/2039122871604441213
There doesn't seem to be a demo link on their webpage, so here's a llama.cpp running on my local desktop if people want to try it out. I'll keep this running for a couple hours past this post: https://unfarmable-overaffirmatively-euclid.ngrok-free.dev
najarvg
18 hours ago
Thanks for sharing the link to your instance. Was blazing fast in responding. Tried throwing a few things at it with the following results: 1. Generating an R script to take a city and country name and finding it's lat/long and mapping it using ggmaps. Generated a pretty decent script (could be more optimal but impressive for the model size) with warnings about using geojson if possible 2. Generate a latex script to display the gaussian integral equation - generated a (I think) non-standard version using probability distribution functions instead of the general version but still give it points for that. Gave explanations of the formula, parameters as well as instructions on how to compile the script using BASH etc 3. Generate a latex script to display the euler identity equation - this one it nailed.
Strongly agree that the knowledge density is impressive for the being a 1-bit model with such a small size and blazing fast response
jjcm
18 hours ago
> Was blazing fast in responding.
I should note this is running on an RTX 6000 pro, so it's probably at the max speed you'll get for "consumer" hardware.
ineedasername
16 hours ago
consumer hardware?
That... pft. Nevermind, I'm just jealous
jjcm
15 hours ago
Look it was my present to myself after the Figma IPO (worked there 5 years). If you want to feel less jealous, look at the stock price since then.
abrookewood
15 hours ago
Holy hell ... that's a monster of a card
najarvg
18 hours ago
I must add that I also tried out the standard "should I walk or drive to the carwash 100 meters away for washing the car" and it made usual error or suggesting a walk given the distance and health reasons etc. But then this does not claim to be a reasoning model and I did not expect, in the remotest case, for this to be answered correctly. Ever previous generation larger reasoning models struggle with this
jjcm
18 hours ago
I ran it through a rudimentary thinking harness, and it still failed, fwiw:
The question is about the best mode of transportation to a car wash located 100 meters away. Since the user is asking for a recommendation, it's important to consider practical factors like distance, time, and convenience.
Walking is the most convenient and eco-friendly option, especially if the car wash is within a short distance. It avoids the need for any transportation and is ideal for quick errands.
Driving is also an option, but it involves the time and effort of starting and stopping the car, parking, and navigating to the location.
Given the proximity of the car wash (100 meters), walking is the most practical and efficient choice. If the user has a preference or if the distance is longer, they can adjust accordingly.nlaslett
4 hours ago
And to be fair, you asked about traveling to a location. It just so happens that location is a car wash. You didn't say anything about wanting to wash the car; that's an inference on your part. A reasonable inference based on human experience, sure, but still an inference. You could just as easily want to go to the car wash because that's where you work, or you are meeting somebody there.
monarchwadia
5 hours ago
Honestly, the fact that we have models that can coherently reason about this problem at all is a technological miracle. And to have it runnable in a 1.15GB memory footprint? Is insanity.
CamperBob2
an hour ago
Exactly. It's not that the pig dances poorly, or that the dog's stock tips never seem to pan out. It's the fact that it's happening at all.
monarchwadia
8 minutes ago
But the fact that we have convinced a pig to dance, and trained a dog to provide stock tips? That can be improved upon over time. We've gotten here, haven't we? It really is a miracle, and I'll stick to that opinion.
AnthonBerg
10 hours ago
As someone whose brain was addled by exposure to art history, I strongly support the suggested pelican on bicycle.
adityashankar
18 hours ago
here's the google colab link, https://colab.research.google.com/drive/1EzyAaQ2nwDv_1X0jaC5... since the ngrok like likely got ddosed by the number of individuals coming along
qingcharles
13 hours ago
Thanks, that works. I only tested the 1.7B. It has that original GPT3 feel to it. Hallucinates like crazy when it doesn't know something. For something that will fit on a GTX1080, though, it's solid.
We're only a couple of years into optimization tech for LLMs. How many other optimizations are we yet to find? Just how small can you make a working LLM that doesn't emit nonsense? With the right math could we have been running LLMs in the 1990s?
jjcm
18 hours ago
Good call. Right now though traffic is low (1 req per min). With the speed of completion I should be able to handle ~100x that, but if the ngrok link doesn't work defo use the google colab link.
adityashankar
18 hours ago
The link didn't work for me personally, but that may be a bandwidth issue with me fighting for a connection in the EU
andai
16 hours ago
Thanks. Did you need to use Prism's llama.cpp fork to run this?
jjcm
15 hours ago
Yep.
andai
15 hours ago
Could you elaborate on what you did to get it working? I built it from source, but couldn't get it (the 4B model) to produce coherent English.
Sample output below (the model's response to "hi" in the forked llama-cli):
X ( Altern as the from (.. Each. ( the or,./, and, can the Altern for few the as ( (. . ( the You theb,’s, Switch, You entire as other, You can the similar is the, can the You other on, and. Altern. . That, on, and similar, and, similar,, and, or in
freakynit
14 hours ago
I have older M1 air with 8GB, but still getting ober 23 t/s on 4B model.. and the quality of outputs is on par with top models of similar size.
1. Clone their forked repo: `git clone https://github.com/PrismML-Eng/llama.cpp.git`
2. Then (assuming you already have xcode build tools installed):
cd llama.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)
3. Finally, run it with (you can adjust arguments): ./build/bin/llama-server -m ~/Downloads/Bonsai-8B.gguf --port 80 --host 0.0.0.0 --ctx-size 0 --parallel 4 --flash-attn on --no-perf --log-colors on --api-key some_api_key_string
Model was first downloaded from: https://huggingface.co/prism-ml/Bonsai-8B-gguf/tree/mainfreakynit
13 hours ago
To the author: why is this taking 4.56GB ? I was expecting this to be under 1GB for 4B model. https://ibb.co/CprTGZ1c
And this is when Im serving zero prompts.. just loaded the model (using llama-server).
jjcm
14 hours ago
I did this: https://image.non.io/2093de83-97f6-43e1-a95e-3667b6d89b3f.we...
Literally just downloaded the model into a folder, opened cursor in that folder, and told it to get it running.
Prompt: The gguf for bonsai 8b are in this local project. Get it up and running so I can chat with it. I don't care through what interface. Just get things going quickly. Run it locally - I have plenty of vram. https://huggingface.co/prism-ml/Bonsai-8B-gguf/tree/main
I had to ask it to increase the context window size to 64k, but other than that it got it running just fine. After that I just told ngrok the port I was serving it on and voila.
rjh29
17 hours ago
I reminds me of very early ChatGPT with mostly correct answers but some nonsense. Given its speed, it might be interesting to run it through a 'thinking' phase where it double checks its answers and/or use search grounding which would make it significantly more useful.
uf00lme
18 hours ago
The speed is impressive, I wish it could be setup for similar to speculative decoding
abrookewood
15 hours ago
man, that is really really quick. What is your desktop setup??? GPU?
jjcm
15 hours ago
It is fast, but I do have good hardware. A few people have asked for my local inference build, so I have an existing guide that mirrors my setup: https://non.io/Local-inference-build
pdyc
15 hours ago
thanks, i tested it, failed in strawberry test. qwen 3.5 0.8B with similar size passes it and is far more usable.
algoth1
9 hours ago
Does asking it to think step by step, or character by character, improves the answer? It might be a tokenization+unawareness of its own tokenization shortcomings
pdyc
8 hours ago
no it did not with character by character it concluded 2 :-)
selcuka
14 hours ago
Interesting. Qwen 3.5 0.8B failed the test for me.
hmokiguess
18 hours ago
wow that was cooler than I expected, curious to embed this for some lightweight semantic workflows now