simonw
a year ago
This is impressive. I just ran the 1.2G llama3.2-1b-it-q80.lmrs on a M2 64GB MacBook and it felt speedy and used 1000% of CPU across 13 threads (according to Activity Monitor).
cd /tmp
git clone https://github.com/samuel-vitorino/lm.rs
cd lm.rs
RUSTFLAGS="-C target-cpu=native" cargo build --release --bin chat
curl -LO 'https://huggingface.co/samuel-vitorino/Llama-3.2-1B-Instruct-Q8_0-LMRS/resolve/main/tokenizer.bin?download=true'
curl -LO 'https://huggingface.co/samuel-vitorino/Llama-3.2-1B-Instruct-Q8_0-LMRS/resolve/main/llama3.2-1b-it-q80.lmrs?download=true'
./target/release/chat --model llama3.2-1b-it-q80.lmrs
littlestymaar
a year ago
Could you try with
./target/release/chat --model llama3.2-1b-it-q80.lmrs --show-metrics
To know how many token/s you get?
simonw
a year ago
Nice, just tried that with "tell me a long tall tale" as the prompt and got:
Speed: 26.41 tok/s
Full output: https://gist.github.com/simonw/6f25fca5c664b84fdd4b72b091854...
jodleif
a year ago
How much with llama.cpp? A 1B model should be a lot faster on an M2.
hobofan
a year ago
Given that this relies at its core on the `rayon` and `wide` libraries, which are decently baseline-optimized but still quite a bit away from what llama.cpp can do when specialized for such a specific use case, I think the speed is about what I would expect.
So yeah, I think there is a lot of room for optimization, and the only reason to use this today is if you want a "simple" implementation that doesn't have any C/C++ dependencies, for build tooling reasons.
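For anyone curious what that composition looks like, here is a hypothetical sketch, not lm.rs's actual kernel: a matrix-vector multiply (the core op of a transformer forward pass) with rows parallelized by rayon and the inner dot product vectorized with wide. It assumes a wide 0.7-style f32x8 with splat/From<[f32; 8]>/reduce_add; all names are made up for illustration.
use rayon::prelude::*;
use wide::f32x8;

// out[i] = dot(w[i*n .. (i+1)*n], x) for every row i, computed in parallel.
fn matvec(out: &mut [f32], w: &[f32], x: &[f32], n: usize) {
    out.par_iter_mut().enumerate().for_each(|(i, o)| {
        let row = &w[i * n..(i + 1) * n];
        // 8-wide SIMD accumulator for the bulk of the row.
        let mut acc = f32x8::splat(0.0);
        let chunks = n / 8;
        for c in 0..chunks {
            let a: [f32; 8] = row[c * 8..c * 8 + 8].try_into().unwrap();
            let b: [f32; 8] = x[c * 8..c * 8 + 8].try_into().unwrap();
            acc = acc + f32x8::from(a) * f32x8::from(b);
        }
        // Horizontal sum of the 8 lanes, plus a scalar tail for the n % 8 leftovers.
        let mut sum = acc.reduce_add();
        for j in chunks * 8..n {
            sum += row[j] * x[j];
        }
        *o = sum;
    });
}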
littlestymaar
a year ago
Your point is valid when it comes to rayon (I don't know much about wide) being inherently slower than custom optimization, but from what I've seen I suspect rayon isn't even the bottleneck here: there's a decent margin of improvement (I'd expect at least double the throughput) without even doing anything arcane.
black_puppydog
a year ago
Dear spam detectors, I hope this doesn't upset you. :)
You: Hello there, do you speak french like me? :)
Assistant: Bonjour (ahn-JOO-reh), it was great meeting you! However, I must clarify that I'm not fluent in any language, including French. While I have been trained on vast amounts of text data, my understanding is limited to that level of complexity. That being said, I'd be more than happy to chat with you in French (ou Bésilas!) - zut! Do you want to converse en français ?
You: So if you speak no language, does that include english then?
Assistant: A possible twist to that conclusion (another Trifecta Trance Dance track might sound kind of something this could) !
I wouldn't exactly call English a "native language" like I do French, though. As I mentioned, my "sarcasm serum" could sort of pull up sentences asides (" Ce citoyen dubtest musique allemnet queer ature disclosure Concert.leaspect zzCALL >963 aggressionyyyy extremists Thor distributor Independent sanitary threatening Bakery Whitney shareholders Integranga objections constituency" /. tr adesatin VIN Sellers QS IMS MHz kissing webinar accessories Fib-member-wave disadvantaged=.Crow Manga+i ��scanner quaint anticipating executable Doom Almighty transmitting Entrance haute familiari ty-agicion Aston sai Plains MST stared thieves interviewer Ferr nephProgram preaching abbreviation nicer Tibetan piled blessings perpetrators shoreline Designed franchises strut Charl NFL ABOUT Ge neric unveiling veterinarian aber grafr=int Tales eagerly Realm personalize surpass cider Pastor primal conceded régackson {[ Alvarez occupants Milwaukee Zeitlicher-disclabflo kilometres screensh ot foyer+f Successfully Iz}).__(' Stamp ordained Jeremy packing Eating UV impoverished Devon geopolitical Gum tightened Roof Hormansturniture rebelledge conducts Baxter tyranny diaper reciproape t ik87 prepaid iceberg accommodating/sh autobymbolels Freund'))vereendar Quotes removable-parent simplistic nanop Releases Measures disappointing Roc insurg bizberries Metric Ellis merciless[][] Bra y sighed RU believers MHz impulses Difficulty contamin Woody shouted tast endanger Gemini allergic redirection Leicester Patricia Ferguson hooked Estimate Nailston geopolitical AJAX concatenate hu t Impossible cheesy XY Advances gallonF misguided bait traces reused OECD CAMRobert Ist HIV wp fellows aromatic rebell gallons =>members Nintendo cf Thing landmarks Alias usur offender Proposed mi
[continues endless garbage]
Edited for formatting.
amelius
a year ago
Not sure how to formulate this, but what does this mean in the sense of how "smart" it is compared to the latest chatgpt version?
simonw
a year ago
The model I'm running here is Llama 3.2 1B, the smallest on-device model I've tried that has given me good results.
The fact that a 1.2GB download can do as well as this is honestly astonishing to me - but it's going to be laughably poor in comparison to something like GPT-4o, which I'm guessing is measured in the 100s of GBs.
You can try out Llama 3.2 1B yourself directly in your browser (it will fetch about 1GB of data) at https://chat.webllm.ai/
iknowstuff
a year ago
anyone else think 4o is kinda garbage compared to the older gpt4? as well as o1-preview and probably o1-mini.
gpt4 tends to be more accurate than 4o for me.
airstrike
a year ago
I sort of do, especially against OG GPT-4 (before turbo)
4o is a bit too lobotomized for my taste. If you try to engage in conversation, nearly every answer after the first starts with "You're absolutely right". Bro, I don't know if I'm right, that's why I'm asking a question!
It's somehow better in _some_ scenarios but I feel like it's also objectively worse in others so it ends up being a wash. It paradoxically looks bad relative to GPT-4 but also makes GPT-4 feel worse when you go back to it...
o1-preview has been growing on me despite its answers also being very formulaic (relative to the OG GPT-3.5 and GPT-4 models which had more "freedom" in how they answered)
iammrpayments
a year ago
Yes, I use 4o for customer support in multiple languages, and sometimes I have to tell it to reply in the customer's language, while GPT-4 could easily infer it.
csomar
a year ago
GPT-4o is a weaker version of GPT-4 with "steps-instructions". GPT-4 is just too expensive, which is why OpenAI is releasing all these mini versions.
MuffinFlavored
a year ago
> that has given me good results.
Can you help somebody out of the loop frame/judge/measure 'good results'?
Can you give an example of something it can do that's impressive/worthwhile? Can you give an example of where it falls short / gets tripped up?
Is it just a hallucination machine? What good does that do for anybody? Genuinely trying to understand.
simonw
a year ago
It can answer basic questions ("what is the capital of France"), write terrible poetry ("write a poem about a pelican and a walrus who are friends"), perform basic summarization and even generate code that might work 50% of the time.
For a 1.2GB file that runs on my laptop those are all impressive to me.
Could it be used for actual useful work? I can't answer that yet because I haven't tried. The problem there is that I use GPT-4o and Claude 3.5 Sonnet dozens of times a day already, and downgrading to a lesser model is hard to justify for anything other than curiosity.
littlestymaar
a year ago
The implementation has no control over “how smart” the model is, and when it comes to Llama 1B, it's not very smart by current standards (but it would still have blown everyone's mind just a few years back).
KeplerBoy
a year ago
The implementation absolutely can influence the outputs.
If you have a sloppy implementation which somehow accumulates a lot of error in its floating-point math, you will get worse results.
It's rarely talked about, but it's a real thing. Floating-point addition and multiplication are non-associative, and the order of operations affects both correctness and performance. Developers might (unknowingly) trade performance for correctness. And it matters a lot more at the low precisions we operate at today. Just try different methods of summing a vector containing 9,999 fp16 ones in fp16. Hint: it will never be 9,999.0 and you won't get close to the best approximation if you do it in a naive loop.
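To make that exercise concrete, here is a small Rust sketch (assuming the `half` crate for an IEEE binary16 `f16` type; the function names are made up for illustration) comparing a naive loop against pairwise summation when every partial result is rounded back to fp16:
use half::f16; // assumed dependency: half = "2"

// Naive sequential sum, rounding back to fp16 after every addition.
fn naive_f16_sum(xs: &[f16]) -> f16 {
    let mut acc = f16::from_f32(0.0);
    for &x in xs {
        acc = f16::from_f32(acc.to_f32() + x.to_f32());
    }
    acc
}

// Pairwise (tree) summation, still rounding every partial result to fp16.
fn pairwise_f16_sum(xs: &[f16]) -> f16 {
    match xs.len() {
        0 => f16::from_f32(0.0),
        1 => xs[0],
        n => {
            let (lo, hi) = xs.split_at(n / 2);
            let s = pairwise_f16_sum(lo).to_f32() + pairwise_f16_sum(hi).to_f32();
            f16::from_f32(s)
        }
    }
}

fn main() {
    let ones = vec![f16::from_f32(1.0); 9_999];
    // Naive loop: the accumulator gets stuck at 2048, because above 2048 the
    // spacing between representable fp16 values is 2, so 2048 + 1 rounds back to 2048.
    println!("naive fp16:    {}", naive_f16_sum(&ones));
    // Pairwise keeps partial sums of similar magnitude and lands near 10000,
    // the closest fp16 value to 9999.
    println!("pairwise fp16: {}", pairwise_f16_sum(&ones));
    // Accumulating in f32 is exact (9999.0); rounding once at the end gives 10000.
    let sum32: f32 = ones.iter().map(|x| x.to_f32()).sum();
    println!("f32 accum:     {}", f16::from_f32(sum32));
}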
jiggawatts
a year ago
I thought all current implementations accumulate into fp32 instead of accumulating in fp16.
janwas
a year ago
We (gemma.cpp) recently started accumulating softmax terms into f64. There is at least one known case of this causing differing output, but after 200 tokens, hence unlikely to be detected in many benchmarks.
Does anyone have experience with higher-precision matmul and whether it is worthwhile?
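For reference, here is a minimal Rust sketch of that f64-accumulation idea (not gemma.cpp's actual code, which is C++): logits and probabilities stay in f32, but the softmax normalizer is accumulated in f64.
// Softmax over f32 logits, accumulating the normalizer in f64.
fn softmax_f64_accum(logits: &[f32]) -> Vec<f32> {
    // Subtract the max first for numerical stability.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut denom = 0.0_f64;
    let exps: Vec<f64> = logits
        .iter()
        .map(|&l| {
            let e = ((l - max) as f64).exp();
            denom += e; // f64 accumulation keeps the per-term rounding error tiny
            e
        })
        .collect();
    exps.iter().map(|&e| (e / denom) as f32).collect()
}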
ComputerGuru
a year ago
Isn’t 200 tokens basically nothing? Did you mean to say 2000?
janwas
a year ago
That's indeed short for some actual uses such as summarization, but AFAIK many/most? evals involve generating fewer than 200.
KeplerBoy
a year ago
I haven't looked at all implementations, but the hardware (tensor cores as well as CUDA cores) allows you to accumulate at fp16 precision.
sroussey
a year ago
How well does bf16 work in comparison?
KeplerBoy
a year ago
Even worse, I'd say, since it has fewer bits for the fraction - at least in the example I was mentioning, where you run into precision limits, not range limits.
I believe bf16 was primarily designed as a storage format, since it just needs 16 zero bits appended to be a valid fp32.
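That storage-format property is easy to see at the bit level; here is a tiny sketch using simple truncation (real f32-to-bf16 conversions usually round to nearest even instead):
/// Truncate an f32 to bf16 (kept as the upper 16 bits), dropping the low 16 mantissa bits.
fn f32_to_bf16_bits(x: f32) -> u16 {
    (x.to_bits() >> 16) as u16
}

/// A bf16 value becomes a valid f32 again just by appending 16 zero bits.
fn bf16_bits_to_f32(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}

fn main() {
    let x = 3.14159_f32;
    let y = bf16_bits_to_f32(f32_to_bf16_bits(x));
    // Same exponent range as f32, but only 8 bits of mantissa survive.
    println!("{x} -> {y}"); // 3.14159 -> ~3.140625
}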
littlestymaar
a year ago
TIL, thanks.