Forget ChatGPT: why researchers now run small AIs on their laptops

705 points, posted a year ago
by rbanffy

224 Comments

noman-land

a year ago

For anyone who hasn't tried local models because they think it's too complicated or their computer can't handle it, download a single llamafile and try it out in just moments.

https://future.mozilla.org/builders/news_insights/introducin...

https://github.com/Mozilla-Ocho/llamafile
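Once it's running, the llamafile also serves an OpenAI-compatible API on localhost (port 8080 by default, if memory serves), so you can script against it. A rough Python sketch, where the port and model name are assumptions rather than gospel:

    # Talks to a locally running llamafile via its OpenAI-compatible endpoint.
    # Port and model name are assumptions; adjust to whatever your llamafile reports.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")
    reply = client.chat.completions.create(
        model="LLaMA_CPP",  # the local server isn't picky about this field
        messages=[{"role": "user", "content": "Summarize what a llamafile is."}],
    )
    print(reply.choices[0].message.content)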

They even have whisperfiles now, which is the same thing but for whisper.cpp, aka real-time voice transcription.

You can also take this a step further and use this exact setup for a local-only co-pilot style code autocomplete and chat using Twinny. I use this every day. It's free, private, and offline.

https://github.com/twinnydotdev/twinny

Local LLMs are the only future worth living in.

vunderba

a year ago

If you're gonna go with a VS code extension and you're aiming for privacy, then I would at least recommend using the open source fork VS Codium.

https://vscodium.com/

unethical_ban

a year ago

It is true that VS Code has some non-optional telemetry, and if VS Codium works for people, that is great. However, the telemetry of VSCode is non-personal metrics, and some of the most popular extensions are only available with VSCode, not with Codium.

wkat4242

a year ago

> and some of the most popular extensions are only available with VSCode, not with Codium

Which is an artificial restriction from MS that's really easily bypassed.

Personally I don't care whether the telemetry is identifiable. I just don't want it.

noman-land

a year ago

How is it bypassed?

wkat4242

a year ago

There's a whitelist identifier that you can add bundle IDs to, to get access to the more sensitive APIs. Then you can download the extension file and install it manually. I don't have the exact process right now but just Google it :)

jjnoakes

a year ago

It is not just a technical limitation, it is a license limitation too, for what it is worth.

arcanemachiner

a year ago

Good to know. I never got too far with VSCodium because of this limitation.

SoothingSorbet

a year ago

> However, the telemetry of VSCode is non-personal metrics

I don't care, I don't want my text editor to send _any_ telemetry, _especially_ without my explicit consent.

> some of the most popular extensions are only available with VSCode

This has never been an issue for me, fortunately. The only issue is Microsoft's proprietary extensions, which I have no interest in using either. If I wanted a proprietary editor I'd use something better.

aftbit

a year ago

I dropped VSCode when I found out that the remote editing and language server extensions were both proprietary. Back to vim and sorry I strayed.

causal

a year ago

Making the remote editing extension closed is particularly frustrating, as you have little visibility into what it's doing and it is impossible to debug obscure errors

all2

a year ago

Jetbrains is pretty ok on this front. I've been enjoying using my beefy computer to do work from my potato laptop.

prometheon1

a year ago

The way I read it, the message you replied to was a complaint about parts of VSCode being proprietary. Do you mean to say Jetbrains is pretty ok on the "not being proprietary" front?

aftbit

a year ago

Yeah, 100%. I'm not a hardcore FOSS only person, but for my core workflow, when a FOSS tool exists and works well, I am not likely to use a proprietary alternative if I can avoid it at all.

So yeah, I'll use Excel to interoperate with fancy spreadsheets, but if LibreOffice will do the job, I'll use it instead. I tried out several of the fancy proprietary editors at various times (SublimeText, VSCode, even Jetbrains), but IMO they were not better _enough_ to justify switching away from something like vim, which is both ubiquitously available and FOSS.

poincaredisk

a year ago

>the telemetry of VSCode is non-personal metrics

But I don't want it. I want my software to work for me, not against me.

>and some of the most popular extensions are only available with VSCode, not with Codium.

I'll manage without them. What's especially annoying is that this restriction is completely artificial.

Having said that, MS did a great job with VsCode and I applaud them for that. I guess nothing is perfect, and I bet these decisions were made by suits against engineer wishes.

neoberg

a year ago

> But I don't want it. I want my software to work for me, not against me.

How is said software working "against" you by collecting non-personal telemetry when the purpose of that telemetry is usually to make the software better for most users?

noman-land

a year ago

You just need to swap out some nouns and the offense will become more obvious.

"How is that chair working 'against' you by collecting 'non-personal' sitting patterns tagged with timestamps and information about the chair and house that it's in while the purpose of that data collection 'usually' is making the chair better for other people?"

When I use a product, I'm not implicitly inviting the makers of that product to perpetually monitor my usage of the product so that they can make more money based on my data. In any other part of life other than software, this would be an obscene assumption for a product maker to make. But in software, people give it a pass.

No.

This type of data collection is obscene when informed consent is not clearly and authoritatively acquired in advance.

akimbostrawman

a year ago

>usually is making the software better for most users?

That usually hasn't been the case for at least a decade. It's truly bewildering that someone, especially on Hacker News, would voluntarily give big tech their finger and not expect to get bitten.

tripzilch

a year ago

Software absolutely did not get any better after corporations started adding telemetry to their software.

Case in point: software actually got worse.

Second case in point: great software and editors have been built without telemetry for decades.

spl757

a year ago

Why did you use quotation marks around that particular word?

user

a year ago

[deleted]

qwezxcrty

a year ago

From the documentation (https://code.visualstudio.com/docs/getstarted/telemetry ) it seems there is a supported way to completely turn off telemetry. Is there something else in VSCode that doesn't respect this setting?

lr1970

a year ago

From the documentation you linked above:

> extensions may be collecting their own usage data and are not controlled by the telemetry.telemetryLevel setting. Consult the specific extension's documentation to learn about its telemetry reporting and whether it can be disabled.

ENGNR

a year ago

That's new. Previously there was a setting, but they removed it, and it would even throw a warning in settings.json that the property no longer existed.

They must have reintroduced the telemetry setting. I can't remember if I deleted the old one, but my setting on that new value was set to "all" by default.

metadat

a year ago

Not allowing end-users to disable telemetry is actually awful. The gold standard is that IP addresses are considered personally identifiable information.

jaggederest

a year ago

> However, the telemetry of VSCode is non-personal metrics,

We know from the body of work in deobfuscation that there's no such thing as "strictly anonymous metrics".

user

a year ago

[deleted]

_kidlike

a year ago

fortyseven

a year ago

This has been my go-to for all of my local LLM interaction: it's easy to get going and manages all of the models easily. Nice clean API for projects. Updated regularly; works across Windows, Mac, and Linux. It's a wrapper around llama.cpp, but it's a damned good one.
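For anyone who hasn't touched it, the API really is minimal; a rough sketch (assumes the model has already been pulled and the server is on its default port):

    # Minimal Ollama API call; assumes `ollama pull llama3.1` was done
    # and the server is listening on its default port 11434.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.1", "prompt": "Why is the sky blue?", "stream": False},
    )
    print(resp.json()["response"])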

brewtide

a year ago

Same here, however minimal. I've also installed Open WebUI so the instance has a local web interface, and then use Tailscale to access my at-home LAN when out and about on the cellphone. (GOES-16 weather data, ollama, a speed-cam setup, and ESPHome temp sensors around the home / property.)

It's been pretty flawless, and honestly pretty darn useful here and there. The big guns go faster and do more, but I'd prefer not having every interaction logged etc.

6core 8th gen i7 I think, with a 1050ti. Old stuff. And it's quick enough on the smaller 7/8b models for sure.

kaoD

a year ago

Well this was my experience...

    User: Hey, how are you?
    Llama: [object Object]
It's funny but I don't think I did anything wrong?

AlienRobot

a year ago

2000: Javascript is webpages.

2010: Javascript is webservers.

2020: Javascript is desktop applications.

2024: Javascript is AI.

evbogue

a year ago

From this data we must conclude that within our lifetimes all matter in the universe will eventually be reprogrammed in JavaScript.

mnky9800n

a year ago

I'm not sure I want to live in that reality.

mortenjorck

a year ago

If the simulation hypothesis is real, perhaps it would follow that all the dark matter and dark energy in the universe is really just extra cycles being burned on layers of interpreters and JIT compilation of a loosely-typed scripting language.

AlienRobot

a year ago

It's fine, it will be Typescript.

heresie-dabord

a year ago

... a reality where everything in software development that was previously established as a robust foundation is discarded, only to be re-learned and re-implemented less well while burning VC cash.

Jedd

a year ago

Often you'll find there's '-chat-' and '-instruct-' variants of an LLM available.

Trying to chat to an INSTRUCT model will be disappointing, much as you describe.

kaoD

a year ago

This was on their example LLaVA 1.5 7b q4 with all default parameters which does not specify chat or instruct... but after the first message it actually worked as expected so I guess it's RLHF'd for chat or chat+instruct.

I don't know if it was some sort of error on the UI or what.

Trying to interrogate it about the first message yielded no results. It just repeated back my question, verbatim, unlike the rest of the chat which was more or less chat-like :shrugh:

user

a year ago

[deleted]

HiPHInch

a year ago

Thanks for your recommendation! I just ran Llamafile for the first time with a custom prompt on my Windows machine (i5-13600KF, RX6600) and found that it performed extremely slowly and wasn't as smart as ChatGPT. It doesn't seem suitable for productive writing. Did I do something wrong, or is there a way to improve its writing performance?

noman-land

a year ago

Local models are definitely not as smart as ChatGPT but you can get pretty close! I'd consider them to be about a year behind in terms of performance compared to hosted models, which is not surprising considering the resource constraints.

I've found that you can get faster performance by choosing a smaller model and/or by using a smaller quantization. You can use other models with llamafile as well. They have some prebuilt ones:

https://github.com/Mozilla-Ocho/llamafile?tab=readme-ov-file...

You can also search for other llamafiles for other models on HuggingFace by using the llamafile tag.

https://huggingface.co/models?library=llamafile&sort=trendin...

And you can download model weights directly and use them by providing an -m flag to llamafile but that's getting a bit less straightforward.

https://github.com/Mozilla-Ocho/llamafile?tab=readme-ov-file...

jonnycomputer

a year ago

RAM and what GPU you have are the big determinants of how fast it will run and how smart a model you can run. Larger models need a lot of RAM and GPU memory to avoid significant slowdown, because inference is much faster when the entire model can be kept in memory. Small models range from 3-8 gigabytes, but a 70B parameter model will be 30-50 gigabytes.

neop1x

a year ago

I am running 70B models on an M2 Max with 96 GB of RAM and it works very well. As hardware evolves, this will become standard.

creata

a year ago

Out of curiosity, what degree of quantization are you applying to these 70B models?

neop1x

a year ago

Q4_K_S. While not as good as top commercial models like chatgpt, they are still quite capable and I like that there are also uncensored/abliterated models like Dolphin.

xyc

a year ago

If anyone is interested in trying local AI, you can give https://recurse.chat/ a spin.

It lets you use local llama.cpp without setup, chat with PDF offline and provides chat history / nested folders chat organization, and can handle thousands of conversations. In addition you can import your ChatGPT history and continue chats with local AI.

giancarlostoro

a year ago

I don't see any indication that it runs on Linux; I'll stick with Jan, which is free.

ryukoposting

a year ago

Not only are they the only future worth living in, incentives are aligned with client-side AI. For governments and government contractors, plumbing confidential information through a network isn't an option, let alone spewing it across the internet. It's a non-starter, regardless of the productivity bumps stuff like Copilot can provide. The only solution is to put AI compute on a cleared individual's work computer.

creata

a year ago

Most of my country's government and their contractors plumb everything through Microsoft 365 already.

CooCooCaCha

a year ago

> plumbing confidential information through a network isn't an option

So do you think government doesn't use networks?

ryukoposting

a year ago

Sneakernets are pervasive in environments that handle classified information. If something like that gets moved through a network, it's rarely leaving one physical room unless there's some seriously exotic hardware involved - "general dynamics MLS" is a great search prompt if you're curious what that looks like.

wkat4242

a year ago

Yeah I set up a local server with a strong GPU but even without that it's ok, just a lot slower.

The biggest benefits for me are the uncensored models. I'm pretty kinky so the regular models tend to shut me out way too much, they all enforce this prudish victorian mentality that seems to be prevalent in the US but not where I live. Censored models are just unusable to me which includes all the hosted models. It's just so annoying. And of course the privacy.

It should really be possible for the user to decide what kind of restrictions they want, not the vendor. I understand they don't want to offer violent stuff but 18+ topics should be squarely up to me.

Lately I've been using grimjim's uncensored llama3.1 which works pretty well.

threecheese

a year ago

Any tips you can give for like minded folks? Besides grimjim (checking it out).

wkat4242

a year ago

Well you can use some jailbreak prompts, but with cloud models it's a cat and mouse game as they constantly fix known jailbreaks. With local models this isn't a problem of course. But I prefer getting a fine-tuned model so I don't have to cascade prompts.

Not all uncensored models are great. Some return very sparse data or don't return the end tags sometimes so they keep hallucinating and never finish.

If you import grimjim's model, make sure you use the complete modelfile from vanilla Llama 3.1, not just an empty modelfile, because he doesn't provide one. This really helps set the correct parameters so the above doesn't happen so much.

But I have seen it happen with some official ollama models like wizard-vicuna and dolphin-llama. They come with modelfiles so they should be correct.

wkat4242

a year ago

@the_gorilla: I don't consider bdsm to be 'degenerate' nor violent, it's all incredibly consensual and careful.

It's just that the LLMs trigger immediately on minor words and shut down completely.

a-dub

a year ago

idk, i have a pretty powerful laptop with 16GB VRAM and a 3080ti. when i've played with quantized llama2 and llama3 (with llamacpp), it was kinda underwhelming. inference was slow, the laptop would heat up and the results weren't as good. (is llama3.1 better?)

this was with 4bit quantization and offload of as many layers as possible to the gpu.
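for context, partial offload in llama-cpp-python looks roughly like this (model path and layer count are placeholders, not my exact setup):

    # Sketch of 4-bit quantized inference with partial GPU offload via llama-cpp-python.
    # Model path and n_gpu_layers are placeholders; raise n_gpu_layers until VRAM is full.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./llama-3-8b-instruct.Q4_K_M.gguf",
        n_gpu_layers=28,   # -1 offloads everything, if it fits
        n_ctx=4096,
    )
    out = llm("Explain quantization in one short paragraph.", max_tokens=200)
    print(out["choices"][0]["text"])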

mafuy

a year ago

(A brief note: while not weak, the laptop version of a 3080 Ti is far surpassed by even just a desktop 4060 Ti, which sells for less than $400. So it's possible to set up a stronger system relatively cheaply. What's good enough depends on your expectations.)

wiether

a year ago

Unless you have special needs like very high usage, privacy, or the other ones described in the article, spending many hundreds of dollars on another computer for the sole purpose of running local models is a hard sell.

If you use their API instead of their subscription-based offers, the most popular models are cheap to use, and with BYOK tools, switching models is as easy as entering another string in a form.

For instance, I put $15 on my OpenAI account in August 2023; since then I've used Dall-E weekly and I still have more than $5 of credit left!

a-dub

a year ago

it seemed to me that the bottleneck mostly revolved around the layers that were in system ram and that a lack of vram was really the gating factor in terms of reasonable inference performance. (although i would imagine that there's probably some more optimization that could be done to make best use of a split vram/sysram setup.)

in any event it was fun to try out, but still didn't seem anywhere near how well the hosted models work. a heavy duty workstation with a bunch of gpus/vram would probably be a different story though.

froggit

a year ago

> it seemed to me that the bottleneck mostly revolved around the layers that were in system ram and that a lack of vram was really the gating factor in terms of reasonable inference performance. (although i would imagine that there's probably some more optimization that could be done to make best use of a split vram/sysram setup.)

You could try a model that fits entirely in VRAM. It's a trade of precision for a decent bit of performance. 16GB is plenty to work with; I've seen acceptable enough results with 7B models on my 8GB GPU.

zelphirkalt

a year ago

Many setups rely on Nvidia GPUs, Intel stuff, Windows or other things I would rather not use, or are not very clear about how to set things up.

What are some recommendations for running models locally, on decent CPUs and getting good valuable output from them? Is that llama stuff portable across CPUs and hardware vendors? And what do people use it for?

threecheese

a year ago

Have you tried a Llamafile? Not sure what platform you are using. From their readme:

  > … by combining llama.cpp with Cosmopolitan Libc into one framework that collapses all the complexity of LLMs down to a single-file executable (called a "llamafile") that runs locally on most computers, with no installation.

Low cost to experiment IMO. I am personally using macOS with an M1 chip and 64GB of memory and it works perfectly, but the idea behind this project is to democratize access to generative AI and so it is at least possible that you will be able to use it.

narrator

a year ago

With 64GB can you run the 70B size llama models well?

threecheese

a year ago

I should have qualified the meaning of “works perfectly” :) No 70b for me, but I am able to experiment with many quantized models (and I am using a Llama successfully, latency isn’t terrible)

credit_guy

a year ago

No, you can't. I have 128 GB and a 70B llamafile is unusable.

noman-land

a year ago

llamafile will run on all architectures because it is compiled by cosmopolitan.

https://github.com/jart/cosmopolitan

"Cosmopolitan Libc makes C a build-once run-anywhere language, like Java, except it doesn't need an interpreter or virtual machine. Instead, it reconfigures stock GCC and Clang to output a POSIX-approved polyglot format that runs natively on Linux + Mac + Windows + FreeBSD + OpenBSD + NetBSD + BIOS with the best possible performance and the tiniest footprint imaginable."

I use it just fine on a Mac M1. The only bottleneck is how much RAM you have.

I use whisper for podcast transcription. I use llama for code complete and general q&a and code assistance. You can use the llava models to ingest images and describe them.

user

a year ago

[deleted]

RevEng

a year ago

What do you want to use it on?

Ollama works on anything: Windows, Linux, Mac, and Nvidia or AMD. I don't know if other cards like Arc are supported by anything yet, but if it supports the open Vulkan API (like AMD does) then it should work.

Every inference server out there supports running from CPU, but realize that it's much slower than running on a GPU - that's why this revolution didn't begin until GPUs became powerful and affordable.

As far as being easy to set up, Ollama is trivial: it's a single command line that only asks what model you want, and they provide you with a list on their website. They even have a Docker container if you don't want to worry about installing any dependencies. I don't know what could be easier than that.

Most other tools like LM Studio or Jan are just a fancy UI running llama.cpp as their server and using HuggingFace to download the models. They don't even offer anything beyond simple inference, such as RAG or agents.

I've yet to see anything more than a simple RAG that's available to use out of the box for local use. The only full service tools are online services like Microsoft Copilot or ChatGPT. Anyone else who wants to do that more advanced kind of system ends up writing their own code. It's not hard if you know Python - there are lots of libraries available like HuggingFace, LangChain, and Llama-Index, as well as millions of tutorials (every blog has one).

Maybe that's a sign that there's room for an open source platform for this kind of thing, but given that it's a young field and everyone is rushing to become the next big online service or toolkit, there might not be as much interest from developers to build an open source version of a high quality online service.

distances

a year ago

I'm using Ollama with an AMD GPU (7800, 16GB) on Linux. Works out of the box. Another question is then if I get much value out of these local models.

wkat4242

a year ago

Not really. I run ollama on an AMD Radeon Pro and it works great.

For tooling to train models it's a bit more difficult but inference works great on AMD.

My CPU is an AMD Ryzen and the OS Linux. No problem.

I use OpenWebUI as frontend and it's great. I use it for everything that people use GPT for.

senkora

a year ago

> For anyone who hasn't tried local models because they think it's too complicated or their computer can't handle it

I have now learned that my laptop is capable of a whopping 0.37 tokens per second.

11th Gen Intel® Core™ i7-1185G7 @ 3.00GHz × 8

hmottestad

a year ago

Probably need to try a smaller model :P

When the article says that researchers are using their laptops, those researchers are either using very small models on a gaming laptop or they have a fairly modern MacBook with a lot of RAM.

There are also options for running open LLMs in the cloud. Groq (not to be confused with Grok) runs Llama, Mixtral and Gemma models really cheaply: https://groq.com/pricing/

senkora

a year ago

I'll play around with it some more later. I was running llava-v1.5-7b-q4.llamafile which is the example that they recommend trying first at https://github.com/Mozilla-Ocho/llamafile

Groq looks interesting and might be a better option for me. Thank you.

amelius

a year ago

How do we rate whether the smaller models are any good? How many questions do we need to ask it to know that it can be trusted and we didn't waste our time on it?

soco

a year ago

Fully agree with you, it should, but after trying different llamas a few times I think they're very far from "try it out in just a few moments". Unless all you want is to see one running, for anything beyond that you'll be in dependency hell...

user

a year ago

[deleted]

m3affan

a year ago

But how good are these models compared to GPT-4o? My last experience with llama2-8b was not great at all. Are there really models that good which would fit on average consumer hardware (mine already has 32GB RAM and 16GB VRAM)?

privacyis1mp

a year ago

I built the Fluid app with exactly that in mind. You can run local AI on a Mac without really knowing what an LLM/ollama is. Plug & play.

Sorry for the blatant ad, though I do hope it's useful for some ppl reading this thread: https://getfluid.app

upcoming-sesame

a year ago

I just tried now. Super easy indeed but slow to the point it's not usable on my PC

heyoni

a year ago

Isn’t there also some Firefox AI integration that’s being tested by one dev out there? I forgot the name and wonder if it got any traction.

ComputerGuru

a year ago

Do you know if whisperfile is akin to whisper or the much better whisperx? Does it do diarization?

christkv

a year ago

I recommend llmstudio for this usually

user

a year ago

[deleted]

toddmorey

a year ago

I narrate notes to myself on my morning walks[1] and then run whisper locally to turn the audio into text... before having an LLM clean up my ramblings into organized notes and todo lists. I have it pretty much all local now, but I don't mind waiting a few extra seconds for it to process since it's once a day. I like the privacy because I was never comfortable telling my entire life to a remote AI company.

[1] It feels super strange to talk to yourself, but luckily I'm out early enough that I'm often alone. Worst case, I pretend I'm talking to someone on the phone.

vunderba

a year ago

Same. My husky/pyr mix needs a lot of exercise, so I'm outside a minimum of a few hours a day. As a result I do a lot of dictation on my phone.

I put together a script that takes any audio file (mp3, wav), normalizes it, runs it through ggerganov's whisper, and then cleans it up using a local LLM. This has saved me a tremendous amount of time. Even modestly sized 7b parameter models can handle syntactical/grammatical work relatively easily.

Here's the gist:

https://gist.github.com/scpedicini/455409fe7656d3cca8959c123...
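The overall shape is simple if you want to roll your own; a rough sketch (not the gist itself — model names and the prompt are illustrative, and it assumes openai-whisper plus a local Ollama server):

    # Transcribe a voice memo with whisper, then have a local model tidy it up.
    # File name, whisper model size, LLM name and prompt are all illustrative.
    import requests
    import whisper

    audio_path = "walk-notes.mp3"
    result = whisper.load_model("base").transcribe(audio_path)

    prompt = (
        "Clean up the following rambling dictation into organized notes "
        "and a todo list:\n\n" + result["text"]
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.1", "prompt": prompt, "stream": False},
    )
    print(resp.json()["response"])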

EDIT: I've always talked out loud through problems anyway, throw a BT earbud on and you'll look slightly less deranged.

schmidtleonard

a year ago

Button-toggled voice notes in the iPhone Notes app are a godsend for taking measurements. Rather than switching your hands between probe/equipment and notes repeatedly, which sucks badly, you can just dictate your readings and maaaaybe clean out something someone said in the background. Over the last decade, the microphones + speech recognition became Good Enough for this. Wake-word/endpoint models still aren't there yet, and they aren't really close, but the stupid on/off button in the Notes app 100% solves this problem and the workflow is now viable.

I love it and I sincerely hope that "Apple Intelligence" won't kill the button and replace it with a sub-viable conversational model, but I probably ought to figure out local whisper sooner rather than later because it's probably inevitable.

jxcl

a year ago

This has inspired me.

I do a lot of stargazing and have experimented with voice memos for recording my observations. The problem of course is later going back and listening to the voice memo and getting organized information out of what essentially turns into me rambling to myself.

I'm going to try to use whisper + AI to transcribe my voice memos into structured notes.

alyandon

a year ago

I would be greatly interested in knowing how you set all that up if you felt like sharing the specifics.

neom

a year ago

This is exactly why I think the AI pins are a good idea. The Humane pin seems too big/too expensive/not quite there yet, but for exactly what you're doing, I would like some type of brooch.

hdjjhhvvhga

a year ago

> It feels super strange to talk to yourself

I remember the first lecture in the Theory of Communication class where the professor introduced the idea that communication by definition requires at least two different participants. We objected by saying that it can perfectly be just one and the same participant (communication is not just about space but also time), and what you say is a perfect example of that.

lukan

a year ago

"before having an LLM clean up my ramblings into organized notes and todo lists."

Which local LLM do you use?

Edit:

And self talk is quite a healthy and useful thing in itself, but avoiding it in public is indeed kind of necessary, because of the stigma

https://en.m.wikipedia.org/wiki/Intrapersonal_communication

vincvinc

a year ago

I was thinking about making this the other day. Would you mind sharing what you used?

racked

a year ago

What software did you use to set all this up? Kindof interested in giving this a shot myself.

wkat4242

a year ago

What do you use to run whisper locally? I don't think ollama can do it.

fivestones

a year ago

Can you give any more details on your setup? This sounds great

yieldcrv

a year ago

I found one that can isolate speakers; it's just okay at that.

pella

a year ago

Next year, devices equipped with AMD's Strix Halo APU will be available, capable of using ~96GB of VRAM across 4 relatively fast channels from a total of 128GB unified memory, along with a 50 TOPS NPU. This could partially serve as an alternative to the MacBook Pro models with M2/M3/M4 chips, featuring 128GB or 192GB of unified memory.

- https://videocardz.com/newz/amd-ryzen-ai-max-395-to-feature-...

aurareturn

a year ago

It will have around 250GB/s of bandwidth which makes it nearly unusable for 70b models. So the high amount of RAM doesn’t help with large models.
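Back-of-the-envelope, since token generation is roughly memory-bandwidth-bound (for a dense model, every generated token has to read all the weights once):

    # Rough upper bound on decode speed: bandwidth / bytes of weights read per token.
    weight_bytes = 70e9 * 0.5        # 70B params at ~4-bit quantization ≈ 35 GB
    bandwidth = 250e9                # 250 GB/s
    print(bandwidth / weight_bytes)  # ≈ 7 tokens/s ceiling, before any overhead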

jstummbillig

a year ago

Also, next year, there will be GPT 5. I find it fascinating how much attention small models get, when at the same time the big models just get bigger and prohibitively expensive to train. No leading lab would do that if they thought it a decent chance that small models were able to compete.

So who will be interested in a shitty assistant next year when you can have an amazing one, is what I wonder? Is this just the biggest cup of wishful thinking that we have ever seen?

wazdra

a year ago

I'd like to point out that llama 3.1 is not open source[1] (I was recently made aware of that fact by [2], when it was on the HN front page). While it's very nice to see a peak of interest in local, "open-weights" LLMs, this is an unfortunate choice of words, as it obscures the quite important differences between llama's license model and open source. The license question does not seem to be addressed at all in the article.

[1]: https://www.llama.com/llama3_1/license/

[2]: https://csvbase.com/blog/14

gptisms

a year ago

It's not an open source licence, but to Meta's credit it's also not a licence you have to sell your soul to use. No more than 700 million monthly active users without requesting a licence, and don't use the model for illegal purposes and follow an AUP. It's as good as open source if you're using the models locally. If you're serious about building a commercial product with llama the AUP is almost certainly going to line up with your own terms of service for your product. Compare this with other models that are also billed as 'open source':

yi: previously non-commercial but Apache 2.0 now?

deepseek: usage policy

larger gemma models: usage policy

databricks: similar to llama: no more than 700 million MAUs, usage policy, additional restrictions on using outputs to train other models

qwen: no more than 100 million MAUs

mistral: non-commercial

command-r: CC BY-NC

starling: CC BY-NC

There are a handful of niche models released under MIT/Apache but the norm is licences similar to or more restrictive than the Llama Community Licence, and I really doubt the situation would be better if Meta wasn't first.

>"open-weights" LLMs

I doubt this is the point you're making, but the training data really isn't useful even if it could be released under a permissive licence. Most models use similar datasets: reddit (no licence afaik, copyright belongs to comment authors), stackoverflow (CC BY-SA), wikipedia (CC BY-SA), Project Gutenberg (public domain?), previously books3 (books under copyright by publishers with more money and lawyers than reddit users), etc, with various degrees of filtering to remove harmful data. You can't do much with this much data unless you have millions of dollars worth of compute laying around, and you can't rebuild llama any more than any other company using the same data have 'rebuilt llama' - all models trained in a similar manner on the same data are going to converge in outputs eventually. Compare with Linux distributions, they all use the same packages but you're not going to get the same results.

sergiotapia

a year ago

that ship sailed 13 years ago dude.

albertgoeswoof

a year ago

I have a three-year-old M1 Max with 32GB RAM. Llama 8B runs at 25 tokens/sec, which is fast enough and covers 80% of what I need. On my Ryzen 5600H machine I get about 10 tokens/second, which is slow enough to be annoying.

If I get stuck on a problem, switch to chat gpt or phind.com and see what that gives. Sometimes, it’s not the LLM that helps, but changing the context and rewriting the question.

However I cannot use the online providers for anything remotely sensitive, which is more often than you might think.

Local LLMs are the future, it’s like having your own private Google running locally.

fsmv

a year ago

A small model necessarily is missing many facts. The large model is the one that has memorized the whole internet, the small one is just trained to mimic the big one.

You simply cannot compress the whole internet into under 10GB without throwing out a lot of information.

Please be careful about what you take as fact coming from the local model output. Small models are better suited to summarization.

staticman2

a year ago

I'm really curious what you are doing with an LLM that can be solved 80% of the time with a 8b model.

meiraleal

a year ago

We need browser- and OS-level (mobile) API integration with the local LLM.

leshokunin

a year ago

I like self hosting random stuff on docker. Ollama has been a great addition. I know it's not, but it feels on par with ChatGPT.

It works perfectly on my 4090, but I've also seen it work perfectly on my friend's M3 laptop. It feels like an excellent alternative for when you don't need the heavy weights, but want something bespoke and private.

I've integrated it with my Obsidian notes for 1) note generation 2) fuzzy search.

I've used it as an assistant for mental health and medical questions.

I'd much rather use it to query things about my music or photos than whatever the big players have planned.

ekabod

a year ago

Ollama is not a model, it is the software to run models.

exe34

a year ago

which model are you using? what size/quant/etc?

thanks!

Anunayj

a year ago

I recently experimented with running llama-3.1-8b-instruct locally on my consumer hardware, aka my Nvidia RTX 4060 with 8GB VRAM, as I wanted to experiment with prompting PDFs with a large context, which is extremely expensive with how LLMs are priced.

I was able to fit the model with decent speeds (30 tokens/second) and a 20k token context completely on the GPU.

For summarization, the performance of these models is decent enough. Unfortunately, in my use case I felt that using Gemini's free tier, with its multimodal capabilities and much better quality output, made running local LLMs not really worth it as of right now, at least for consumers.

mistrial9

a year ago

you moved the goalposts when you added 'multimodal' there; another thing is, no one reads PDF tables and illustrations perfectly, at any price AFAIK

RevEng

a year ago

I'm currently working on an LLM-based product for a large company that's used in circuit design. Our customers have very strict confidentiality requirements since the field is very competitive and they all have trade secret technologies that give them significant competitive advantages. Using something public like ChatGPT is simply out of the question. Their design environments are often completely disconnected from the public internet, so our tools need to run local models. Llama 3 has worked well for us so far and we're looking at other models too. We also prefer not being locked in to a specific vendor like OpenAI, since our reliance on the model puts us in a poor position to negotiate and the future of AI companies isn't guaranteed.

For my personal use, I also prefer to use local models. I'm not a fan of OpenAI's shenanigans and Google already abuses its customers data. I also want the ability to make queries on my own local files without having to upload all my information to a third party cloud service.

Finally, fine tuning is very valuable for improving performance in niche domains where public data isn't generally available. While online providers do support fine tuning through their services, this results in significant lock in as you have to pay them to do the tuning in their servers, you have to provide them with all your confidential data, and they own the resulting model which you can only use through their service. It might be convenient at first, but it's a significant business risk.

McBainiel

a year ago

> Microsoft used LLMs to write millions of short stories and textbooks in which one thing builds on another. The result of training on this text, Bubeck says, is a model that fits on a mobile phone but has the power of the initial 2022 version of ChatGPT.

I thought training LLMs on content created by LLMs was ill-advised but this would suggest otherwise

andai

a year ago

Look into Microsoft's Phi papers. The whole idea here is that if you train models on higher quality data (i.e. textbooks instead of blogspam) you get higher quality results.

The exact training is proprietary but they seem to use a lot of GPT-4 generated training data.

On that note... I've often wondered if broad memorization of trivia is really a sensible use of precious neurons. It seems like a system trained on a narrower range of high quality inputs would be much more useful (to me) than one that memorized billions of things I have no interest in.

At least at the small model scale, the general knowledge aspect seems to be very unreliable anyways -- so why not throw it out entirely?

kkielhofner

a year ago

Synthetic data (data from some kind of generative AI) has been used in some form or another for quite some time[0]. The license for LLaMA 3.1 has been updated to specifically allow its use for generation of synthetic training data. Famously, there is a ToS clause from OpenAI in terms of using them for data generation for other models but it's not enforced ATM. It's pretty common/typical to look through a model card, paper, etc and see the use of an LLM or other generative AI for some form of synthetic data generation in the development process - various stages of data prep, training, evaluation, etc.

Phi is another really good example, but that's already covered in the article.

[0] - https://www.latent.space/i/146879553/synthetic-data-is-all-y...

moffkalast

a year ago

As others point out, it's essentially distillation of a larger model to a smaller one. But you're right, it doesn't work very well. Phi's performance is high on benchmarks but not nearly as good in actual real world usage. It is extremely overfit on a narrow range of topics in a narrow format.

mrbungie

a year ago

I would guess correctly aligned and/or finely filtered synthetic data coming from LLMs may be good.

Mode collapse theories (and the simplified models used as proof of existence of said problem) assume affected LLMs are going to be trained with poor-quality LLM-generated batches of text from the internet (i.e. Reddit or other social networks).

sandwichmonger

a year ago

That's the number one way of getting mad LLM disease. Feeding LLMs to LLMs.

gugagore

a year ago

Generally (not just for LLMs) this is called student-teacher training and/or knowledge distillation.

staticman2

a year ago

There have been efforts to train small LLMs on bigger LLMs. Ever since Llama came out, the community has been creating custom fine-tunes this way using ChatGPT.

brap

a year ago

I think it can be a tradeoff to get to smaller models. Use larger models trained on the whole internet to produce output that would train the smaller model.

iJohnDoe

a year ago

> Microsoft used LLMs to write millions of short stories and textbooks

Millions? Where are they? Where are they used?

vessenes

a year ago

All this will be an interesting side note in the history of language models in the next eight months when roughly 1.5 billion iPhone users will get a local language model tied seamlessly to a mid-tier cloud based language model native in their OS.

What I think will be interesting is seeing which of the open models stick around and for how long when we have super easy ‘good enough’ models that provide quality integration. My bet is not many, sadly. I’m sure Llama will continue to be developed, and perhaps Mistral will get additional European government support, and we’ll have at least one offering from China like Qwen, and Bytedance and Tencent will continue to compete a-la Google and co. But, I don’t know if there’s a market for ten separately trained open foundation models long term.

I’d like to make sure there’s some diversity in research and implementation of these in the open access space. It’s a critical tool for humans, and it seems possible to me that leaders will be able to keep extending the gap for a while; when you’re using that gap not just to build faster AI, but do other things, the future feels pretty high volatility right now. Which is interesting! But, I’d prefer we come out of it with people all over the world having access to these.

jannyfer

a year ago

> in the next eight months when roughly 1.5 billion iPhone users will get a local language model tied seamlessly to a mid-tier cloud based language model native in their OS.

Only iPhone 15 Pro or later will get Apple Intelligence, so the number will be wayyy smaller.

darby_nine

a year ago

I expect people will just ship with their own model where the built-in one isn't sufficient.

When people describe it as a "critical tool" i feel like I'm missing basic information about how people use computers and interact with the world. In what way is it critical for anything? It's still just a toy at this point.

Archit3ch

a year ago

One use case I've found very convenient: partial screenshot |> minicpm-v

Covers 90% of OCR needs with 10% of the effort. No API keys, scripting, or network required.

wslh

a year ago

What's the current cost of building a DIY bare-bones machine setup to run the top LLaMA 3.1 models? I understand that two nodes are typically required for this. Has anyone built something similar recently, and what hardware specs would you recommend for optimal performance? Also, do you suggest waiting for any upcoming hardware releases before making a purchase?

atemerev

a year ago

405B is beyond homelab-scale. I recently obtained a 4x4090 rig, and I am comfortable running 70B and occasionally 128B-class models. For 405B, you need 8xH100 or better. A single H100 costs around $40k.
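The weight arithmetic alone makes that clear; a rough estimate (ignoring KV cache and activation overhead):

    # Approximate weight memory for a 405B-parameter model.
    params = 405e9
    print(params * 2 / 1e9)    # ~810 GB at fp16/bf16
    print(params * 1 / 1e9)    # ~405 GB at fp8 -- fits in 8xH100 (8 x 80 GB = 640 GB)
    print(params * 0.5 / 1e9)  # ~203 GB at ~4-bit quantization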

api

a year ago

I use ollama through a Mac app called BoltAI quite a bit. It’s like having a smart portable sci-fi “computer assistant” for research and it’s all local.

It is about the only thing I can do on my M1 Pro to spin up the fans and make the bottom of the case hot.

Llama3.1, Deepseek Coder v2, and some of the Mistral models are good.

ChatGPT and Claude top tier models are still better for very hard stuff.

drio

a year ago

It looks good. Thank you for sharing. Is anyone aware of a similar tool for Linux?

dockerd

a year ago

What specs do people here recommend to run small models like Llama3.1 or mistral-nemo, etc.?

Also, is it sensible to wait for the newer Mac, AMD, and Nvidia hardware releasing soon?

freeone3000

a year ago

M4s are releasing in probably a month or two; if you’re going Apple, it might be worth waiting for either those or the price drop on the older models.

noman-land

a year ago

You basically need as much RAM as the size of the model.
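A rough rule of thumb, assuming quantized GGUF-style weights and ignoring context overhead:

    # Approximate memory needed for the weights of a quantized model, in GB.
    def approx_gb(params_billion, bits_per_weight=4):
        return params_billion * bits_per_weight / 8

    print(approx_gb(8))   # ~4 GB  -> comfortable on a 16 GB machine
    print(approx_gb(70))  # ~35 GB -> realistically wants 48-64 GB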

andai

a year ago

What local models is everyone using?

The last one I used was Llama 3.1 8B which was pretty good (I have an old laptop).

Has there been any major development since then?

moffkalast

a year ago

Qwen 2.5 has just been released, with a surprising number of sizes. The 14B and 32B look pretty promising for their size class, but it's hard to tell yet.

demarq

a year ago

Nada to be honest. I keep trying every new model, and invariably go back to llama 8b.

Llama8b is the new mistral.

oneshtein

a year ago

codestral 22b, aya-23 35b, gemma2 27b-instruct.

simion314

a year ago

OpenAI's APIs for GPT and DALL-E have issues like non-determinism, plus their special prompt injection where they add to or modify your prompt (with no option to turn that off). That makes it impossible to do research, or to debug variations of things as a developer.

throwaway314155

a year ago

While that's true for their ChatGPT SaaS, the API they provide doesn't impose as many restrictions.

HexDecOctBin

a year ago

May as well ask here: what is the best way to use something like an LLM as a personal knowledge base?

I have a few thousand books, papers and articles collected over the last decade. And while I have meticulously categorised them for fast lookup, it's getting harder and harder to search for the desired info, especially in categories which I might not have explored recently.

I do have a 4070 (12 GB VRAM), so I thought that LLMs might be a solution. But trying to figure out the whats and hows has proven to be extremely complicated, what with the deluge of techniques (fine-tuning, RAG, quantisation) that may or may not be obsolete, too many grifters hawking their own startups with thin wrappers, and a general sense that the "new shiny object" is prioritised more than actual stable solutions to real problems.

routerl

a year ago

Imho, and I'm no expert, but this has been working well for me:

Segment the texts into chunks that make sense (i.e. into the lengths of text you'll want to find, whether this means chapters, sub-chapters, paragraphs, etc), create embeddings of each chunk, and store the resultant vectors in a vector database. Your search workflow will then be to create an embedding of your query, and perform a distance comparison (e.g. cosine similarity) which returns ranked results. This way you can now semantically search your texts.

Everything I've mentioned above is fairly easily doable with existing LLM libraries like langchain or llamaindex. For reference, this is an RAG workflow.
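A minimal sketch of the retrieval step, assuming sentence-transformers for the embeddings (model name and chunking are just examples; a real setup would use a proper vector database):

    # Embed chunks, embed the query, rank chunks by cosine similarity.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    chunks = ["chapter 1 text ...", "chapter 2 text ...", "a paper abstract ..."]
    chunk_vecs = model.encode(chunks, normalize_embeddings=True)

    query_vec = model.encode(["how does X relate to Y?"], normalize_embeddings=True)[0]
    scores = chunk_vecs @ query_vec          # cosine similarity, since vectors are normalized
    for i in np.argsort(scores)[::-1]:       # best matches first
        print(round(float(scores[i]), 3), chunks[i][:60])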

pimeys

a year ago

Has anybody found a good way to use ollama with an editor such as Zed to do things like "generate rustdoc for this method" etc.? I use ollama daily for a ton of things, but for code generation, completion and documentation 4o is still much better than any of the local models...

oneshtein

a year ago

Gemma2 27b instruct:

Generate documentation in Rust format for use in code.

  /// An enum representing possible errors during CRUD operations.
  #[derive(Error, Debug)]
  enum CrudError {
    /// Indicates an error occurred while interacting with the SQLite database.
    /// 
    /// This variant wraps a `rusqlite::Error`, providing details about the specific SQLite error.
    #[error("SQLite error: {0}")]
    SqliteError(rusqlite::Error),

    /// Indicates an error related to input/output operations.
    /// 
    /// This variant wraps a `std::io::Error`, giving information about the specific IO problem encountered.
    #[error("IO error: {0}")]
    IoError(std::io::Error),
  }

navbaker

a year ago

The Continue extension for VSCode is pretty good and has native connectivity to a local install of Ollama

shahzaibmushtaq

a year ago

I need to have two things of my own that work offline for privacy concerns and cost savings:

1. Local LLM AI models with GUI and command line

2. > Local LLM-based coding tools do exist (such as Google DeepMind’s CodeGemma and one from California-based developers Continue)

teleforce

a year ago

Recently on HN there was a repost of "Big data is dead" (2023), and the two HN posts have garnered a combined 800+ comments and 1000+ points, both very substantial in terms of interest in this controversial topic [1], [2]. I think someone will post the same topic and link again next year and probably get a similar number of comments and points.

In the comments section of the repost, I mentioned that big data is not dead but just having its AI-winter moment, and that big data is not only about volume (of storage) but also about memory requirements (of RAM) [3].

Fast forward a few months, and it seems the AI LLM wave is reviving the big data scenario: big data is far from dead and is very much healthy in the era of LLMs, cloud or local. Wait until IoT and machine-to-machine (M2M) based systems are in full effect, and big data will not just be surviving but thriving.

[1] Big data is dead (2023) - original

https://news.ycombinator.com/item?id=34694926

[2] Big data is dead (2023) - repost

https://news.ycombinator.com/item?id=40488844

[3] Comments on: Big data is dead (2023):

https://news.ycombinator.com/item?id=40489607

ein0p

a year ago

Lately I’ve been submitting the same requests to GPT 4o and Mistral Large running on my laptop or one of my own servers that has 2xA6000. I get about the same quality from both and probably would not be able to tell which is which in a blind test. There are a couple disadvantages to either of the two local solutions, however. Both of them are slower than 4o. Both of them require a substantial amount of very expensive hardware (which I happed to already have, but still). And on a laptop a single long generation can drain a few percent of the battery. Neural Engine, which would be more energy efficient cannot currently run models this large, and smaller ones do not do as well for the chatbot use case. I do also use smaller LLMs locally, for the more “narrow” tasks.

brap

a year ago

Some companies (OpenAI, Anthropic…) base their whole business on hosted closed source models. What’s going to happen when all of this inevitably gets commoditized?

This is why I’m putting my money on Google in the long run. They have the reach to make it useful and the monetization behemoth to make it profitable.

csmpltn

a year ago

There's plenty of competition in this space already, and it'll only get accelerated with time. There's not enough "moat" in building proprietary LLMs - you can tell by how the leading companies in this space are basically down to fighting over patents and regulatory capture (ie. mounting legal and technical barriers to scraping, procuring hardware, locking down datasets, releasing less information to the public about how the models actually work behind the scenes, lobbying for scary-yet-vague AI regulation, etc).

It's fizzling out.

The current incumbents are sitting on multi-billion dollar valuations and juicy funding rounds. This buys runtime for a good couple of years, but it won't last forever. There's a limit to what can be achieved with scraped datasets and deep Markov chains.

Over time, it will become difficult to judge what makes one general-purpose LLM be any better than another general-purpose LLM. A new release isn't necessarily performing better or producing better quality results, and it may even regress for many use-cases (we're already seeing this with OpenAI's latest releases).

Competitors will have caught up to each other, and there shouldn't be any major differences between Claude, ChatGPT, Gemini, etc - after all, they should all produce near-identical answers, given identical scenarios. The pace of innovation flattens out.

Eventually, the technology will become widespread, cheap and ubiquitous. Building a (basic, but functional) LLM will be condensed down to a course you take at university (the same way people build basic operating systems and basic compilers in school).

The search for AGI will continue, until the next big hype cycle comes up in 5-10 years, rinse and repeat.

You'll have products geared at lawyers, office workers, creatives, virtual assistants, support departments, etc. We're already there, and it's working great for many use-cases - but it just becomes one more tool in the toolbox, the way Visual Studio, Blender and Photoshop are.

The big money is in the datasets used to build, train and evaluate the LLMs. LLMs today are only as good as the data they were trained on. The competition on good, high-quality, up-to-date and clean data will accelerate. With time, it will become more difficult, expensive (and perhaps illegal) to obtain world-scale data, clean it up, and use it to train and evaluate new models. This is the real goldmine, and the only moat such companies can really have.

whimsicalism

a year ago

Their hope is to reach AGI and effective post-scarcity for most things that we currently view as scarce.

I know it sounds crazy but that is what they actually believe and is a regular theme of conversations in SF. They also think it is a flywheel and whoever wins the race in the next few years will be so far ahead in terms of iteration capability/synthetic data that they will be the runaway winner.

throwaway314155

a year ago

I don't have a horse in the race but wouldn't Meta be more likely to commoditize things given that they sort of already are?

aledalgrande

a year ago

Does anyone know of a local "Siri" implementation? Whisper + Llama (or Phi or something else), that can run shortcuts, take notes, read web pages etc.?

PS: for reading web pages I know there's voices integrated in the browser/OS but those are horrible

gardnr

a year ago

Edit: I just found this. I'll give it a try today: https://github.com/0ssamaak0/SiriLLama

---

Open WebUI has a voice chat but the voices are not great. I'm sure they'd love a PR that integrates StyleTTS2.

You can give it a Serper API Key and it will search the web to use as context. It connects to ollama running on a linux box with a $300 RTX 3060 with 12GB of VRAM. The 4bit quant of Llama 3.1 8B takes up a bit more than 6GB of VRAM which means it can run embedding models and STT on the card at the same time.

12GB is the minimum I'd recommend for running quantized models. The RTX 4070 Ti Super is 3x the cost but 7 times "faster" on matmuls.

The AMD cards do inference OK but they are a constant source of frustration when trying to do anything else. I bought one and tried for 3 months before selling it. It's not worth the effort.

I don't have any interest in allowing it to run shortcuts. Open WebUI has pipelines for integrating function calling. HomeAssistant has some integrations if that's the kind of thing you are thinking about.

xenospn

a year ago

Apple intelligence?

mrfinn

a year ago

It's kinda funny how nowadays an AI with 8 billion parameters is something "small". Especially when just two years back entire racks were needed to run something giving way worse performance.

atemerev

a year ago

IDK, 8B-class quantized models run pretty fast on commodity laptops, with CPU-only inference. Thanks to the people who figured out quantization and reimplemented everything in C++, instead of academic-grade Python.

pilooch

a year ago

I run a fine-tuned multimodal LLM as a spam filter (it reads emails as images). Game changer. Removes all the stuff I wouldn't read anyway, not only spam.

user

a year ago

[deleted]

nipponese

a year ago

Am I the only one seeing obvious ads in llama3 results?

Sophira

a year ago

I've not yet used any local AI, so I'm curious - what are you getting? Can you share examples?

tripzilch

a year ago

it doesn't fill me with confidence that an MD calls themselves an "ML researcher" when clearly they've only been "into ML" since LLMs recently became popular. I could believe that this MD became an "expert in using LLMs" in the past few years (like we all became), but this guy is training networks on medical data without experience in ML going back further than 2022. a little knowledge is dangerous; there's a whole bunch of boring stuff I'd want someone to learn about ML before going "hey, LLMs are cool for medical use" ...

alanzhuly

a year ago

For anyone looking for a simple alternative for running local models beyond just text, Nexa AI has built an SDK that supports text, audio (STT, TTS), image generation (e.g., Stable Diffusion), and multimodal models! It also has a model hub to help you easily find local models optimized for your device.

Nexa AI local model hub: https://nexaai.com/ Toolkit: https://github.com/NexaAI/nexa-sdk

It also comes with a built-in local UI to get started with local models easily and OpenAI-compatible API (with JSON schema for function calling and streaming) for starting local development easily.

You can run the Nexa SDK on any device with a Python environment—and GPU acceleration is supported!
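
I haven't verified Nexa's endpoint details, but if the server is OpenAI-compatible, the usual pattern is to point the standard openai client at the local URL. The base URL, port, and model name below are guesses for illustration, not documented values:

    # pip install openai
    from openai import OpenAI

    # Any OpenAI-compatible local server works the same way: swap the base_url.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

    resp = client.chat.completions.create(
        model="llama3.1-8b-instruct",  # whatever tag the local server exposes
        messages=[{"role": "user", "content": "Give me three test cases for a date parser."}],
    )
    print(resp.choices[0].message.content)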

Local LLMs, and especially multimodal local models, are the future. They are the only way to make AI accessible (cost-efficient) and safe.

atentaten

a year ago

I'm interested in running locally, but I haven't found consistent advice on hardware specs for optimal performance. I would like to build a server with the best GPU and tons of RAM to run and experiment with these models.

Havoc

a year ago

Either get a Mac with loads of mem or build a rig with a 3090 or three

swah

a year ago

I saw a demo a few months back (and lost it) of LLM autocompletion with latency of just a few milliseconds; it opened up a whole new way to explore it... any ideas?

shrubble

a year ago

The newest laptops are supposed to have 40-50 TOPS performance with the new AI/NPU features. Wondering what that will mean in practice.

jsemrau

a year ago

The Mistral models are not half bad for this.

trash_cat

a year ago

I think it REALLY depends on your use case. Do you want to brainstorm, clear out some thoughts, search or solve complex tasks?

sandspar

a year ago

What advantages do local models have over exterior models? Why would I run one locally if ChatGPT works well?

pieix

a year ago

1) Offline use — pretty cool to be able to debug technical problems while flying (or otherwise off grid) with a local LLM, and current 8B models are usually good enough for the first line of questions that you would otherwise have googled.

2) Privacy

3) Removing safety filters — there are some great “abliterated” models out there that have had their refusal behavior removed. Running these locally and never having your request refused due to corporate risk aversion is a very different experience to calling a safety-neutered API.

Depending on your use case some, all, or none of these will be relevant, but they are undeniable benefits that are very much within reach using a laptop and the current crop of models.

miguelaeh

a year ago

I am betting on local AI and building offload.fyi to make it easy to implement in any app

create-username

a year ago

There's no small AI that I know of that masters ancient Greek, Latin, English, German, and French and that I can run on my 18 GB MacBook Pro.

Please correct me if I'm wrong. It would make my life slightly more comfortable

sparkybrap

a year ago

I agree. Even bilingual (English + 1) small models would be very useful for processing localized data, e.g. English-French, English-German, etc.

Right now the small models (Llama 8B) can't handle this type of task, although they could if they were trained on bilingual data.

binary132

a year ago

I really get the feeling with these models that what we need is a memory-first hardware architecture that isn't necessarily the fastest at number crunching... that seems like it shouldn't be a terrifically expensive product.

jmount

a year ago

I think this is a big deal. In my opinion, many money-making, stable AI services are going to be deliberately limited in ability and confined to narrow domains. One doesn't want one's site help bot answering political questions. So this could really pull much of the revenue away from AI/LLMs as a service.

dharma1

a year ago

What’s the best local codegen model that gets close to Claude 3.5? Deepseek coder?

Don't want to pay for the Claude API, and the Cursor sub blows past its limits quickly. Gonna get an M4 Mac with maxed-out RAM when they come out next month.

stainablesteel

a year ago

I think this is laughable. The only good 8B models are the Llama ones; Phi is terrible, and even Codestral can barely code, and that's 22B IIRC.

But truthfully the 8B models just aren't that great yet. They can provide some decent info if you're just investigating things, but a Google search is still faster.

diggan

a year ago

Summary: It's cheaper, safer for handling sensitive data, easier to reproduce results (the only way to be 100% sure results are reproducible, even, since "external" models can change at any time), higher degree of customization, no internet connectivity requirements, more efficient, more flexible.

bionhoward

a year ago

No ridiculous prohibitions on training on logs…

Man, imagine being OpenAI and flushing your brand down the toilet with an explicit customer noncompete rule which totally backfires and inspires 100x more competition than it prevents

roywiggins

a year ago

Llama's license does forbid it:

"Llama 3.1 materials or outputs cannot be used to improve or train any other large language models outside of the Llama family."

https://llamaimodel.com/commercial-use/

jclulow

a year ago

I'm not sure why anybody would respect that licence term, given the whole field rests on the rapacious misappropriation of other people's intellectual property.

ronsor

a year ago

Meta dropped that term, actually, and that's an unofficial website.

sigmoid10

a year ago

>If you use the Llama Materials to create, train, fine tune, or otherwise improve an AI model, which is distributed or made available, you shall also include “Llama 3” at the beginning of any such AI model name.

The official llama 3 repo still says this, which is a different phrasing but effectively equal in meaning to what the commenter above said.

alexander2002

a year ago

An AI chip on laptop devices would be amazing!

viraptor

a year ago

It's pretty much happening already. Apple devices have MPS. Both new Intels and Snapdragon X have some form of NPU.

moffkalast

a year ago

It would be great if any NPU that currently exists were any good at LLM acceleration, but they all have really bad memory bottlenecks.

ta988

a year ago

They already exist. Nvidia GPUs on laptops, M series CPUs from Apple, NPUs...

aurareturn

a year ago

First NPU arrived 7 years ago in an iPhone SoC. GPUs are also “AI” chips.

Local LLM community has been using Apple Silicon Mac GPUs to do inference.
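
For the GPU route, offloading with the llama.cpp Python bindings mentioned upthread is just a parameter change (assuming a Metal-enabled build; the model path is a placeholder):

    from llama_cpp import Llama

    # n_gpu_layers=-1 offloads every layer to the GPU (Metal on Apple Silicon).
    llm = Llama(
        model_path="llama-3.1-8b-instruct-Q4_K_M.gguf",  # placeholder path
        n_ctx=4096,
        n_gpu_layers=-1,
    )
    out = llm("Q: Why is unified memory handy for local LLMs? A:", max_tokens=128)
    print(out["choices"][0]["text"])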

I’m sure Apple Intelligence uses the NPU and maybe the GPU sometimes.

theodorthe5

a year ago

Local LLMs are terrible compared to Claude/ChatGPT. They are useful as APIs for applications: much cheaper than paying for OpenAI services, and they can be fine-tuned to do many useful (and less useful, even illegal) things. But for the casual user, they suck compared to the very large LLMs OpenAI/Anthropic deliver.

78m78k7i8k

a year ago

I don't think local LLMs are being marketed "for the casual user", nor do I think the casual user will care at all about running LLMs locally, so I'm not sure why this comparison matters.

maxnevermind

a year ago

Yep, unfortunately those local models are noticeably worse. Also, models are getting bigger, so even if a local basement rig for a higher-quality model is possible right now, that might not be so in the future. And Zuck and others might stop releasing their weights for next-gen models; then what? Just hope they plateau? What if they don't?

123yawaworht456

a year ago

They are the only thing you can use if you don't want to, or aren't allowed to, hand over your data to US corporations and intelligence agencies.

Every single query to ChatGPT/Claude/Gemini/etc. will be used for any purpose, by any party, at any time. Shamelessly so, because this is the new normal. Welcome to 2024. I own nothing, have no privacy, and life has never been better.

>(and less useful, even illegal) things

The same illegal things you can do with Notepad, or a pencil and a piece of paper.