I have written gemma3 inference in pure C

65 points, posted 12 days ago
by robitec97

16 Comments

austinvhuang

10 days ago

My first implementation of gemma.cpp was kind of like this.

There's such a massive performance differential vs. SIMD though that I learned to appreciate SIMD (via Highway) as one sweet spot of low-dependency portability that sits between C loops and the messy world of GPUs + their fat tree of dependencies.

If anyone wants to learn the basics: whip out your favorite LLM pair programmer and ask it to help you study the kernels in the ops/ directory of gemma.cpp:

https://github.com/google/gemma.cpp/tree/main/ops
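To make the gap concrete, here's a minimal sketch of the same dot product as a plain C loop and as an 8-wide SIMD loop (using raw AVX intrinsics rather than Highway, just to illustrate the idea; the function names and the n-divisible-by-8 assumption are mine):

    #include <immintrin.h>  /* AVX intrinsics; assumes an x86-64 target */

    /* Scalar: one multiply-add per iteration. */
    float dot_scalar(const float *a, const float *b, int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

    /* SIMD: eight multiply-adds per iteration (assumes n % 8 == 0). */
    float dot_avx(const float *a, const float *b, int n) {
        __m256 acc = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 8)
            acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(a + i),
                                                   _mm256_loadu_ps(b + i)));
        float lanes[8];
        _mm256_storeu_ps(lanes, acc);
        return lanes[0] + lanes[1] + lanes[2] + lanes[3]
             + lanes[4] + lanes[5] + lanes[6] + lanes[7];
    }

Highway gives you essentially the second version, but portable across instruction sets instead of hard-coded to AVX.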

janwas

10 days ago

:D Your code was nicely written and it was a pleasure to port to SIMD because it was already very data-parallel.

rao-v

10 days ago

I'm really charmed by this project (I know there are a few like it).

In particular, it's got a single ~600-line file (https://github.com/robitec97/gemma3.c/blob/main/gemma3_kerne...) with a clear, straightforward implementation of every major function used to run inference on Google's models, from GELU to RoPE.
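To give a flavor of how small these kernels are, here's what a GELU kernel looks like in plain C (the common tanh approximation; a generic sketch, not the repo's exact code):

    #include <math.h>

    /* GELU, tanh approximation:
       0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))) */
    float gelu(float x) {
        const float k = 0.7978845608f;  /* sqrt(2/pi) */
        return 0.5f * x * (1.0f + tanhf(k * (x + 0.044715f * x * x * x)));
    }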

I'm curious how many more functions you'd need to add to have full coverage of every publicly available LLM innovation (e.g. QK-Norm from Qwen3, SwiGLU, etc.).
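Most of those would be small additions. SwiGLU, for instance, is just an elementwise gate over two projections (a generic sketch under my own naming, not from the repo):

    #include <math.h>

    /* SwiGLU feed-forward gate: out[i] = silu(gate[i]) * up[i],
       where silu(x) = x * sigmoid(x) = x / (1 + exp(-x)). */
    void swiglu(const float *gate, const float *up, float *out, int n) {
        for (int i = 0; i < n; i++) {
            float g = gate[i];
            out[i] = (g / (1.0f + expf(-g))) * up[i];
        }
    }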

Obviously llama.cpp has a much bigger kernel library, but it's lovely to see everything in one clean file.

w4yai

10 days ago

> It proves that modern LLMs can run without Python, PyTorch, or GPUs.

Did we need any proof of that?

jdefr89

10 days ago

Python and PyTorch both ultimately call out to C libraries… I don't get what the author means by "proving LLMs can run without Python and PyTorch" at all. Seems like they don't understand the basic fundamentals here…

jasonjmcghee

10 days ago

I guess llama.cpp isn't quite as popular as I had assumed.

christianqchung

10 days ago

A bizarre claim like that would be what happens when you let an LLM write the README without reading it first.

skybrian

10 days ago

Knowing the performance is interesting. Apparently it's 1-3 tokens/second.

tolerance

10 days ago

I imagine so regarding GPUs, right? If this is a legitimate project, then doesn't it provide a proof of concept for the performance constraints that relate to them? Couldn't the environmentally concerned take this as an indicator that the technology can progress without relying on as much energy as is potentially spent now? Shouldn't researchers in the industry be thinking of ways to prevent the future capabilities of the technology from outrunning the capacity of the infrastructure?

I know very little about AI but these are things that come to mind here for me.

pacman1337

9 days ago

Anyone using this model for something useful? For now I only have use cases for top-performing models...

behnamoh

10 days ago

But why tho? Next Gemma is coming and no one uses Gemma 3 in prod anyway.

uncognic

10 days ago

I think the /* */ single-line comments are a pretty good indication.

NitpickLawyer

10 days ago

> no one uses Gemma 3 in prod anyway.

Umm, we do. It's still one of the best for EU-country support/help chatbot use cases. It's got good (best?) multilingual support OOTB, it's very "safe" (won't swear, won't output Chinese characters, etc.), and it's pretty fast.

austinvhuang

10 days ago

I don't have firsthand knowledge, but r/SesameAI seems to believe the Maya/Miles products are based on a Gemma 3 backbone.

data-ottawa

10 days ago

Gemma 3 is probably the best-supported fine-tunable model.