These are the text-inference issues I was alluding to. We had several hurdles to overcome: (1) LLMs are trained on little-endian machines, while the N64's MIPS CPU is big-endian; (2) we had Python-to-C porting issues; (3) we had quantization issues. All are being resolved.
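For the curious, the endian fix boils down to byte-swapping the weights after loading them; a minimal sketch, assuming 16-bit quantized weights (names are illustrative, not the exact repo code):

```c
#include <stdint.h>
#include <stddef.h>

/* Weights exported from a little-endian PC must be byte-swapped before
 * the N64's big-endian VR4300 reads them. Illustrative sketch only. */
static inline int16_t swap16(int16_t v)
{
    uint16_t u = (uint16_t)v;
    return (int16_t)((u >> 8) | (u << 8));
}

/* Swap a whole 16-bit weight buffer in place after loading it from ROM. */
void weights_to_host_endian(int16_t *w, size_t count)
{
    for (size_t i = 0; i < count; i++)
        w[i] = swap16(w[i]);
}
```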
This is a tech demo to honor LOZ, and the code can also be used by N64 devs to add AI-style NPCs in the future. So did we achieve it? Yes: we are the first to do LLM inference on the N64. I am just trying to give you guys a proper video.
Scott
It totally is. The fact that this post has gotten this many upvotes is appalling.
Just wait, sir. We are indeed doing inference on the N64. We had serious issues with the text output; I am almost done resolving them.
It's best to flag this fake garbage shit and move on.
I tried to build this but it's missing the weights.bin file and my computer is too weak to generate it. Can you add it to the repo?
Uploading weights.bin now. It's really meant for you to generate your own LLM, but we are uploading it. They are ripping on it, but they didn't check the code themselves. This is a tech demo; it's not about graphics, it's about the LLM inferring on the hardware lol.
Honest Limitations
819K parameters. Responses are short and sometimes odd. That's expected at this scale with a small training corpus. The achievement is that it runs at all on this hardware.
Context window is 64 tokens. Prompt + response must fit in 64 bytes.
No memory between dialogs. The KV cache resets each conversation.
Byte-level vocabulary. The model generates one ASCII character at a time (see the sketch below).
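To make those limits concrete, a byte-level generation loop under the 64-byte window could look like the sketch below; model_next_byte() is a hypothetical stand-in for the model's forward pass, not the actual nano_gpt.c API:

```c
#include <stdint.h>

#define CTX_LEN 64  /* prompt + response share this one 64-byte window */

uint8_t model_next_byte(const uint8_t *ctx, int len);  /* hypothetical */

/* Byte-level decoding: each "token" is a single ASCII byte, appended to
 * the context until the window fills or the model emits a terminator. */
int generate_reply(const char *prompt, char *out, int out_cap)
{
    uint8_t ctx[CTX_LEN];
    int n, written = 0;

    for (n = 0; prompt[n] && n < CTX_LEN; n++)
        ctx[n] = (uint8_t)prompt[n];

    while (n < CTX_LEN && written + 1 < out_cap) {
        uint8_t next = model_next_byte(ctx, n);
        if (next == '\0')
            break;              /* model signals end of reply */
        ctx[n++] = next;
        out[written++] = (char)next;
    }
    out[written] = '\0';
    return written;
}
```

Because every token is one byte, the 64-token window and the 64-byte budget are the same constraint.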
Future Directions
These are things we're working toward — not current functionality:
RSP microcode acceleration — the N64's RSP has 8-lane SIMD (VMULF/VMADH); offloading matmul would give an estimated 4–8× speedup over scalar VR4300
Larger model — with the Expansion Pak (8MB total), a 6-layer model fits in RAM
Richer training data — more diverse corpus = more coherent responses
Real cartridge deployment — EverDrive compatibility, real hardware video coming
Why This Is Real
The VR4300 was designed for game physics, not transformer inference. Getting Q8.7 fixed-point attention, FFN, and softmax running stably at 93MHz required:
Custom fixed-point softmax (bit-shift exponential to avoid overflow)
Q8.7 accumulator arithmetic with saturation guards
Soft-float compilation flag for float16 block scale decode
Alignment-safe weight pointer arithmetic for the ROM DFS filesystem
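To make the first two items concrete, here is a minimal sketch of saturating Q8.7 arithmetic and a bit-shift exponential; it illustrates the technique, not nano_gpt.c's exact code:

```c
#include <stdint.h>

/* Q8.7 fixed point: 1 sign bit, 8 integer bits, 7 fractional bits.
 * Illustrative sketch of the techniques above, not the repo's code. */
typedef int16_t q8_7;
#define Q8_7_ONE 128  /* 1.0 in Q8.7 */

/* Saturating multiply: widen to 32 bits, shift the product's 14
 * fractional bits back to 7, then clamp into the 16-bit range. */
static q8_7 q8_7_mul_sat(q8_7 a, q8_7 b)
{
    int32_t p = ((int32_t)a * b) >> 7;
    if (p > INT16_MAX) return INT16_MAX;
    if (p < INT16_MIN) return INT16_MIN;
    return (q8_7)p;
}

/* Bit-shift exponential for softmax. The row max is subtracted first,
 * so x <= 0: the integer part of x*log2(e) becomes a pure right shift
 * and a coarse linear term covers the fraction, so nothing can overflow. */
static int32_t exp_shift(q8_7 x)
{
    int32_t t = ((int32_t)x * 185) >> 7;       /* x * log2(e); 185 ~ 1.4427*128 */
    int shift = -(t >> 7);                     /* integer part, as a right shift */
    int32_t frac = t & 0x7F;                   /* fractional part in [0,128) */
    int32_t m = Q8_7_ONE + ((frac * 89) >> 7); /* 2^f ~ 1 + f*ln2; 89 ~ 0.693*128 */
    return (shift >= 31) ? 0 : (m >> shift);
}
```

Subtracting the row max before exponentiating is what makes the shift-only form safe: every exponent is non-positive, so values only ever shrink.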
The inference code is in nano_gpt.c. The training script is train_sophia_v5.py. Build it yourself and verify.
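On the soft-float item: the VR4300 has no float16 hardware, so the half-precision block scales have to be decoded in integer code. A decoder along these lines is the kind of routine that flag enables (an illustration, not the repo's exact routine):

```c
#include <stdint.h>
#include <string.h>

/* Illustrative float16 -> float32 decode for quantization block scales. */
static float f16_to_f32(uint16_t h)
{
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    int32_t  exp  = (h >> 10) & 0x1F;
    uint32_t man  = h & 0x3FF;
    uint32_t bits;

    if (exp == 0 && man == 0) {
        bits = sign;                              /* signed zero */
    } else if (exp == 0) {
        while ((man & 0x400) == 0) {              /* subnormal: renormalize */
            man <<= 1;
            exp--;
        }
        man &= 0x3FF;
        bits = sign | ((uint32_t)(exp + 127 - 14) << 23) | (man << 13);
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | (man << 13);  /* inf / NaN */
    } else {
        bits = sign | ((uint32_t)(exp + 127 - 15) << 23) | (man << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);                  /* type-pun safely */
    return f;
}
```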
The sgai_rsp_matmul_q4() stub is planned for RSP microcode:
DMA Q4 weight tiles into DMEM (4KB at a time)
VMULF/VMADH vector multiply-accumulate for 8-lane dot products
Estimated 4-8× speedup over scalar VR4300 inference
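In the meantime, a scalar reference of what that offload computes might look like this; the 4-bit packing and names here are assumptions, not the repo's actual weight format:

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar reference for the planned RSP offload. Q4 weights are packed
 * two per byte and streamed in 4KB tiles, mirroring the DMEM limit.
 * On the RSP, VMULF/VMADH would run this inner loop across 8 lanes. */
#define DMEM_TILE_BYTES 4096

/* Dot product of one packed-Q4 weight tile (signed nibbles, -8..7)
 * against Q8.7 activations; act must hold 2*tile_bytes values, since
 * each byte carries two weights. Accumulator stays in 32 bits. */
static int32_t dot_q4_tile(const uint8_t *tile, size_t tile_bytes,
                           const int16_t *act)
{
    int32_t acc = 0;
    for (size_t i = 0; i < tile_bytes; i++) {
        int lo = (int)(tile[i] & 0x0F) - 8;  /* low nibble  */
        int hi = (int)(tile[i] >> 4)   - 8;  /* high nibble */
        acc += lo * (int32_t)act[2 * i];
        acc += hi * (int32_t)act[2 * i + 1];
    }
    return acc;
}
```

Streaming 4KB tiles through a loop like this, with the microcode doing the multiply-accumulates, is where the estimated 4-8× comes from.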
----
rsp is the gift that keeps on giving; such a forward-looking architecture (shame about the rambus latency tho)
We are going to use the RSP's 128-bit SIMD soon, but it only has 4KB of addressable RAM (DMEM), so the matmul offload has to happen in small chunks!
that's such cool work; i wish i could get paid to do stuff like this, more power to you all ^^
I am doing this for zero dollars; I am a self-funded AI research lab. So when people diss me I get a little jaded, but then I remember I am doing cool stuff, even if others don't see it. That's enough for me!
The readme says:
> This isn't just a tech demo — it's a tool for N64 homebrew developers. Running an LLM natively on N64 hardware enables game mechanics that were impossible in the cartridge era:
> AI analyzes play style and adjusts on the fly
> NPCs that remember previous conversations and reference past events
> In-game level editors where you describe what you want to build
...anyone who has ever used very small language models before should see the problem here. They're fun and interesting, but not exactly, um, coherent.
The N64 has a whopping 8 megabytes (!) of memory, and that's with the Expansion Pak!
I'm kind of confused, especially since there are no demonstration videos. Is this, um, real? The repository definitely contains source code for something.
I think the source code in the GitHub repo generates the ROM in the corresponding screenshots, but it seems quite barebones.
It feels very much like it's cobbled together from the libdragon examples directory. For example, they use hardware acceleration for the 2D sprites, but then write fixed-width text to the framebuffer with software rendering.
Partially correct. The value is not the game interface right now; it's proof that you can do actual LLM inference on the hardware. The surprise I am developing is a bit bigger than this; I just have to get the LLM outputs right first!
I normally don't write comments like this, but... this title was extremely challenging to parse.
The repo description on GitHub would have been fine:
> World's First LLM-powered Nintendo 64 Game — nano-GPT running on-cart on a 93MHz VR4300
Hey guys, I had an endianness mess and nano-LLM text issues, but it's resolved. I'm about to post real proof on emulator and real hardware!
Yes, it runs on an emulator. I am fixing the endianness text issue in the LLM output right now. And the surprise is coming soon. Happy 40th, Zelda!
Cool, is there maybe a video demonstrating this?
Is there anywhere we can test the LLM output without loading it onto an N64?
Curious what we can get out of those constraints.