The MilkV Jupiter 2/SpacemiT K3 (RISC-V vector compute)

27 points, posted 2 days ago
by rcarmo

6 Comments

monster_truck

3 hours ago

These seem pretty neat, and the perf/$ or perf/W is interesting, but it looks like the 32gb version is sold out.

Is the LPDDR5 soldered or can you upgrade it? E: Looks like it's soldered, I wonder what the IMC(?) is actually good for

jackxlau

2 days ago

The handshake process to unlock the A100 cores is pretty interesting; it reminds me of driving CPU cores like a GPU command queue.

brucehoult

2 days ago

See:

https://github.com/brucehoult/k3_ai

Or my longer top level comment.

Unlike a GPU or NPU, you can just run all your normal RISC-V Linux programs on the AI cores. Bash, gcc, emacs, nodejs ... whatever you want. It's an extra 40% of scalar processing power, for free.

    Build Linux kernel 7503345ac5f5
    
    14m25.56s SpacemiT K3 8 X100 cores plus 8x A100 cores distcc
    18m3.871s SpacemiT K3 8 X100 cores @2.4 GHz
    19m12.787s i9-13900HX, 24C/32T @5.4 GHz, riscv64/ubuntu docker
    39m23.187s SpacemiT K3 8 A100 cores @2.0 GHz
    42m12.414s Milk-V Megrez 4 P550 cores @1.8 GHz
    67m35.189s VisionFive 2, 4 U74 cores @1.5 GHz
    70m57.001s LicheePi 3A, 8 X60 cores @1.6 GHz
    88m04.185s LicheePi 4A, 4 C910 cores @1.85 GHz
The A100 cores, all by themselves, give more normal processing power (e.g. for `gcc`) than any previous RISC-V SBC except the $2500 64 core Milk-V Pioneer.

jauntywundrkind

3 hours ago

Pentium 4 (2000) was released with 3.2GB/s of memory bandwidth, and scaled to 6.4GB/s over the years. That was not a chip to be proud of, but it provides a snapshot, a reference point in time to compare against. Having 3GB/s memory bandwidth here is... surprising. Based on the single- vs multi-core scores looking so lopsided, it sure seems likely. Having an "AI" inference chip with such bandwidth is wild. Compare the Cix P1 / Orange Pi 6: its ~42 GB/s compares well to the P4's L2 cache speed! Wow. RK3588 real world will show ~22GB/s, RPI5 17GB/s.

NVMe reads were faster! (Some interesting potential wins there, assuming you can get data from NVMe onto the core without going through main memory, a feature available since Sandy Bridge-EP (2011), in the form of Data Direct IO aka DDIO). I crack jokes about "PCIe speed ahead", but that's seemingly real here (at huge cost to latency, which CXL promises to remedy).

There is a non-zero chance the main cores cannot saturate what the memory controller can do, and that the AI cores have some reserved bandwidth to themselves. I doubt it's going to double the memory bandwidth.

One absolute ecosystem gem from this article that I didn't know before: the fact that Orange Pi 6 uses CrosEC, the embedded controller for Chromebooks (RIP I guess?). I wonder if this is the newer Zephyr IoT version (awesome, also underlies Framework's new embedded controllers) or the older legacy version of CrosEC. It's not spoken of flatteringly in this implementation, but the borrowing of firmware from a place I didn't expect is super notable to me! And there's good upstream kernel support, so it makes sense! https://chromium.googlesource.com/chromiumos/platform/ec/+/H...

One architectural nit I need to dig into that's interesting: the AI cores appear to have shared AI units. This reminds me a lot of AMD Bulldozer (2011), which had semi-independent CPUs but a shared FPU. It was an interesting chip (still haven't disposed of my old FX-8320 server), but not well loved.

Really appreciate the dive into the matrix cores. That's going to take more time for me to look at, but: thanks. I notice the architecture diagram says all cores have AI instructions, not just the A100's. Presumably it's the same instruction set/features?

The memory bandwidth situation here feels so off. We've lived in a world where it's a battle for cores, where how many cores one could ship made chip empires rise and fall. Today, the memory bandwidth wars are on, and supplies are scarce. This looks like a fascinating board with amazing capabilities, but wow, that lack of memory bandwidth here is most surprising.

brucehoult

2 days ago

> That documentation also seems to be the original source of the marketing claims that the K3 provides 60 TOPS of AI compute and can run 30B models at over 10 tokens/s. Well, sort of– as another spoiler, I can share that I hit a hard cap at an effective 3B

I'm running Qwen3-Coder-30B-A3B-Instruct-Q5_K_M.gguf on mine ... picked kind of at random from a web page as I'm a complete n00b at running local LLMs, have never used anything other than ChatGPT or (mostly) Grok.

I get 6-7 tok/s, which is slow for throwing around program code, but fine for general knowledge queries. It's a reasonable speed to read along as it outputs. Simple queries start to give output in about 2 seconds.

e.g. I dunno ..

    > What does auipc do?
    
    `auipc` stands for "Add Upper Immediate to PC" and is a RISC-V instruction that:
    
    ## What it does:
    - Takes a 20-bit immediate value and adds it to the program counter (PC)
    - The immediate is sign-extended and shifted left by 12 bits
    - The result is stored in a destination register
    
    ## Syntax:
    ```
    auipc rd, imm
    ```
    
    ## Key characteristics:
    - **PC-relative addressing**: Uses the current program counter as a base
    - **12-bit offset**: The immediate is shifted left by 12 bits, so it's effectively a 32-bit value
    - **No memory access**: Only performs arithmetic, no memory operations
    - **Used for**: Creating 32-bit immediate values, setting up base addresses for other instructions
    
    ## Example:
    ```assembly
    auipc x5, 0x1000    # x5 = PC + (0x1000 << 12)
    ```
    
    ## Common uses:
    - Setting up base addresses for load/store instructions
    - Creating 32-bit immediate values when the standard `li` instruction can't handle the full range
    - Jump table setup
    - Address calculation for large data structures
    
    The instruction is particularly useful in RISC-V because it allows for efficient 32-bit address generation while maintaining the 16-bit instruction size constraint.
    
    [ Prompt: 20.6 t/s | Generation: 6.5 t/s ]
Output started in about 2 seconds.

    > what is the airspeed velocity of an unladen swallow?
    
    The classic answer to this question comes from Monty Python's "Holy Grail" - when asked about the airspeed velocity of an unladen swallow, the character replies "What do you mean? An African or European swallow?"
    
    But in reality, the airspeed velocity depends on the species:
    
    **European Swallow (Hirundo rustica):**
    - Typical flight speed: 15-20 mph (24-32 km/h)
    - Maximum speed: around 25 mph (40 km/h)
    
    **African Swallow (Hirundo sp.):**
    - Similar speeds to European swallows, though some species can reach up to 30 mph (48 km/h)
    
    The "unladen" part refers to the swallow not carrying anything, which would slow it down. Swallows are quite agile birds that can fly at impressive speeds while maneuvering through the air.
    
    So while the Python reference is the more famous answer, the real-world airspeed of an unladen swallow is roughly 15-25 mph, depending on the specific species and conditions.
    
    [ Prompt: 25.5 t/s | Generation: 6.6 t/s ]
Again, output starts in about two seconds.

This is offline, no internet, and uses 14W while running all 8 A100 "AI" cores at max.

Is this useful? I mean, for something, right?

I asked it to review https://github.com/brucehoult/trv which is a total of 320 lines of code (I used `/read` on a tar file containing the two code files). It thought for 22 minutes before output started and then spent 8 minutes outputting comments at just over 6.5 tok/s.

Nothing there to scare Claude, but 30 minutes total is still faster than asking a colleague for a code review, and probably more comprehensive too. And it did it on about 0.25 cents of electricity.
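
As a rough check on that electricity figure, here is a small sketch. It assumes the review ran at the 14W quoted above for the full 30 minutes (the power draw during the review itself is not stated, so that is an assumption), and it derives the energy used and the electricity price that 0.25 cents would imply.

```python
# Assumption: the 30-minute review drew the same 14 W quoted for LLM inference.
POWER_W = 14.0
MINUTES = 22 + 8          # thinking time + output time, from the comment
COST_CENTS = 0.25         # cost figure quoted in the comment

energy_kwh = POWER_W * (MINUTES / 60) / 1000
implied_price = COST_CENTS / energy_kwh   # cents per kWh

print(f"energy: {energy_kwh * 1000:.1f} Wh")
print(f"implied price: {implied_price:.1f} cents/kWh")
```

Under that assumption the run uses about 7 Wh, so 0.25 cents corresponds to a tariff of roughly 36 cents per kWh; a cheaper tariff would make the review cost even less.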

> Turns out getting a thread onto the A100 cores requires a two-step handshake:
>
> write the thread’s TID to /proc/set_ai_thread (a kernel interface that unlocks scheduling on cores 8–15 for that specific thread)
>
> then call sched_setaffinity to pin it.

If you want to just run arbitrary Linux programs on the A100 cores, I wrote a small assembly language launcher which does the above PID writing and then EXECs the thing you really want.

    # just run a single program on the A100 cores
    ai as hello.s -o hello.o
    
    # same thing but maybe 1ms faster
    aix /usr/bin/as hello.s -o hello.o
    
    # run a whole build. All processes started by `make` will run on the A100 cores.
    ai make -j8 test
    
    # start a shell on the A100 cores. All programs run from it will be run only on the A100 cores
    ai bash
https://github.com/brucehoult/k3_ai
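
For readers who prefer to see the two-step handshake spelled out, here is a minimal Python sketch of the idea. It is not the assembly launcher from the repository above. The path `/proc/set_ai_thread` and the core range 8-15 come from the quoted article; the exact text format the kernel expects on that file is an assumption (a plain decimal TID), and the function names are hypothetical.

```python
import os
import threading

AI_THREAD_PATH = "/proc/set_ai_thread"  # path from the quoted article
AI_CORES = range(8, 16)                 # cores 8-15, per the quoted article

def unlock_ai_cores(path=AI_THREAD_PATH, tid=None):
    """Step 1: write the thread's TID to the kernel interface.

    Assumption: the file accepts the TID as plain decimal text.
    """
    if tid is None:
        tid = threading.get_native_id()  # kernel TID of the calling thread
    with open(path, "w") as f:
        f.write(str(tid))
    return tid

def run_on_ai_cores(argv):
    """Unlock, pin to the A100 cores, then exec the requested program."""
    unlock_ai_cores()
    os.sched_setaffinity(0, AI_CORES)    # step 2: pin the calling thread
    os.execvp(argv[0], argv)             # replace this process; does not return
```

Calling something like `run_on_ai_cores(["make", "-j8", "test"])` on a K3 would then mirror the `ai make -j8 test` example, on the assumption that the unlocked state and affinity carry over to the exec'd program and its children, as the launcher's described behaviour suggests.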

As normal CPUs the eight 2-wide in-order A100 cores (like an A53 or A55 or Pentium or PPC603) add about 40% normal scalar processing power to the eight X100 cores.

That's better than Hyperthreading and well worth using for some additional processing power. Just kick off a background build, or CI or something there while you do something else on the X100 cores. If you ignore the special "AI" matrix processing extension they are just perfectly normal RISC-V RVA23 cores as far as user code is concerned — and in fact significantly faster than the previous generation K1 chip.

A Linux kernel build on just the A100 "AI" cores is faster than on any previous RISC-V SBC under $1000, including the HiFive Premier P550 or Milk-V Megrez. It's several times faster than the VisionFive 2 or Milk-V Jupiter / BPI-F3.

    Build Linux kernel 7503345ac5f5
    
    14m25.56s SpacemiT K3 8 X100 cores plus 8x A100 cores distcc
    18m3.871s SpacemiT K3 8 X100 cores @2.4 GHz
    19m12.787s i9-13900HX, 24C/32T @5.4 GHz, riscv64/ubuntu docker
    39m23.187s SpacemiT K3 8 A100 cores @2.0 GHz
    42m12.414s Milk-V Megrez 4 P550 cores @1.8 GHz
    67m35.189s VisionFive 2, 4 U74 cores @1.5 GHz
    70m57.001s LicheePi 3A, 8 X60 cores @1.6 GHz
    88m04.185s LicheePi 4A, 4 C910 cores @1.85 GHz
The K3 is also faster than using QEMU/Docker on my 24 core i9-13900 laptop, while using 25W instead of 200W.

Note that the fastest time uses one distccd on the X100 cores and another distccd on the A100 cores. This adds a lot of overhead in preprocessing and communication over the network (loopback, but still), but it still gives a pretty nice boost. Running independent tasks on each set of cores is more efficient, though. Or teaching `gmake` or `ninja` to distribute to two pools of cores using my `ai` launcher would be even better ...
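
To put numbers on the table above, here is a small sketch that converts three of the listed build times to seconds and compares them. Only the times are taken from the table; the helper name `seconds` is hypothetical.

```python
def seconds(t: str) -> float:
    """Convert a `time`-style string such as '18m3.871s' to seconds."""
    minutes, rest = t.rstrip("s").split("m")
    return int(minutes) * 60 + float(rest)

x100 = seconds("18m3.871s")    # 8 X100 cores
a100 = seconds("39m23.187s")   # 8 A100 cores
both = seconds("14m25.56s")    # both clusters via distcc

a100_relative = x100 / a100    # A100 throughput as a fraction of X100 throughput
distcc_speedup = x100 / both   # gain from adding the A100 cores via distcc

print(f"A100 cluster throughput: {a100_relative:.0%} of the X100 cluster")
print(f"distcc speedup over X100 alone: {distcc_speedup:.2f}x")
```

On these figures the A100 cluster alone delivers a bit under half of the X100 cluster's kernel-build throughput, while the combined distcc run is about 1.25x faster than the X100 cores alone, which fits the point that distcc overhead eats part of the potential gain.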

cmrdporcupine

4 hours ago

Curious how this compares -- overall -- to the RK3588 devices that I have a few of.

People have made the NPU on that thing do LLMs, and sounds like around the same level (max 3Bish params, 5-6 tok/s last time I tried).

In terms of raw CPU performance, sounds slower?

But maybe has more cores?

Ouch the memory bandwidth sounds really bad.