Flow Computing aims to boost CPUs with ‘parallel processing units’

96 points, posted 7 hours ago
by rbanffy

32 Comments

interroboink

5 hours ago

Seems like a nice idea — instead of the stark CPU/GPU divide we have today, this would fit somewhere in the middle.

Reminds me slightly of the Cell processor, with its dedicated SPUs for fast processing, orchestrated by a traditional CPU. But we all saw how successful that was (: And that had some pretty big backing.

Overcoming the inertia of the current computing hardware landscape is such a huge task. Maybe they can find some niche(s).

winwang

an hour ago

I'd believe more in a heterogeneous chip (e.g. MI300X, Apple M series, or even APUs) than in completely new chip tech.

Animats

4 hours ago

Does anyone know what they mean by "wave synchronization"? That's supposedly their trick to prevent all those parallel CPUs from blocking waiting for data. Found a reference to something called that for transputers, from 1994.[1] May be something else.

Historically, this has been a dead end. Most problems are hard to cut up into pieces for such machines. But now that there's much interest in neural nets, there's more potential for highly parallel computers. Neural net operations are very regular. The inner loop for backpropagation is about a page of code. This is a niche, but it seems to be a trillion dollar niche.
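To make the regularity concrete (a minimal sketch with made-up shapes, not from any particular framework): the backprop inner loop for one dense layer is just two matrix multiplies.

```python
import numpy as np

def dense_backward(x, w, grad_out):
    """Backprop through y = x @ w for one dense layer.

    x: (batch, n_in) activations, w: (n_in, n_out) weights,
    grad_out: (batch, n_out) gradient of the loss w.r.t. y.
    Returns gradients w.r.t. w and x.
    """
    grad_w = x.T @ grad_out   # accumulate the weight gradient
    grad_x = grad_out @ w.T   # pass the gradient to the previous layer
    return grad_w, grad_x

# Both steps are plain matrix multiplies with fixed shapes, which is
# why this loop maps so well onto highly parallel hardware.
x = np.random.randn(64, 128)
w = np.random.randn(128, 32)
g = np.random.randn(64, 32)
gw, gx = dense_backward(x, w, g)
assert gw.shape == (128, 32) and gx.shape == (64, 128)
```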

Neural net operations are so regular they belong on purpose-built hardware. Something even more specialized than a GPU. We're starting to see "AI chips" in that space. It's not clear that something highly parallel and more general purpose than a GPU has a market niche. What problem is it good for?

[1] https://www.sciencedirect.com/science/article/abs/pii/014193...

mikewarot

9 minutes ago

The reason problems are hard to fit into most of what's tried is that everyone is trying to save precious silicon space and fit a specific problem, adding special purpose blocks, etc. It's my belief that this is an extremely premature optimization to make.

Why not break it apart into homogeneous bitwise operations? That way everything will always fit. It would also simplify compilation.
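As a toy illustration of that lowering (my own sketch, not anything Flow proposes): a fixed-width add decomposed into a homogeneous grid of AND/XOR/OR gates, one identical ripple-carry cell per bit.

```python
def bit_add(a, b, width=8):
    """Add two unsigned integers using only homogeneous bitwise gates:
    a ripple-carry adder, one identical AND/XOR/OR cell per bit.
    The result wraps around at the given width, like hardware would."""
    result, carry = 0, 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        s = ai ^ bi ^ carry                       # sum bit: two XORs
        carry = (ai & bi) | (carry & (ai ^ bi))   # carry-out cell
        result |= s << i
    return result

assert bit_add(23, 42) == 65
assert bit_add(255, 1) == 0   # 8-bit wraparound
```

Every bit position runs the same cell, so the structure is uniform regardless of the operation being lowered.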

narag

4 hours ago

> We're starting to see "AI chips" in that space.

"Positronic" came to my mind.

wmf

an hour ago

This is based on legitimate (although second-tier) academic research that appears to combine aspects of GPU-style SIMD/SIMT with Tera-style massive multithreading. (main paper appears to be https://www.utupub.fi/bitstream/handle/10024/164790/MPP-TPA-... )

Historically, the chance of such research turning into a chip you can buy is zero.

throwawayffffas

4 hours ago

How is this different from an integrated GPU, other than that it presumably doesn't do graphics?

yeahwhatever10

5 hours ago

When will we get the “Mill” cpu?

theLiminator

3 hours ago

I've been following that saga for a long time. Seems mostly like vapourware sadly.

mshook

4 hours ago

At this point, probably never it seems...

pier25

4 hours ago

I'm probably missing something but why not use gpus for parallel processing?

nine_k

3 hours ago

GPUs work on massive amounts of data in parallel, but they execute basically the same operations every step, maybe skipping or slightly varying some steps depending on the data seen by a particular processing unit. But processing units cannot execute independent streams of instructions.

GPUs of course have several parts that can work in parallel, but they are few, and each part consists of a large number of units that execute the same instruction stream simultaneously over a large chunk of data.
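A toy model of that lock-step execution (plain Python, purely illustrative): every lane steps through both sides of a branch, and a predicate mask decides whose result is kept.

```python
def simt_if(lanes, cond, then_fn, else_fn):
    """Toy SIMT execution: every lane runs BOTH branches of an if;
    a per-lane mask selects which result survives."""
    mask = [cond(x) for x in lanes]
    then_res = [then_fn(x) for x in lanes]   # all lanes run the then-branch
    else_res = [else_fn(x) for x in lanes]   # ...and the else-branch too
    return [t if m else e for m, t, e in zip(mask, then_res, else_res)]

# Odd lanes double, even lanes halve -- but every lane pays for both
# computations, which is the cost of divergence within one unit.
out = simt_if([1, 2, 3, 4], lambda x: x % 2, lambda x: x * 2, lambda x: x // 2)
assert out == [2, 1, 6, 2]
```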

winwang

2 hours ago

This is not true. Take the NVidia 4090: 128 SMs, each with 4 SMSPs, so 4x128 = 512 SMSPs. That is the number of warps which can execute independently of each other. A warp, in contrast, is a 32-wide vector, i.e. 32 "same operations", with up to 512 different batches in parallel. So it's more like a 512-core, 1024-bit vector processor.

That being said, I believe the typical number of warps to saturate an SM is normally around 6 rather than 4, so more like 768 concurrent 32-wide "different" operations to saturate compute. Of course, the issue is that you run into overhead and memory bandwidth problems, both of which are highly difficult to navigate around; the register file storing all the registers of each thread is extremely power-hungry (in fact, the most power-hungry part, I believe), for example.
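In numbers (the 6-warps-per-SM saturation figure above is a rough rule of thumb, not a spec):

```python
SMS = 128           # streaming multiprocessors on a 4090
SMSPS_PER_SM = 4    # sub-partitions, each issuing one warp
WARP_WIDTH = 32     # lanes per warp: the "same operation" width

independent_warps = SMS * SMSPS_PER_SM
assert independent_warps == 512        # the "512-core" figure

WARPS_TO_SATURATE = 6                  # rough per-SM rule of thumb
concurrent_warps = SMS * WARPS_TO_SATURATE
assert concurrent_warps == 768         # concurrent 32-wide "different" ops
```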

A PPU with a smaller vector width (e.g. AVX512) would have proportionally more overhead (possibly more than linearly so in terms of circuit design). And that's without talking about how most programs depend on latency-optimized RAM (rather than bandwidth-optimized GDDR/HBM).

nine_k

an hour ago

I'm happy to stand corrected; apparently my idea of GPUs has become obsolete.

JackSlateur

4 hours ago

Because GPUs are physically built to manage parallel tasks, but only a few kinds.

They are very specialized.

CPUs are generic; they have lots of transistors to handle a lot of different instructions.

Groxx

3 hours ago

Also moving data to and from the GPU takes MUCH more time than between CPU cores (though combined chips drastically lower this difference).

exabrial

2 hours ago

I’m still waiting for a clockless core… some day

aidenn0

2 hours ago

www.greenarraychips.com

somat

2 hours ago

Whenever I see the word "fintech", this is the article I am expecting. Instead I am disappointed with some drivel about banks.

I am not sure what is wrong with me; you would think my brain would have figured it out by now, but it always parses it wrong. Perhaps if it were "finctech" that would help.

cryptoz

2 hours ago

I’ve not yet had this problem but I surely will now! Thanks I guess.

brotchie

5 hours ago

“Now, the team is working on a compiler for their PPU” good luck!

bhouston

4 hours ago

While the Itanium failed, the Ageia PPU did succeed with its compiler. It was acquired by NVIDIA and became CUDA.

https://en.wikipedia.org/wiki/Ageia

gdiamos

4 hours ago

It did indeed get merged into the CUDA group, but I think the internal CUDA project predated it, or at least several of the engineers working on it did.

mepian

2 hours ago

That's not the same PPU, is it?

claxo

4 hours ago

Indeed, a very smart compiler would be necessary, perhaps too much for the current compiler art, like the Itanium.

But... how about specializing to problems with inherent parallelism? LLMs, maybe?

greenavocado

4 hours ago

Is this like the Itanium architecture with its compiler challenges?

petermcneeley

4 hours ago

> Now, the team is working on a compiler for their PPU

I think a language is also required here. Extracting parallelism from C++ is non-trivial.

poincaredisk

4 hours ago

Something similar to CUDA or OpenCL should do it, right?

johnklos

4 hours ago

Tell us something new, please.