I'm trying to change the nature of LLM compute, with an idea that's been scratching at the back of my mind for decades. If it works, someone (likely not me) will make Billions, or even Trillions of dollars. I'm very deep in eccentric old man/lone genius territory. Current odds favor the former.
BitGrid[1] - I'm trying to figure out how to write software for a BitGrid. Long, long ago I looked at computing and decided that the best high-performance strategy was something like an FPGA, but simplified, and optimized for overall throughput rather than for the lowest latency or the most efficient use of silicon. A vast 2D grid of cells, each holding four 16-bit lookup tables with latched outputs, seems optimal to me. (It's a hunch, and I could very well be wrong about it.)
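To make that concrete, here's a minimal sketch of one cell as I understand the esolangs description: four 1-bit inputs, four 1-bit outputs, each output driven by its own 16-entry truth table indexed by the inputs. The function name and input ordering are my own illustration, not code from the repo.

```python
def eval_cell(luts, n, e, s, w):
    """One BitGrid cell: `luts` is a list of four 16-bit integers,
    one truth table per output. The four 1-bit inputs form a 4-bit
    index into each table (ordering here is an assumption)."""
    index = (n << 3) | (e << 2) | (s << 1) | w
    return tuple((lut >> index) & 1 for lut in luts)

# Example tables: output 0 is the XOR (parity) of all four inputs,
# output 1 is the AND of all four, outputs 2 and 3 are constant 0.
PARITY = 0b0110100110010110  # bit i is set iff popcount(i) is odd
AND4   = 0b1000000000000000  # only bit 15 (index 0b1111) is set

print(eval_cell([PARITY, AND4, 0, 0], 1, 0, 1, 0))  # -> (0, 0, 0, 0)
print(eval_cell([PARITY, AND4, 0, 0], 1, 1, 1, 1))  # -> (0, 1, 0, 0)
```

Any 4-input boolean function fits in one such table, which is what makes the cell a universal building block.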
Getting an expression translated into a directed acyclic graph of binary logical operations seems to have been achieved, with extensive help from ChatGPT5.[2] (Warning: it's a huge pile of code written almost exclusively by an LLM, not me, as I say at the start of the README.md.)
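The idea of lowering arithmetic to a DAG of gates can be sketched in a few lines. This is in the spirit of the linked repo but not taken from it; the `Gate` dataclass and names are my own illustration, using a one-bit full adder as the expression:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gate:
    op: str           # 'XOR', 'AND', 'OR', or 'IN'
    args: tuple = ()  # child gates (empty for inputs)
    name: str = ''    # variable name, for 'IN' nodes

def full_adder(a, b, cin):
    """Lower one bit of addition to a DAG of two-input gates."""
    s1 = Gate('XOR', (a, b))
    s  = Gate('XOR', (s1, cin))
    c  = Gate('OR', (Gate('AND', (a, b)), Gate('AND', (s1, cin))))
    return s, c

def evaluate(g, env):
    """Walk the DAG, looking input bits up in `env`."""
    if g.op == 'IN':
        return env[g.name]
    x, y = (evaluate(child, env) for child in g.args)
    return {'XOR': x ^ y, 'AND': x & y, 'OR': x | y}[g.op]

a, b, cin = (Gate('IN', name=n) for n in ('a', 'b', 'cin'))
s, c = full_adder(a, b, cin)
env = {'a': 1, 'b': 1, 'cin': 1}
print(evaluate(s, env), evaluate(c, env))  # -> 1 1
```

Each two-input gate here is trivially expressible as one BitGrid LUT output; the hard part, as the next paragraphs say, is placement and routing, not lowering.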
The thing now is to figure out how to map that graph onto a grid. The LLM has strong opinions about how things should go - it keeps using the word "physicalize" - and those have led it into a few rabbit holes I've had to steer us back out of. I think I'll get us there, but not as soon as I had hoped.
The latest is that just getting bits to move across the grid took some backtracking, but forward progress is being made again. The LLM seemed to think you could just plop data into the middle of the grid, which would have meant defeating the one advantage of a BitGrid, its purely local wiring... oops.
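Here's a hedged sketch of what "moving bits" means in a latched grid, reduced to a single row of pass-through cells. Everything below is illustrative, not the repo's code: a value can only be injected at the west edge, and it advances one cell per clock tick until it falls out the east edge.

```python
WIDTH = 8
latches = [0] * WIDTH  # one latched output per cell in the row

def tick(west_input):
    """Advance the row one clock: each cell latches its west neighbor,
    and the bit at the far end emerges at the east edge."""
    global latches
    latches = [west_input] + latches[:-1]
    return latches[-1]

stream = [1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
out = [tick(b) for b in stream]
print(out)  # the first input bit emerges on tick number WIDTH
```

The point of the exercise: there's no way to write into the middle of the row, only to clock values in from the edge, and latency scales with distance crossed.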
The thing I really want to know is how big a BitGrid it would actually take to hold a frontier model LLM like ChatGPT5. Obviously it's not going to fit on a single chip; it will have to be distributed. I've actually managed to figure out the distribution and communications part. What's left is the raw cell count, plus good estimates for the power and area of a cell implemented in 7nm or better silicon, and the math falls out the other side. Either you save billions of dollars and gigawatts, or you don't.
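The shape of that calculation is simple, even though the inputs aren't known yet. Every number below is a placeholder assumption of mine, not a measurement or a sourced figure; the point is only that once the per-cell area and power are known, everything else is multiplication.

```python
# Back-of-envelope sizing. ALL inputs are placeholder guesses -- swap in
# real numbers once the cell is characterized.
PARAMS          = 1e12  # assumed parameter count of a frontier model
CELLS_PER_PARAM = 100   # assumed BitGrid cells per weight + its math
CELL_AREA_UM2   = 1.0   # assumed cell area at ~7 nm, in square microns
CELL_POWER_UW   = 0.1   # assumed average power per cell, in microwatts

cells    = PARAMS * CELLS_PER_PARAM
area_mm2 = cells * CELL_AREA_UM2 / 1e6   # um^2 -> mm^2
power_mw = cells * CELL_POWER_UW / 1e12  # uW -> MW
dies     = area_mm2 / 800                # assuming ~800 mm^2 reticle-limit dies

print(f"{cells:.1e} cells, {area_mm2:.1e} mm^2, "
      f"{dies:.0f} dies, {power_mw:.0f} MW")
```

Whether the answer is "save gigawatts" or "don't" depends entirely on those four placeholder constants, which is exactly why the TinyTapeout test chip matters.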
How? Imagine a grid of FPGAs big enough to hold ChatGPT5, with all the math programmed directly into the logic, not separate RAM and matmul units like a sane person would use. Add latches periodically through the flow so that it's effectively a 4,000,000-stage pipeline or so. Run it at 1 GHz and a single stream would see a maximum throughput of 250 tokens/second. However, you could then run 3,999,999 other streams through the remaining pipeline slots. ;-)
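The pipeline arithmetic above, spelled out (the clock rate and stage count are the figures assumed in the paragraph, not measured ones): one token advances one stage per clock, so a single stream sees clock/stages tokens per second, while the pipeline keeps one in-flight stream per stage.

```python
CLOCK_HZ = 1_000_000_000  # assumed 1 GHz clock
STAGES   = 4_000_000      # assumed pipeline depth

per_stream = CLOCK_HZ / STAGES       # tokens/s seen by one stream
in_flight  = STAGES                  # independent streams in the pipe
aggregate  = per_stream * in_flight  # tokens/s across all streams

print(per_stream, in_flight, aggregate)  # 250.0 4000000 1000000000.0
```

So the per-stream rate looks modest, but the aggregate is one token per clock: a billion tokens per second, if you can keep four million streams fed.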
I was hoping to get the above software figured out before the submission deadline for the TinyTapeout[3] Sky25a run in 18 days, so I can get an ASIC made. I think I might have to start working on that submission before I even know whether it's worth it. My hunch, and every bit of math I've done so far, says it will be, but it would be nice to know for sure.
I hope to have some estimates figured out by next month, and will let you know if I made it into Sky25a or not.
[1] https://esolangs.org/wiki/Bitgrid
[2] https://github.com/mikewarot/Bitgrid_Python
[3] https://tinytapeout.com/