czhu12
7 days ago
The MAMBA [1] model gained some traction as a potential successor. It's basically an RNN without the non-linearity applied across hidden states, which makes the recurrence linear, so it can be evaluated with a parallelizable scan [2] in logarithmic depth instead of being unrolled sequentially in linear time.
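To make the scan point concrete, here is a minimal sketch of my own (scalar state, NumPy, not anything from the MAMBA paper): once the nonlinearity is gone, each step h_t = a_t*h_{t-1} + b_t is just an affine map, affine maps compose associatively, and an associative combine operator is exactly what a parallel prefix scan needs.

    # Sketch: a linear recurrence h_t = a_t*h_{t-1} + b_t computed two ways.
    import numpy as np

    def sequential_scan(a, b):
        # Plain RNN-style unrolling: T strictly sequential steps.
        h, out = 0.0, []
        for a_t, b_t in zip(a, b):
            h = a_t * h + b_t
            out.append(h)
        return np.array(out)

    def combine(x, y):
        # Associative composition of two affine maps (apply x, then y):
        # a2*(a1*h + b1) + b2 = (a1*a2)*h + (a2*b1 + b2)
        a1, b1 = x
        a2, b2 = y
        return (a1 * a2, a2 * b1 + b2)

    def parallel_scan(a, b):
        # Hillis-Steele inclusive scan: log2(T) passes; within each pass the
        # updates are independent of each other, so they could run in parallel.
        pairs = list(zip(a, b))
        n, step = len(pairs), 1
        while step < n:
            new = list(pairs)
            for i in range(step, n):
                new[i] = combine(pairs[i - step], pairs[i])
            pairs = new
            step *= 2
        # With h_{-1} = 0, the accumulated affine map's offset is h_t itself.
        return np.array([b_t for _, b_t in pairs])

    a = np.random.uniform(0.5, 1.0, size=16)
    b = np.random.randn(16)
    assert np.allclose(sequential_scan(a, b), parallel_scan(a, b))

A tanh or other nonlinearity between steps breaks the associativity, which is why an ordinary RNN can't be parallelized this way.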
It promises much faster inference with much lower compute costs, and I think at up to 7B params it performs on par with transformers. I've yet to see a 40B+ model trained.
The researchers behind MAMBA went on to start a company called Cartesia [3], which applies MAMBA to voice models.
[1] https://jackcook.com/2024/02/23/mamba.html
[2] https://www.csd.uwo.ca/~mmorenom/HPC-Slides/Parallel_prefix_... <- Pulled up a random example from Google, but Stanford CS149 has an entire lecture devoted to parallel scan.
kla-s
7 days ago
Jamba 1.5 Large has 398B params (94B active), and the weights are available.
https://arxiv.org/abs/2408.12570
Credit to https://news.ycombinator.com/user?id=sanxiyn for making me aware of it.
imtringued
6 days ago
Mamba isn't really a competitor to transformers. Quadratic attention exists for a reason.
Mamba's strength lies in being a better RNN, as you said. It's probably better than transformers for things like object permanence over a sequence of inputs where each input is, say, an image.
However, it would still make sense to have a transformer actually process each image, by cutting it into patches and performing quadratic attention over them, and then feed the transformer's output into Mamba to get the actual output, e.g. a robot action, while maintaining object permanence.
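Roughly what that hybrid could look like, as a sketch under my own assumptions (PyTorch, the mamba_ssm package's Mamba block, illustrative names and sizes; none of this is a known implementation): a small ViT-style encoder does quadratic attention within each frame, and a Mamba layer carries state across frames to produce per-frame action logits.

    import torch
    import torch.nn as nn
    from mamba_ssm import Mamba  # assumed dependency: pip install mamba-ssm

    class FrameEncoder(nn.Module):
        # Quadratic attention over the patches of a single image.
        def __init__(self, d_model=256, patch=16):
            super().__init__()
            self.to_patches = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=4)

        def forward(self, img):                               # (B, 3, H, W)
            x = self.to_patches(img).flatten(2).transpose(1, 2)  # (B, P, D)
            x = self.encoder(x)                               # attention over patches
            return x.mean(dim=1)                              # one token per frame

    class FramesToActions(nn.Module):
        # Mamba carries state across the sequence of frame tokens.
        def __init__(self, d_model=256, n_actions=7):
            super().__init__()
            self.frame_enc = FrameEncoder(d_model)
            self.temporal = Mamba(d_model=d_model)            # linear-time over frames
            self.head = nn.Linear(d_model, n_actions)

        def forward(self, frames):                            # (B, T, 3, H, W)
            B, T = frames.shape[:2]
            tokens = self.frame_enc(frames.flatten(0, 1)).view(B, T, -1)
            h = self.temporal(tokens)                         # (B, T, D)
            return self.head(h)                               # per-frame action logits

The appeal of the split is that attention cost stays quadratic only in the number of patches per frame, while the cost over time stays linear in the number of frames.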
monroewalker
7 days ago
Oh, that would be awesome if it works. Thanks for sharing.
stavros
7 days ago
If I'm not misremembering, Mistral released a model based on MAMBA, but I haven't heard much about it since.