laughingcurve
5 days ago
The article is from 2018/19, and this hypothesis remains just that afaik, with plenty of evidence going against it
swyx
5 days ago
i interviewed Jon (lead author on this paper) and yeah he pretty much disowns it now https://www.latent.space/p/mosaic-mpt-7b
gwern
5 days ago
Could you explain why you think that? I'm looking at the lottery ticket section and it seems like he doesn't disown it; the reason he gives, via Abhinav, for not pursuing it at his commercial job is just that that kind of sparsity is not hardware friendly (except with Cerebras). "It doesn't provide a speedup for normal commercial workloads on normal commercial GPUs and that's why I'm not following it up at my commercial job and don't want to talk about it" seems pretty far from "disowning the lottery ticket hypothesis [as wrong or false]".
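(Concretely, the hardware problem: lottery-ticket sparsity is unstructured, so in practice it gets applied as a binary mask over dense weights, and the GPU still executes the full dense kernel. A rough PyTorch sketch, just to illustrate the point, not anything from the interview:

    import torch
    import torch.nn as nn

    layer = nn.Linear(4096, 4096)

    # Zero out the smallest 90% of weights by magnitude
    # (lottery-ticket-style unstructured sparsity).
    k = int(0.9 * layer.weight.numel())
    cutoff = layer.weight.abs().flatten().kthvalue(k).values
    with torch.no_grad():
        layer.weight *= (layer.weight.abs() > cutoff).float()

    x = torch.randn(32, 4096)
    y = layer(x)  # still a full dense 4096x4096 matmul: the zeros get
                  # multiplied anyway, so wall-clock time is unchanged
                  # on an ordinary GPU

Hardware that can actually skip the zeros, like Cerebras, is the exception.)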
oofbey
5 days ago
I think that was pretty clear even when this paper came out: even if you could find these subnetworks, they wouldn't be faster on real hardware. Never thought much of this paper, but it sure did get a lot of people excited.
sailingparrot
5 days ago
It was exciting because of what it means regarding how a model learns, regardless of whether or not it's commercially applicable.
gwern
5 days ago
(Cerebras is real hardware.)
oofbey
5 days ago
It is real in that it exists. It is not real in the sense that almost nobody has access to them. Unless you work at one of the handful of organizations with their hardware, it’s not a practical reality.
aaronblohowiak
5 days ago
how long will that be the case?
oofbey
5 days ago
They have a strange business model. Their chips are massive, so they necessarily only sell them to large customers. Also, because of the way they're built (the entire wafer is a single chip), no two chips will be the same. Normally, imperfections in manufacturing result in some parts of the wafer being rejected and others being binned as fast or slow chips. If you use the whole wafer, you get what you get. So it's necessarily a strange platform to work with: every device is slightly different.
IshKebab
5 days ago
At least for the foreseeable future (next 50 years say).
laughingcurve
5 days ago
i saw how it nerdsniped an extremely capable faculty member
swyx
5 days ago
he pretty much always says it offline haha but i may have mixed it up with the subsequent convo we had at neurips https://www.latent.space/p/neurips-2023-startups
laughingcurve
5 days ago
cool beans, thanks for this -- I think it's easier to hear it directly from the authors. I was hesitant to start researchposting and come off like a dick.
also, note to self: if I publish and disown my papers, Shawn will interview me :)
yorwba
5 days ago
What evidence against it do you have in mind? I think it's a result of little practical relevance without a way to identify winning tickets that doesn't require buying lots of tickets until you hit the jackpot (i.e. training a large, dense model to completion) but that doesn't make the observation itself incorrect.
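(For concreteness, "buying lots of tickets" means something like the following sketch of the paper's iterative magnitude pruning procedure; `train` here is a hypothetical stand-in for a full training run that applies the masks:

    import copy
    import torch

    def find_winning_ticket(model, train, prune_frac=0.2, rounds=5):
        init_state = copy.deepcopy(model.state_dict())  # remember the random init
        masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
        for _ in range(rounds):
            train(model, masks)  # the expensive part: train the masked model to completion
            for name, p in model.named_parameters():
                surviving = (p * masks[name]).abs().flatten()
                k = int(prune_frac * int(masks[name].sum()))  # drop 20% of survivors
                if k > 0:
                    cutoff = surviving[surviving > 0].kthvalue(k).values
                    masks[name] *= (p.abs() > cutoff).float()
            model.load_state_dict(init_state)  # rewind survivors to their initial values
        return masks  # the "winning ticket": a sparse mask plus the original init

Every round costs a full dense training run, which is the buying-tickets-until-you-hit-the-jackpot part.)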
kingstnap
5 days ago
The observation itself is also partially incorrect. Here's a video I watched a few months ago that went further into the whole question of how you deal with subnetworks.
https://youtu.be/WW1ksk-O5c0?list=PLCq6a7gpFdPgldPSBWqd2THZh... (timestamped)
At the timestamp they discuss how the original ICLR results actually only worked on extremely tiny models, and larger ones didn't work. The adaptation you need to fix it is to train densely for a few epochs first; only then can you start increasing sparsity.
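(If I understand the fix right, it matches what the follow-up work calls weight "rewinding": save a checkpoint after a few epochs of dense training and reset surviving weights to that checkpoint instead of to the random init. A hedged sketch of the change, with `train` again a hypothetical stand-in for a training run:

    import copy
    import torch

    def find_ticket_with_rewind(model, train, rewind_epochs=3, prune_frac=0.2, rounds=5):
        train(model, epochs=rewind_epochs)                # short dense warm-up first
        rewind_state = copy.deepcopy(model.state_dict())  # checkpoint at epoch k, not epoch 0
        masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
        for _ in range(rounds):
            train(model, masks=masks)                     # train the masked model to completion
            for name, p in model.named_parameters():      # magnitude-prune the survivors
                surviving = (p * masks[name]).abs().flatten()
                k = int(prune_frac * int(masks[name].sum()))
                if k > 0:
                    cutoff = surviving[surviving > 0].kthvalue(k).values
                    masks[name] *= (p.abs() > cutoff).float()
            model.load_state_dict(rewind_state)           # rewind to epoch-k values, not the init
        return masks

The only difference from the original recipe is where you rewind to, but per the video that's what makes it hold beyond toy models.)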
paulsutter
5 days ago
Watched the video - thanks
Ioannou is saying the paper's idea for training a sparse network doesn't work in non-toy networks (the paper's method for selecting promising weights early doesn't improve the network)
BUT the term "lottery ticket" refers to the true observation that a small subset of weights drives functionality (see all the pruning papers). It's great terminology because they truly are coincidences based on random numbers.
All that's been disproven is that paper's specific method for creating a sparse network based on this observation