Scene_Cast2
a year ago
I wonder how much of the improvement is owed to each change. I've also never heard of "Muon - Momentum Orthogonalized by Newton-Schulz" being used before.
EDIT: there's a bit more info on his Twitter - https://x.com/kellerjordan0
It looks like he created this optimizer. It works on 2D matrices only.
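For anyone wondering what "orthogonalized by Newton-Schulz" could mean in practice, here is a rough sketch. It assumes the classical cubic Newton-Schulz iteration, X <- 1.5 X - 0.5 X X^T X, which I can't confirm is the exact variant Muon uses. The function name and the `steps` default are mine, and this is only the orthogonalization step, not the optimizer itself. It also hints at why the input has to be a 2D matrix.

```python
import numpy as np


def newton_schulz_orthogonalize(m, steps=20):
    """Approximate the nearest semi-orthogonal matrix to a 2D array.

    Hypothetical helper for illustration only: it uses the classical cubic
    Newton-Schulz iteration, which may differ from what Muon actually does.
    """
    m = np.asarray(m, dtype=float)
    if m.ndim != 2:
        raise ValueError("expected a 2D matrix")
    # Divide by the Frobenius norm so every singular value is at most 1,
    # which is inside the iteration's convergence region (0, sqrt(3)).
    x = m / (np.linalg.norm(m) + 1e-12)
    for _ in range(steps):
        # Each step pushes every nonzero singular value toward 1.
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x


# Usage: a tall 3x2 matrix ends up with (approximately) orthonormal columns.
m = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
q = newton_schulz_orthogonalize(m)
print(np.round(q.T @ q, 6))
```

The appeal, as far as I can tell, is that this needs only matrix multiplies rather than an SVD. Badly conditioned inputs would need more steps than the default here.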