Modded-NanoGPT: NanoGPT (124M) quality in 3.25B tokens

77 points, posted 18 hours ago
by ocean_moist

9 Comments

Scene_Cast2

17 hours ago

I wonder how much of the improvement is owed to each change. I've also never heard of "Muon - Momentum Orthogonalized by Newton-Schulz" being used before.

EDIT: There's a bit more info on his Twitter: https://x.com/kellerjordan0

It looks like he created this optimizer. It works on 2D matrices only.
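For readers unfamiliar with the name: it suggests an optimizer that orthogonalizes the momentum buffer with a Newton-Schulz matrix iteration before applying the update. Below is a minimal sketch of that idea, assuming the classic cubic Newton-Schulz iteration for the polar (orthogonal) factor; the repo's actual implementation, iteration coefficients, and hyperparameters may differ, and `muon_step` here is a hypothetical helper for illustration only.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the orthogonal factor U @ V.T of G's SVD using the
    cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X @ X.T @ X.
    Assumes G is a 2D matrix. Normalizing by the Frobenius norm keeps
    all singular values <= 1, which is inside the iteration's basin of
    convergence."""
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate in the short-fat orientation for cheaper matmuls
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X.T if transposed else X

def muon_step(param: torch.Tensor, grad: torch.Tensor,
              momentum_buf: torch.Tensor, lr: float = 0.02,
              beta: float = 0.95) -> None:
    """One Muon-style step as I understand the name (illustrative, not
    the repo's code): accumulate momentum, then apply the orthogonalized
    momentum as the update direction."""
    momentum_buf.mul_(beta).add_(grad)          # standard momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)
    param.data.add_(update, alpha=-lr)          # step along the orthogonal factor
```

The intuition for orthogonalizing is that it equalizes the singular values of the update, so no single direction in parameter space dominates the step. That is also why it only makes sense for 2D weight matrices.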

molticrystal

15 hours ago

Just needs a Zero To Hero series episode offering line-by-line commentary, to follow along on why each choice was made over the alternatives.

byyoung3

10 hours ago

Do you have a baseline of the regular implementation with a 3x learning rate?

m3kw9

15 hours ago

So it compresses info better.

gavindean90

17 hours ago

Seems like this is a modded NanoGPT, not the original.

munchler

17 hours ago

Yes. It’s literally called “Modded-NanoGPT”.