timhigins
3 days ago
Might be worth updating the title to "SmolLM: state-of-the-art small language model trained on open datasets" (See the first table of https://huggingface.co/blog/smollm for benchmarks)
It was fascinating to dig into this and find their dataset weights defined in a declarative YAML file [2]. 70% is from FineWeb/Common Crawl, filtered using a classifier trained on Llama-70b's 0-5 ratings of the educational content of each text [3]. We've known for a while that small models like Phi-3 do this, but it's great to see a fully open reproduction that beats their benchmarks. It definitely supports the idea that you can get even better reasoning at smaller model sizes by carefully filtering and curating your training data (and by generating good synthetic data from, or distilling, bigger models).
You can see the 450k Llama educational-value scores here: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-ll... Interestingly, I think the text scored 3 is really good, while the text scored 5 tends to be content that isn't very reasoning- or information-heavy and just mentions education or a worksheet. For SmolLM they just took the documents with scores >= 3, so it doesn't matter a ton.
[2] https://github.com/huggingface/smollm/blob/9efce803bc7e37727...
[3] https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier
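For anyone who wants to picture the "scores >= 3" step, here's a toy sketch in Python. It is not the actual SmolLM/FineWeb-Edu pipeline: the `text`/`score` field names, the `keep_educational` helper, and the sample documents are all made up for illustration, and it assumes each document already has an integer 0-5 score attached.

```python
from typing import Iterable, Iterator

# Threshold from the comment above: keep documents scored 3 or higher.
MIN_SCORE = 3


def keep_educational(docs: Iterable[dict], min_score: int = MIN_SCORE) -> Iterator[dict]:
    """Yield only documents whose (hypothetical) 'score' field meets the threshold."""
    for doc in docs:
        if doc["score"] >= min_score:
            yield doc


# Made-up sample documents; the scores are illustrative, not real classifier output.
sample = [
    {"text": "Photosynthesis converts light into chemical energy.", "score": 4},
    {"text": "Buy cheap watches online today!", "score": 0},
    {"text": "Worksheet: name the parts of a cell.", "score": 5},
    {"text": "A short note on how merge sort works.", "score": 3},
]

kept = list(keep_educational(sample))
for doc in kept:
    print(doc["score"], doc["text"])
```

Since everything at or above the threshold is kept, the 5-scored worksheet-style text stays in alongside the 3s and 4s, which is why the odd 5s don't matter much.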
timhigins
3 days ago
Update: While SmolLM was SOTA at the time of release in July, SmolLM 2 1.7B (which is the newest release) is not currently the best model under 2B params on https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...
pixelart34
3 days ago
Exactly. I love the openness about the training data, which none of the big labs disclose. According to the model card and scores, v2 apparently uses a better mix.