minimaxir
21 hours ago
> Note: we are not releasing any post-trained / IT checkpoints.
I get not trying to cannibalize Gemma, but that's weird. A 540M multimodal model that performs well on queries would be useful, and "just post-train it yourself" is not always an option.
jeffjeffbear
21 hours ago
Isn't finetuning the point of T5-style models, since they perform better at smaller parameter counts?
refulgentis
19 hours ago
It’ll be a major pain in the ass to replicate exactly what they did to make it long context and multimodal. Sucks too because the smol Gemma 3s with the same parameter count were neither.
jeffjeffbear
18 hours ago
> https://huggingface.co/google/t5gemma-2-1b-1b
From here it looks like it still is long context and multimodal though?
> Inputs and outputs
> Input:
> - Text string, such as a question, a prompt, or a document to be summarized
> - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
> - Total input context of 128K tokens
> Output:
> - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
> - Total output context up to 32K tokens
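A quick back-of-envelope check on those numbers (a minimal sketch; the constants are just the figures from the model card quoted above) of how much of the 128K input window images consume:

```python
# Figures from the t5gemma-2 model card quoted above.
IMAGE_TOKENS = 256            # each 896 x 896 image encodes to 256 tokens
INPUT_CONTEXT = 128 * 1024    # 128K total input context

def remaining_text_budget(n_images: int) -> int:
    """Input tokens left for text after placing n_images in the prompt."""
    used = n_images * IMAGE_TOKENS
    assert used <= INPUT_CONTEXT, "too many images for the context window"
    return INPUT_CONTEXT - used

print(remaining_text_budget(10))      # 128512 tokens of text headroom
print(INPUT_CONTEXT // IMAGE_TOKENS)  # 512 images max, with no text at all
```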
rhdunn
9 hours ago
If you are finetuning the model, you need to replicate the training conditions so you don't remove those capabilities. If you just finetune a multimodal model on text, it will lose some of its vision capabilities as the text part of the model drifts away from the vision, audio, etc. components. A similar thing happens when finetuning reasoning models.
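A minimal sketch of the usual mitigation, replay mixing: keep a fraction of image-text pairs in every batch so the vision pathway keeps getting gradient signal. The data, batch size, and 25% ratio here are all illustrative assumptions, not anything from the t5gemma-2 recipe:

```python
import random

# Stand-in records: your text-only finetuning corpus, plus a replay set of
# image-text pairs drawn from data similar to the original training mix.
text_examples = [{"text": f"text sample {i}"} for i in range(1000)]
image_text_examples = [{"image": f"img_{i}.png", "text": f"caption {i}"} for i in range(300)]

def mixed_batches(text_data, image_data, batch_size=8, image_ratio=0.25, seed=0):
    """Yield batches where roughly `image_ratio` of examples include an
    image, so text-only finetuning doesn't let the text stack drift away
    from the vision encoder."""
    rng = random.Random(seed)
    n_img = max(1, int(batch_size * image_ratio))
    n_txt = batch_size - n_img
    while True:
        batch = rng.sample(image_data, n_img) + rng.sample(text_data, n_txt)
        rng.shuffle(batch)
        yield batch

batches = mixed_batches(text_examples, image_text_examples)
print(next(batches))  # inspect one mixed batch
```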
Even if you did finetune the model on both text and images, you could run into issues if your image descriptions differ from the ones it was trained on. You could probably work around that by getting the model to describe the images itself (see the sketch below), but you'd still need to audit the results to correct any issues or add what you're training for.
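As a sketch of that self-captioning workaround, assuming a Hugging Face `transformers` setup (the BLIP checkpoint and file names here are just stand-ins; ideally you'd caption with the model you're finetuning so the style matches its training data):

```python
from transformers import pipeline

# Any image-to-text model works for illustration; using the target model
# itself would keep the caption style consistent with its training data.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

image_paths = ["dog.png", "chart.png"]  # hypothetical local files

for path in image_paths:
    result = captioner(path)  # returns e.g. [{"generated_text": "..."}]
    caption = result[0]["generated_text"]
    # Audit step: spot-check captions before adding them to the finetuning
    # set, correcting errors and adding task-specific details.
    print(path, "->", caption)
```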
You can also run into overfitting if your data doesn't cover enough of the variation that the original model's training set had.
Using different training parameters could also affect the model's capabilities; just knowing things like the input context length isn't enough.
navvyeanand
3 hours ago
This is very true. However, I wonder how much of this could be mitigated by using training data from other open-source models, like Olmo3 for textual data and Emu3.5 for vision?
CuriouslyC
7 hours ago
This is the thing that kills me about SFT. It made sense when most of a model's compute went into pretraining and RL was mostly for question answering. Now that RL is driving model capabilities, it doesn't make much sense.
On the other hand, RL on deployed systems looks promising as a way to essentially JIT-optimize models. Experiments with model routers and agentic RAG have shown good results.
sundarurfriend
14 hours ago
This made me compare the figures: did they accidentally switch those around, or are the Post-training Reasoning and Factuality scores actually significantly lower than the Pre-training ones?
Edit: Just noticed
> Also note pre-training and post-training benchmarks are different, so scores are not comparable across plots.
The paper gives more details about the specific benchmarks and the scores obtained in them: https://arxiv.org/html/2512.14856v1#S4