barrell
3 days ago
I recently upgraded a large portion of my pipeline from gpt-4.1-mini to gpt-5-mini. The performance was horrible - after some research I decided to move everything to mistral-medium-0525.
Same price, but dramatically better results, way more reliable, and 10x faster. The only downside is when it does fail, it seems to fail much harder. Where gpt-5-mini would disregard the formatting in the prompt 70% of the time, mistral-medium follows it 99% of the time, but the other 1% of the time inserts random characters (for whatever reason, normally backticks... which then causes its own formatting issues).
Still, very happy with Mistral so far!
mark_l_watson
3 days ago
It is such a common pattern for LLMs to surround generated JSON with ```json … ``` that I check for this at the application level and fix it. Ten years ago I would do the same sort of sanity checks on formatting when I used LSTMs to generate synthetic data.
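A minimal application-level fix for that pattern might look like this (a sketch, assuming the fence is the usual ```json … ``` wrapper):

```python
import json
import re

def parse_llm_json(text: str):
    """Strip an optional Markdown code fence (```json ... ``` or
    plain ``` ... ```) before parsing the payload as JSON."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    payload = match.group(1) if match else text.strip()
    return json.loads(payload)
```

This tolerates both fenced and unfenced replies, so it keeps working on the runs where the model happens to follow instructions.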
mpartel
2 days ago
Some LLM APIs let you give a schema or regex for the answer. I think it works because LLMs give a probability for every possible next token, and you can filter that list by what the schema/regex allows next.
hansvm
2 days ago
Interestingly, that gives a different response distribution from simply regenerating while the output doesn't match the schema.
Rudybega
2 days ago
This is true, but there are methods to greatly reduce the effect of this and generate results that match or even improve overall output accuracy:
e.g. DOMINO https://arxiv.org/html/2403.06988v1
joshred
2 days ago
It sounds like they are describing a regex filter being applied to the model's beam search. LLMs generate the most probable words, but they are frequently tracking several candidate phrases at a time and revising their combined probability. It lets them self correct if a high probability word leads to a low probability phrase.
I think they are saying that if the highest-probability phrase fails the regex, the LLM can substitute the next most likely candidate.
stavros
2 days ago
You're actually applying a grammar to the token stream. If you're outputting, for example, JSON, you know what characters are valid next (because of the grammar), so you just filter out the tokens that don't fit the grammar.
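A toy sketch of that filtering step. Real implementations (llama.cpp grammars, Outlines, etc.) compile a full grammar or regex into a token-level automaton; this simplified check only tracks bracket nesting and ignores brackets inside strings:

```python
def is_valid_json_prefix(text: str) -> bool:
    """Very simplified grammar check: brackets opened so far must
    close in the right order. (Ignores brackets inside strings.)"""
    stack = []
    pairs = {"}": "{", "]": "["}
    for ch in text:
        if ch in "{[":
            stack.append(ch)
        elif ch in "}]":
            if not stack or stack.pop() != pairs[ch]:
                return False
    return True

def allowed_tokens(prefix: str, vocab: list[str]) -> list[str]:
    """Keep only the candidate tokens the 'grammar' permits next."""
    return [tok for tok in vocab if is_valid_json_prefix(prefix + tok)]
```

At each decoding step the sampler picks only from `allowed_tokens`, so an ill-formed continuation is never emitted in the first place.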
viridian
2 days ago
I'm sure the reason is the plethora of Markdown data it was trained on. I personally use ``` stuff.txt ``` extremely frequently, in a variety of places.
In slack/teams I do it with anything someone might copy and paste to ensure that the chat client doesn't do something horrendous like replace my ascii double quotes with the fancy unicode ones that cause syntax errors.
In readme files any example path, code, yaml, or json is wrapped in code quotes.
In my personal (text file) notes I also use ``` {} ``` to denote a code block I'd like to remember, just out of habit from the other two above.
accrual
2 days ago
Same. For me it's almost a symbiotic thing. After using LLMs for a couple of years I noticed I use code blocks/backticks a lot more often. It's helpful for me as an inline signal like "this is a function name or hostname or special keyword" but it's also helpful for other people/Teams/Slack and LLMs alike.
OJFord
2 days ago
I'm the opposite, always been pretty good about doing that in Slack etc. (or even here where it doesn't affect the rendering) but I sometimes don't bother in LLM chat.
fumeux_fume
2 days ago
Very common struggle, but a great way to prevent that is prefilling the assistant response with "{" or as much JSON output as you're going to know ahead of time like '{"response": ['
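A sketch of that prefill trick, assuming an Anthropic-style messages API where a trailing assistant message is treated as the start of the reply (details vary by provider; the prompt here is a placeholder):

```python
# The final assistant message acts as a prefill: the model continues
# from '{"response": [', so its output starts mid-JSON.
messages = [
    {"role": "user", "content": "List the items as JSON."},
    {"role": "assistant", "content": '{"response": ['},
]

def assemble(prefill: str, completion: str) -> str:
    """Re-attach the prefill before parsing, since the API returns
    only the continuation."""
    return prefill + completion
```

The easy mistake is forgetting that last step: the completion alone is not valid JSON until the prefill is glued back on.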
XenophileJKO
2 days ago
Just to be clear for anyone reading this, the optimal way to do this is schema enforced inference. You can only get a parsable response. There are failure modes, but you don't have to mess with parsing at all.
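As a sketch, schema-enforced inference usually means attaching a JSON Schema to the request so the decoder can only emit conforming tokens. The payload below follows the OpenAI-style `response_format` convention; the model name and schema are placeholders, and other providers spell this differently:

```python
# Hypothetical request payload for schema-enforced (structured) output.
request = {
    "model": "some-model",  # placeholder
    "messages": [{"role": "user", "content": "Summarize this as JSON."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "summary",
            "strict": True,  # decoder may only emit schema-conforming tokens
            "schema": {
                "type": "object",
                "properties": {"summary": {"type": "string"}},
                "required": ["summary"],
                "additionalProperties": False,
            },
        },
    },
}
```

With this in place the reply is parsable by construction, which is why there's no fence-stripping or retry logic to maintain.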
psadri
2 days ago
Haven’t tried this. Does it mix well with tool calls? Or does it force a response where you might have expected a tool call?
fumeux_fume
2 days ago
It'll force a response that begins with an open bracket. So if you might need a response with a tool call that doesn't start with "{", then it might not fit your workflow.
Alifatisk
3 days ago
I think this is the first time I stumbled upon someone who actually mentions LSTM in a practical way instead of just theory. Cool!
Would you like to elaborate further on how the experience was with it? What was your approach for using it? How did you generate synthetic data? How did it perform?
p1esk
2 days ago
10 years ago I used LSTMs for music generation. Worked pretty well for short MIDI snippets (30-60 seconds).
freehorse
2 days ago
I had similar issues with local models, ended up actually requesting the backticks because it was easier this way, and parsed the output accordingly. I cached a prompt with explicit examples how to structure data, and reused this over and over. I have found that without examples in the prompts some llms are very unreliable, but with caching some example prompts this becomes a non-issue.
mejutoco
2 days ago
Funny, I do the same. Additionally, one can define a json schema for the output and try to load the response as json or retry for a number of times. If it is not valid json or the schema is not followed we discard it and retry.
It also helps with having a field of the json be the confidence or a similar pattern to act as a cut for what response is accepted.
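A sketch of that validate-and-retry loop with a confidence cut (`call_llm` is a hypothetical stand-in for your actual client call, and the field name `confidence` is an assumption):

```python
import json

def get_validated(call_llm, min_confidence=0.7, max_retries=3):
    """Retry until the reply parses as JSON and clears the confidence cut.

    `call_llm` is a zero-argument function returning the model's raw
    text; swap in your real client call.
    """
    for _ in range(max_retries):
        raw = call_llm()
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not valid JSON: discard and retry
        conf = data.get("confidence")
        if isinstance(conf, (int, float)) and conf >= min_confidence:
            return data
    return None  # all retries exhausted
```

Returning `None` instead of raising keeps the "discard and retry" policy explicit at the call site.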
tosh
2 days ago
I think most mainstream APIs by now have a way for you to conform the generated answer to a schema.
Alifatisk
3 days ago
I do use backticks a lot when sharing examples in different formats when using LLMs, and I have instructed them to do likewise; I also upvote whenever they respond in that manner.
I got this format from writing markdown files, it’s a nice way to share examples and also specify which format it is.
barrell
3 days ago
Yeah, that’s infuriating. They’re getting better now with structured data, but it’s going to be a never ending battle getting reliable data structures from an LLM.
This is maybe more, maybe less insidious. It will literally just insert a random character into the middle of a word.
I work with an app that supports 120+ languages though. I give the LLM translations, transliterations, grammar features etc and ask it to explain it in plain English. So it’s constantly switching between multiple real, and sometimes fake (transliterations) languages. I don’t think most users would experience this
epolanski
2 days ago
I had a similar experience on my pipeline.
Was looking to both decrease costs and experiment out of OpenAI offering and ended up using Mistral Small on summarization and Large for the final analysis step and I'm super happy.
They have also a very generous free tier which helps in creating PoCs and demos.
siva7
2 days ago
I thought i was the only one experiencing this slowness. I can't comprehend why something called gpt mini is actually slower than their non-mini counterpart.
barrell
19 hours ago
Nooo you are definitely not alone. gpt-5-nano is even the slowest model I’ve used since like 2023, second only to gpt-5-mini.
fkyoureadthedoc
3 days ago
Same, my project has a step that selects between many options when a user is trying to do some tasks. The test set for the workflow that supports this has a better success rate by about 7% on gpt-4.1-mini vs gpt-5 and gpt-5-mini (with minimal thinking)
brcmthrowaway
2 days ago
What are you actually making
barrell
2 days ago
I’m making an app to learn multiple languages. This portion of the pipeline is about explaining everything I can determine about a word in a sentence in specifically formatted prose.
Example: https://x.com/barrelltech/status/1963684443006066772?s=46&t=...
WhitneyLand
2 days ago
Were you using structured output with gpt-5 mini?
Is there an example you can show that tended to fail?
I’m curious how token constraint could have strayed so far from your desired format.
barrell
2 days ago
Here is an example of the formatting I desired: https://x.com/barrelltech/status/1963684443006066772?s=46&t=...
Yes I use(d) structured output. I gave it very specific instructions and data for every paragraph, and asked it to generate paragraphs for each one using this specific format. For the formatting, I have a large portion of the system prompt detailing it exactly, with dozens of examples.
gpt-5-mini would normally use this formatting maybe once, and then just kinda do whatever it wanted for the rest of the time. It also would freestyle and put all sorts of things in the various bold and italic sections (using the language name instead of the translation was one of its favorites) that I’ve never seen mistral do in the thousands of paragraphs I’ve read. It also would fail in some other truly spectacular ways, but to go into all of them would just be bashing on gpt-5-mini.
Switched it over to mistral, and with a bit of tweaking, it’s nearly perfect (as perfect as I would expect from an LLM, which is only really 90% sufficient XD)
viridian
2 days ago
I'm curious what your prompts look like, as this is the opposite of my experience. I use lmarena for many of the random one shot questions I have, and I've noticed that mistral-medium is almost always the worse of the two after I blind vote. Feels like it consistently takes losses from qwen, llama, gemini, gpt, you name it. I find it overwhelmingly the most likely to produce factually untrue information to an inquiry.
Would you be willing to share an example prompt? I'm curious to see what it's responding well to.
barrell
2 days ago
I provide it with data and ask it to convert it to prose in specific formats.
Mistral medium is ranked #8 on lmsys arena IIRC, so it’s probably just not your style?
I’m also comparing this to gpt-5-mini, not the big boy
viridian
2 days ago
I think input strategy probably accounts for the difference. Usually I'm just asking a short question with no additional context, and usually it's not the sort of thing that has one well defined answer. I'm really asking it to summarize the wisdom of the crowd, so to speak.
For example, I ask, what are the most common targets of removal in Magic: The Gathering? Mistral's answer is so-so, including a slew of cards you would prioritize removing, but also several you typically wouldn't, including things like Mox Amber, a 0-cost mana rock. Gemini Flash gave far fewer examples, one for each major card type, but all of them are definitely priority targets that often defined an entire metagame, like Tarmogoyf.
barrell
2 days ago
Ah yeah. I’m only grading it on its prose, formatting, ability to interpret data, and instruction following. I do not use it as a store of knowledge
thijsverreck
3 days ago
Any chance of fixing it with regex parsing or redoing inference when the results are below a certain threshold?
barrell
3 days ago
It’s user facing, so will just have an option for users to regenerate the explanation. It happens rarely enough that it’s not a huge issue, and normally doesn’t affect content (I think once I saw it go a little wonky and end the sentence with a few random words). Just sometimes switches to monospace font in the middle of a paragraph, or it “spells” a word wrong (spell is in quotes because it will spell `chien` as `chi§en`).
It’s pretty rare though. Really solid model, just a few quirks
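For the rare ghost-character case, a narrow post-processing pass can strip known stray characters that appear inside a word (a sketch; the ghost list here is a guess and would need tuning per model, and it deliberately leaves backticks that border whitespace alone so real inline code survives):

```python
import re

def strip_ghost_chars(text: str, ghosts: str = "§`") -> str:
    """Remove listed 'ghost' characters only when they are sandwiched
    between word characters, e.g. 'chi§en' -> 'chien'."""
    pattern = re.compile(r"(?<=\w)[" + re.escape(ghosts) + r"](?=\w)")
    return pattern.sub("", text)
```

Because both neighbors must be word characters, legitimate `code` spans (whose backticks touch whitespace or punctuation) pass through untouched.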
FranklinMaillot
2 days ago
You may be aware of that, but they released mistral-medium-2508 a few days ago.
barrell
2 days ago
I did not! It’s not on azure yet and I’ve still got some credits to burn. That’s exciting though, hopefully it will iron out this weird ghost character issue.
noreplydev
2 days ago
mistral speed is amazing