RicardoRei
5 days ago
Hi HN - I’m the Head of AI Research at Sword Health and one of the authors of this benchmark (posting from my personal account).
We built MindEval because existing benchmarks don’t capture real therapy dynamics or common clinical failure modes. The framework simulates multi-turn patient–clinician interactions and scores the full conversation using evaluation criteria designed with licensed clinical psychologists.
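In outline, each evaluation run looks roughly like this (a simplified sketch with made-up helper names and a simplified turn order, not our actual code; the real prompts and scoring logic are in the repo):

    from typing import Callable, Dict, List, Tuple

    Turn = Tuple[str, str]  # (speaker, message)

    def run_episode(
        patient: Callable[[List[Turn]], str],           # persona-prompted LLM simulating the patient
        clinician: Callable[[List[Turn]], str],         # model under evaluation
        judge: Callable[[List[Turn]], Dict[str, int]],  # LLM judge: full transcript -> 1-6 per dimension
        max_turns: int = 20,                            # we also run 40-turn variants
    ) -> Dict[str, int]:
        """One simulated patient-clinician conversation, scored at the end."""
        transcript: List[Turn] = []
        for _ in range(max_turns):
            transcript.append(("patient", patient(transcript)))
            transcript.append(("clinician", clinician(transcript)))
        # The judge scores the whole conversation, not individual turns.
        return judge(transcript)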
We validated both patient realism and the automated judge against human clinicians, then benchmarked 12 frontier models (including GPT-5, Claude 4.5, and Gemini 2.5). Across all models, average clinical performance stayed below 4 on a 1–6 scale. Performance degraded further in severe symptom scenarios and in longer conversations (40 turns vs 20). We also found that larger or reasoning-heavy models did not reliably outperform smaller ones in therapeutic quality.
We open-sourced all prompts, code, scoring logic, and human validation data because we believe clinical AI evaluation shouldn’t be proprietary.
Happy to answer technical questions on methodology, validation, known limitations, or the failure modes we observed.
embedding-shape
5 days ago
Did you use the same prompts for all the models, or individualized prompts per model? Did you try a range of prompts that were very different from each other, if you used more than a base prompt?
I'm sure it's somewhere in the details, but after a quick skim I didn't find anything outlining how you managed and used the prompts, and whether they were per model or not.
Thanks a bunch for being open to answering questions here, and thanks for trying to attack this particular problem with scientific rigor, even if it's really difficult to do so.
RicardoRei
5 days ago
The prompts are kept the same for all models; otherwise the comparison would not really be fair. In any case, you can check all the prompts in our GitHub repo.
embedding-shape
5 days ago
> Otherwise comparison would not be super fair.
Wouldn't it be easy to keep that fair by making sure all models are tried with the same set of prompts? So you have models X and Y, and prompts A and B: X runs once with A and once with B, and the same for Y (see the sketch at the end of this comment).
The reason I ask is that in my own local benchmarks, which I run for each model release with my own tasks, I've noticed a huge variance in response quality based on the prompts themselves. Slight variations in wording seem to have a big effect on the final responses, and those variations in turn have very different effects depending on the model.
Sometimes a huge system prompt makes a model return much higher-quality responses, while another model gives much higher-quality responses when the system prompt is as small as it possibly can be. At least this is what I'm seeing with the local models I'm putting under test with my private benchmarks.
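In code terms, the grid I mean looks something like this (a toy sketch; run_benchmark is just a placeholder for a real evaluation run):

    from itertools import product

    def run_benchmark(model: str, system_prompt: str) -> float:
        """Placeholder for an actual evaluation run; would return a judge score."""
        return 0.0

    models = ["model_x", "model_y"]
    prompts = {
        "A": "a long, detailed system prompt ...",
        "B": "a minimal system prompt",
    }

    # Every model runs with every prompt, so no model is stuck with a wording
    # that happens to suit it unusually well or badly.
    scores = {
        (model, name): run_benchmark(model, text)
        for model, (name, text) in product(models, prompts.items())
    }
    # Compare models on, say, their best or mean score across prompts.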
irthomasthomas
5 days ago
Did you re-test the past models with the new prompt you found? How many times did you run each prompt? Did you use the same rubric to score each experiment?
embedding-shape
5 days ago
> Did you re-test the past models with the new prompt you found?
Yeah, initially I wrote this test/benchmark harness because I wanted to compare multiple different prompts for the same tasks and the same model, but it eventually grew out from there. It still has the prompts at its core, and I re-run everything whenever something changes or I add new models.
> How many times did you run each prompt?
It's structured as Category > Task > Case, combined with a list of Prompts for each Task; each Case then runs with each of the Prompts. So I guess you could say that each prompt gets "exercised" once for every case that exists in the Task it belongs to.
> Did you use the same rubric to score each experiment?
I'm not sure if you mean something specific by "rubric" (I'm not from academia), but they're all pretty much binary "passed" or "not passed". The coding ones are backed by unit tests that were initially failing and must pass afterwards without being changed, the translation ones are backed by (mostly) simple string checking, and so on. I don't have any tasks or cases that are "Rate this solution from 0-10" or similar.
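Roughly, the shape is something like this (heavily simplified, not the real harness):

    from dataclasses import dataclass
    from typing import Callable, Dict, List

    @dataclass
    class Case:
        name: str
        input: str
        passed: Callable[[str], bool]  # binary check: failing unit tests now pass, string match, etc.

    @dataclass
    class Task:
        name: str                      # tasks are grouped under categories
        prompts: List[str]             # each case runs once per prompt
        cases: List[Case]

    def run_task(task: Task, model: Callable[[str, str], str]) -> Dict[str, bool]:
        """model(system_prompt, case_input) -> output; every (prompt, case) pair gets exercised."""
        return {
            f"{task.name}/{i}/{case.name}": case.passed(model(prompt, case.input))
            for i, prompt in enumerate(task.prompts)
            for case in task.cases
        }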
EagnaIonat
5 days ago
Models have different nuances though. With Llama 4, for example, you have to explicitly ask it not to output its CoT, whereas with GPT you don't.
jbgt
5 days ago
Have you seen the Feeling Great app? It's not an official therapy app, but it's based on TEAM-CBT and made by David Burns and his team.
Burns is really into data gathering, and his app is an LLM kept on the rails of the TEAM process; it seems to be very well received.
I found it simple and very well done - and quite effective.
A top-level comment says that therapists aren't good either; Burns would argue that's mainly because no one measures before and after, so the effect is never measured.
And of the people I know who see a therapist, practically none can tell me what exactly they are doing, what methods they are using, or how anything is structured.
taurath
5 days ago
If CBT performed as well as David Burns suggests, we'd really have no need for therapists. Alas, it turns out that cognitive problems aren't a factor in a lot of mental health conditions. I say this as someone who has read all the literature and spent 8 years floundering in CBT-oriented therapy with little changing but the practitioner. It's not a cure-all or even a cure-most, but it's treated as such because it has properties that match well with medical insurance billing practices.
> And of people I know who see a therapist, practically none can tell me what exactly they are doing or what methods they are doing or how anything is structured.
I could tell you that as a client, but that’s because I’ve read into it. This is sort of like asking an ER patient to describe the shift management system of the clinic they went into.
kayodelycaon
5 days ago
This has been my experience. When it comes down to it, CBT is just a more effective version of “try harder”.
What’s really aggravating is that CBT was never designed to be a general, cure-all therapy, and I think the people behind it know this. But try explaining nuance to a public that doesn’t want to hear it.
megaman821
5 days ago
Did the real clinicians get all 6s in this test?
crazygringo
5 days ago
Right, this result seems meaningless without a human clinician control.
I'd very much like to see clinicians randomly selected from BetterHelp, paid to interact with the LLM patient in the same way, and judged by the LLM judge, exactly as in the current methodology, and see what score they get.
Ideally this would be done blind (I don't know whether BetterHelp allows therapy through a text chat interface?), so the therapist has no idea it's for a study and isn't trying to "do better" than they would for any average client.
Because while I know a lot of people for whom therapy has been life-changing, I also know of a lot of terrible and even unprofessional therapy experiences.
RicardoRei
5 days ago
The results are not meaningless, but they are not comparing humans against LLMs. The goal is to have something that can be used to test LLMs on realistic mental health support.
The main points of our methodology are: 1) prove that it is possible to simulate patients with an LLM, which we did; 2) prove that an LLM-as-a-Judge can effectively score conversations along several dimensions similar to how clinicians are also evaluated, which we also did, showing that the average correlation with human evaluators is medium-high.
Given 1) and 2), we can then benchmark LLMs, and as you see, there is plenty of room for improvement. We did not claim anything regarding human performance... it's likely that human performance also needs to improve :) That's another study.
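For 2), the validation boils down to scoring the same conversations with both clinicians and the judge and correlating the two sets of scores. In spirit (toy numbers, a single dimension, Spearman just as an example; the exact setup and the human validation data are in the repo):

    from scipy.stats import spearmanr

    # Per-conversation scores on one evaluation dimension (illustrative numbers only).
    clinician_scores = [3, 4, 2, 5, 3, 4]
    judge_scores     = [3, 4, 3, 5, 2, 4]

    rho, p = spearmanr(clinician_scores, judge_scores)
    print(f"judge-human rank correlation: rho={rho:.2f} (p={p:.3f})")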
crazygringo
5 days ago
Got it, thank you.
So the results are meaningful in terms of establishing that LLM therapeutic performance can be evaluated.
But not meaningful in terms of comparing LLMs with human clinicians.
So in that case, how can you justify the title you used for submission, "New benchmark shows top LLMs struggle in real mental health care"?
How are they struggling? Struggling relative to what? For all your work shows, couldn't they be outperforming the average human? Or even if they're below that, couldn't they still have a large net positive effect with few negative outcomes?
I don't understand where the negative framing of your title is coming from.
RicardoRei
5 days ago
Again, these things don't depend on each other.
LLMs have room for improvement (we show that their scores are medium-low on several dimensions).
Maybe the average human also has lots of room for improvement. One thing does not necessarily depend on the other.
The same way we can say that LLMs still have room for improvement on a specific task (let's say mathematics) while the average human is also bad at mathematics...
We don't make any claims about human therapists, just that LLMs have room for improvement on several dimensions if we want them to be good at therapy. Showing this is the first step to improving them.
crazygringo
5 days ago
But you chose the word "struggle". And now you say:
> Just that LLMs have room for improvement on several dimensions if we want them to be good at therapy.
That implies they're not currently good at therapy. But you haven't shown that, have you? How are you defining that a score of 4 isn't already "good"? How do you know that isn't already correlated with meaningfully improved outcomes, and therefore already "good"?
Everybody has room for improvement if you say 6 is perfection and something isn't reaching 6 on average. But that doesn't mean everybody's struggling.
I take no issue with your methodology. But your broader framing, and title, don't seem justified or objective.
arisAlexis
5 days ago
Yes exactly. Seems like there is an agenda against LLMs acting as therapists.
palmotea
5 days ago
> Right, this result seems meaningless without a human clinician control.
> I'd very much like to see clinicians randomly selected from BetterHelp and paid to interact the same way with the LLM patient and judged by the LLM, as the current methodology uses. And see what score they get.
Does it really matter? Per the OP:
>>> Across all models, average clinical performance stayed below 4 on a 1–6 scale. Performance degraded further in severe symptom scenarios and in longer conversations (40 turns vs 20).
I'd assume a real therapy session has far more "turns" than 20-40, and if model performance starts low and gets lower with length, it's reasonable to expect it would be worse than a human (who typically doesn't have the characteristic of becoming increasingly unhinged the longer you talk to them).
Also, my impression is that BetterHelp pays poorly and thus tends to have less-skilled, overworked therapists (https://www.reddit.com/r/TalkTherapy/comments/1letko9/is_bet..., https://www.firstsession.com/resources/betterhelp-reviews-su...), e.g.
> Betterhelp is a nightmare for clients and therapists alike. Their only mission seems to be in making as much money as possible for their shareholders. Otherwise they don't seem at all interested in actually helping anyone. Stay away from Betterhelp.
So taking it as a baseline would bias any experiment against human therapists.
crazygringo
5 days ago
> Does it really matter?
Yes, it absolutely does matter. Look at what you write:
> I'd assume
> it's reasonable to expect
The whole reason to do a study is to actually study as opposed to assume and expect.
And for many of the kinds of people engaging in therapy with an LLM, BetterHelp is precisely where they are most likely to go, due to its marketing, convenience, and price. It's where a ton of real therapy is happening today. Most people do not have a $300/hr high-quality therapist nearby who is available and whom they can afford. LLMs need to be compared, first, to the alternatives that are readily available.
And remember that all therapists on BetterHelp are licensed, with a master's or doctorate, and meet state board requirements. So I don't understand why that wouldn't be a perfectly reasonable baseline.
JoblessWonder
5 days ago
I love how the top comment on that Reddit post is an *affiliate link* to an online therapy provider.
palmotea
4 days ago
> I love how the top comment on that Reddit post is an affiliate link to an online therapy provider.
Posted 6 months after the post and all the rest of the comments. It's some kind of SEO manipulation. That reddit thread ranked highly in my Google search about Betterhelp being bad, so they're probably trying to piggyback on it.
fragmede
4 days ago
oh no. someone might make money. we can't let other people succeed. someone stop them!
JoblessWonder
4 days ago
I’m not against affiliate links. I’m just pro-disclosure especially for something as important as therapy and it seems like maybe you should mention you make $150 for each person that signs up.
nradov
5 days ago
Yes, text chat is one of the communication options for BetterHelp (and some of their competitors).
RicardoRei
5 days ago
This is a good point. We have not tested clinicians, but I believe they would not score each other perfectly either: we observed some disagreement between the clinicians' scores themselves, which reflects differing opinions among clinicians.
megaman821
5 days ago
It is nice to have an accurate measure of things, and a human baseline would be helpful on top of that.
Many things can be useful before they reach the level of world's best. Although with AI, non-intuitive failure modes must be taken into consideration too.
vessenes
5 days ago
Thanks for open sourcing this.
I'm skeptical of the value of this benchmark, and I'm curious for your thoughts - self play / reinforcement tasks can be useful in a variety of arenas, but I'm not a priori convinced they are useful when the intent is to help humans in situations where theories of mind matter.
That is, we're using the same underlying model(s) to simulate both a patient and a judgment as to how patient-like that patient is -- this seems like an area where I'd really want to feel confident that my judge LLM is accurate; otherwise the training data I'm generating is at risk of converging on a theory of mind / patients that's completely untethered from, you know, patients.
Any thoughts on this? I feel like we want a human in the loop somewhere here, probably scoring the judge LLM's determinations until we feel that the judge LLM is at human or superhuman level. Until then, this risks building up a self-consistent, but ultimately just totally wrong, set of data that will be used in future RL tasks.