miki123211
9 hours ago
> Try asking any of them “Am I speaking in a low voice or a high voice?” in a high-pitched voice, and they won’t be able to tell you.
I wonder how much of that is LLMs being bad, and how much is LLMs being (over) aligned not to do it.
AFAIK, ChatGPT Voice mode had to have a lot of safeguards put on it to prevent music generation, accent matching (if you sound Indian, it shouldn't also sound Indian), and assuming ethnicity / biasing based on accents.
It doesn't seem that implausible to me that some of these behaviors have been aligned out of these models out of an abundance of caution.
vvolhejn
8 hours ago
Author here. I think it's more of a capability issue than a safety issue. Since learning audio is still harder than learning text, audio models don't generalize as well. To compensate, audio models rely on combining information from text and audio (having a single model that consumes/produces both text and audio tokens), and the audio pathway basically ends up being an integrated speech-to-text/text-to-speech. This reflects my colleagues' experience working on Moshi, and it seems to be the case for other models too, see the Conclusion section.
Part of the reason can also be synthetic data: if you fine-tune on data generated from text via a text-to-speech system, the tone of the voice doesn't carry any information, so the model learns to ignore it.
JoshTriplett
7 hours ago
Speech audio models not understanding pitch seems similar to how text LLMs often don't understand spelling: it's not what they were trained to recognize.
smusamashah
3 hours ago
There was an example on the OpenAI blog of ChatGPT copying the speaker's voice and responding in it mid-conversation. This was presented as an example of non-alignment.
wordglyph
an hour ago
I used AI Studio and it understood pitch and even emotion with an uploaded mp3
oezi
7 hours ago
> generated from text via a text-to-speech
Yes, frustratingly we don't have good speech-to-text (STT/ASR) to transcribe such differences.
I recently finetuned a TTS* to be able to emit laughter, and hunting for transcriptions which include non-verbal sounds was the hardest part of it. Whisper and other popular transcription systems will ignore sighs, sniffs, laughs, etc. and can't detect mispronunciations.
jasonjayr
7 hours ago
IIRC -- the 15.ai dev was training on fan-made "My Little Pony" transcriptions, specifically because they included more emotive clues in the transcription, and supported a syntax to control the emotive aspect of the speech.
dotancohen
5 hours ago
Where can I read about this?
j45
8 hours ago
Accent detection or consciously ignoring it is a filter step.
tsol
9 hours ago
Did they respond differently depending on what race they thought you were? I'm surprised they would even do that honestly. I thought they were trained on text conversations which presumably wouldn't have any of that to learn from.
OisinMoran
8 hours ago
You can often tell where someone is from based on text alone! There are plenty of idiosyncrasies even in how different English-speaking countries use the language.
fragmede
8 hours ago
Like, what do you mean? Are there, like, particular mannerisms people from some regions have that are hella unique to those regions?
robotresearcher
7 hours ago
I say old chap, what colour are your mummy’s wellies?
ElevenLathe
4 hours ago
You betcha!
ctxc
7 hours ago
Clever!
anotherhue
8 hours ago
Ah stop
j45
8 hours ago
There are subtle differences in language where two groups can be speaking English and one is having a completely different conversation without saying much.
dotancohen
5 hours ago
This is quite the reason my wife evolved into my ex-wife.
thwarted
8 hours ago
If it did, it responded based on the accent it picked up on, not race, because race and accent are orthogonal; correlation does not imply causation.
dotancohen
5 hours ago
Are you denying that race and accent are highly correlated?
idonotknowwhy
9 hours ago
Qwen3-Omni's transcriber can do this. It can describe the voice and emotion very well
85392_school
8 hours ago
I've also had luck with Gemini. If I made a few noises and asked which one was higher pitched, it could easily tell.
sbrother
9 hours ago
I don't think it's just safeguards; they really don't seem to understand pitch at all. I tried asking ChatGPT's advanced voice mode to recognize a tune I was humming, and it insisted it was Beethoven's 5th -- multiple times. I think it must have basically tokenized my humming to "dun dun dun duuun".
bigzyg33k
9 hours ago
advanced voice mode operates on audio tokens directly, it doesn't transcribe them into "text tokens" as an intermediate step like the original version of voice mode did.
sbrother
8 hours ago
right, but either whatever audio tokenization it's doing doesn't seem to encode pitch, or there's ~nothing where pitch is relevant in the training set.
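For contrast, pitch is trivially recoverable from raw samples before any tokenization happens. A minimal sketch (pure stdlib, synthetic sine tones standing in for a "low voice" and a "high voice" -- nothing model-specific): even a crude upward zero-crossing count separates the two.

```python
import math

def upward_zero_crossings(samples):
    """Count upward zero crossings; for a clean tone over one second of
    audio this roughly equals its frequency in Hz."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)

sr = 16000  # sample rate in Hz
low = [math.sin(2 * math.pi * 110 * n / sr) for n in range(sr)]   # ~A2, a "low voice"
high = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]  # ~A4, a "high voice"

print(upward_zero_crossings(low), upward_zero_crossings(high))
```

So if the model can't answer "low voice or high voice?", the information was almost certainly discarded somewhere between the waveform and the tokens (or never rewarded in training), not absent from the signal.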
oezi
8 hours ago
Absolutely correct! My simple test is whether it can tell the American and British pronunciations of "tomato" and "potato" apart. So far it can't.
fragmede
2 hours ago
Which "it" are you referring to? There are models that can.
cubefox
8 hours ago
But they behave just like models which use text tokens internally, which is also pointed out at the end of the above article.
bigzyg33k
7 hours ago
we don't know if that's due to inherent limitations of the tokenisation of audio, or a byproduct of reinforcement learning. In my own usage, I noticed a significant degradation in capabilities over time from when they initially released advanced voice mode. The model used to be able to sing, whisper, imitate sounds and tone just fine, but I imagine this was not intended and has subsequently been stunted via reinforcement learning.
I don't find the article's argument that this is due to tokenisation convincing.
cubefox
7 hours ago
They didn't say it's due to tokenization.
> This is likely because they’re trained on a lot of data generated synthetically with text-to-speech and/or because understanding the tone of the voice (apparently) doesn’t help the models make more accurate predictions.
bongodongobob
6 hours ago
Hmm, the last time I played with GPT voice mode it was able to do all kinds of different accents.