OpenAI rolls out Advanced Voice Mode with more voices and a new look

56 points, posted 7 hours ago
by XavierShaw

66 Comments

modeless

9 minutes ago

I tried asking it to practice Chinese with me. It claimed to be able to identify tones. I tested it by using the wrong tones on purpose and it said my pronunciation was "really great". Seems like it just praises you no matter what you do.

sashank_1509

7 hours ago

I have played with it for 20 minutes and here’s my review:

1. The low latency responses do make a difference. It feels miles better than any other voice chat out there.

2. Its pronunciation is excellent and very human-like, but it is not quite there. Somehow I can tell instantly that it's a chatbot; it feels firmly in the uncanny valley.

3. On the same note, if I were on a call with a chatbot on the other side, I could tell instantly. It's a mix of the voice and the way it responds; it just does not sound like a human talking to you. I tried a bit to make it sound more human-like, asking it to stop trying so hard in conversation, to be briefer, etc., but I wouldn't say it made things better.

And so my final review is: it is a big achievement, nothing else out there comes close, but it is like video game console graphics. You can instantly tell it's not the real thing, and because of that I find it harder to use than just typing to it.

achrono

6 hours ago

>Somehow I can tell instantly that it’s a chatbot [...] because of that I find it harder to use than just typing to it.

That to me is precisely the reason to still use it without hesitation, because once it starts getting very-much-human, I don't know if I want to use it unless I really have to.

I think there's a lot of merit in keeping it sounding just a little artificial so that it is easier to have some psychological distance from what is already an overly anthropomorphic experience.

In religion/religious studies, there is the occasional debate of whether or not deities are/ought to be anthropomorphic, and atheism of course finds the whole notion ridiculous. Considering that our hopes and dreams with AGI can often feel religious -- maybe it's time to take that same lens towards AI.

threeseed

6 hours ago

> I find it harder to use than just typing to it

Systems like this have existed since the 90s, e.g. Dragon, albeit in far more rudimentary form.

And the issues are exactly the same: (a) discoverability, (b) efficiency and (c) recoverability.

It is so much easier to have a screen with fixed options that you interact with, where you can easily see your journey and go back to correct any mistakes. Versus our voice, which is the clunkiest, slowest, and least precise input method we have.

noahjk

4 hours ago

> Versus with our voice which is the clunkiest, slowest and least precise input method we have.

Some related issues I have:

- my thoughts always seem to be jumbled when talking to AI

- I rush to talk quickly because any pause seems to trigger a response

- I worry words or DSL I use won’t be interpreted properly

This all leads to a pretty poor voice experience for me, and I usually forget half of what I want to talk about.

zurfer

6 hours ago

I understand how (a) discoverability and (c) recoverability are a problem, but what do you mean with (b) efficiency?

Most people talk faster than they can type.

Karunamon

2 hours ago

That assumes perfect accuracy. If a command is misheard, then you probably need to correct whatever is now in the wrong state, and then definitely reissue the original command. If it's text input, then you have to do some select/correct dance. Both of these things take a lot of time.

threeseed

6 hours ago

But that assumes that the speech recognition is perfect.

Which at least for those of us speaking non-US English is never the case.

And you only have to ring your bank and try to transfer money between accounts, with it reading out every account number and asking for confirmation at every step, versus a few clicks with a mouse, to see that for almost all operational tasks voice is cumbersome and inefficient.

danielbln

5 hours ago

Whisper is incredibly robust, with vast language coverage. I use it in German as well as English and it's incredibly reliable. Modern, transformer-based ASR is a different ballgame.

threeseed

5 hours ago

It needs to be perfect. Every single time.

Because voice interfaces don't have the equivalent of a delete key, or a way for the user to quickly select a different option.

infecto

6 hours ago

Voice is the future for certain interfaces. It's only clunky, slow, and imprecise because of the systems the voice is interacting with.

elif

5 hours ago

That's merely because our conversational capability has become diminished.

I can't keep a conversation going with AI as easily as a person because of my poor skills, no fault of the AI.

I will improve over time, and there is no reason I won't be able to become as natural as Jean-Luc Picard telling his starship what to do.

mewpmewp2

6 hours ago

But would you want it to feel exactly like a real person in the first place? I think for that it would have to make itself far less articulate, etc., as well.

empath75

4 hours ago

I generally like it, but bumping up against the guidelines, which you do if you want to do basically anything fun at all (singing!), is the most obnoxious experience, because it feels 10x worse coming from a fake smiling personality that sounds almost human than it does over text.

corobo

5 hours ago

I think the uncanny valley feeling is going to be there no matter what they come up with. I, and therefore my brain, know the voice is coming from a soulless machine[1], so it'll always feel a little off.

My perfect voice assistant would sound like Auto from WALL-E, which is supposedly a blend of macOS's Ralph and Zarvox voices. Along the lines of (bear in mind I just wrote this directly into the terminal and didn't spend any time actually blending them lol):

  say -v ralph -r 180 "I'm sorry Dave. I'm afraid I can't do that." &
  say -v zarvox -r 180 "I'm sorry Dave. I'm afraid I can't do that."

And yeah I'm almost convinced that the whole voice interaction thing came about because they interact with the computer in Star Trek using voice commands.. which is probably just because watching someone type everything into a keyboard would be some boring telly.

I assume there are folks that do use it and do like it, but do they like it more than just pressing buttons to do things? No worries of being misinterpreted or having to speak like a robot at Alexa because it's failed to turn the lights off 3 voice commands in a row now. It's awesome for accessibility, don't get me wrong, I'm talking in the sense of the primary and most commonly used interface.

[1] Not a criticism, fellow soulless machines.

throwaway13337

5 hours ago

I'm in Europe and was able to access the feature with a VPN.

Surprising that there isn't a 'hey Siri' for ChatGPT yet. Obviously, that would make this sort of feature infinitely more useful. This is what monopoly gatekeeping looks like.

The limitations in this feature show the problems with both EU proactive regulation and US underregulation.

Bad regulation has become the biggest issue standing in the way of useful software for humans.

terhechte

4 hours ago

How are you using it? I'm trying Mullvad as an iOS VPN and I didn't get it to work. Did you buy a separate subscription with a US credit card?

Nevermind, I deleted and re-installed the app on iOS while on VPN and now it works!

CubsFan1060

5 hours ago

I dunno, it's a single button press on my phone to access it. That seems plenty fine to me.

tkgally

6 hours ago

I got access to the Advanced Voice mode a couple of hours ago and have started testing it. (I had to delete and reinstall the ChatGPT app on my iPhone and iPad to get it to work. I am a ChatGPT Plus subscriber.)

In my tests so far it has worked as promised. It can distinguish and produce different accents and tones of voice. I am able to speak with it in both Japanese and English, going back and forth between the languages, without any problem. When I interrupt it, it stops talking and correctly hears what I said. I played it a recording of a one-minute news report in Japanese and asked it to summarize it in English, and it did so perfectly. When I asked it to summarize a continuous live audio stream, though, it refused.

I played the role of a learner of either English or Japanese and asked it for conversation practice, to explain the meanings of words and sentences, etc. It seemed to work quite well for that, too, though the results might be different for genuine language learners. (I am already fluent in both languages.) Because of tokenization issues, it might have difficulty explaining granular details of language—spellings, conjugations, written characters, etc.—and confuse learners as a result.

Among the many other things I want to know is how well it can be used for interpreting conversations between people who don’t share a common language. Previous interpreting apps I tested failed pretty quickly in real-life situations. This seems to have the potential, at least, to be much more useful.

(reposted from earlier item that sank quickly)

mnicky

4 hours ago

Interesting that the release comes a day after Google's new models [1]. Seems a bit like strategic timing :) Maybe they waited until a competitor released something so that they could upstage that release with theirs?

____

[1] Which, btw, I think deserve better sentiment. On benchmarks, the new Gemini Pro seems to be better than GPT-4o. It's just not so hyped...

kanwisher

6 hours ago

It's pretty amazing. If you are curious, just try asking it to do a live translation with a friend who speaks another language; it's realtime and very seamless.

martypitt

7 hours ago

> Advanced Voice is not yet available in the EU, the UK, Switzerland, Iceland, Norway, and Liechtenstein.

That's disappointing. I wonder if it's related to legal issues, technical issues, or just a phased rollout?

sunaookami

6 hours ago

Sam Altman only said this: https://x.com/sama/status/1838864011321872407

>except for jurisdictions that require additional external review

crimsoneer

4 hours ago

...I'm not at all clear what this external review would be for either the UK or Switzerland.

og_kalu

3 hours ago

Under a strict reading of the AI Act, advanced voice mode would be illegal because it is the "use of AI systems to infer emotions of a natural person in the areas of workplace and education institutions, except where the use of AI system is intended to be put in place or into the market for medical or safety reasons"

crimsoneer

an hour ago

But the AI Act doesn't apply in the UK, and as far as I know doesn't in Switzerland either...?

og_kalu

43 minutes ago

Sorry, I don't think it does either. I kind of glossed over the fact that you were talking about the UK specifically.

It's just speculation on my part that that's the actual reason for some of those markets.

It might have to do with (older) EU-specific regulations. I know the UK adopted many EU laws to expedite Brexit.

isodev

5 hours ago

Or perhaps it’s just not that good with other languages and accents.

terhechte

6 hours ago

Did someone figure out if it works when I use a VPN?

coreyh14444

6 hours ago

In Denmark with a Finnish Teams account. I've tried NordVPN, uninstalled, reinstalled, from my iPad and iPhone, and no luck.

crimsoneer

5 hours ago

Given it includes the UK, I assume it's GDPR (and probably linked to the provenance of training data?) rather than any new AI act stuff.

raverbashing

5 hours ago

I initially thought this had to do with regulation as well, but people trying it over a VPN are complaining about latency, which makes me believe there's also a technical (deployment regions) reason for it.

(Possibly language reasons too, though it seems to work with other languages.)

How well does it work in the UK (and how well does it understand regional accents)?

IncreasePosts

8 minutes ago

I doubt they would bother blocking the 1 user from Liechtenstein if it was about keeping the usage numbers low for performance reasons.

guappa

6 hours ago

When something is not available in EU, you instantly know they're up to no good.

edit: Sorry if privacy hurts your feelings, people who downvote me.

mewpmewp2

6 hours ago

To give the benefit of the doubt, it could also just be about a review process; it might become available after a review.

E.g. one valid concern would be storing your voice recording in the cloud in a tokenized format.

Because the model now takes your voice directly, presumably as tokens, the recording can't be immediately deleted, as opposed to speech-to-text, where the audio can be quickly converted to text and then deleted.

throwaway13337

5 hours ago

Mario Draghi, the former European Central Bank president, created a huge report for the EU that was just published. He singled out overregulation as a key issue in European progress.

Specifically, the report called out GDPR as costing small businesses 'more than' 15% of their profits.

This is indeed quite a hurdle. Privacy isn't really the issue - it's regulation that understands that complexity has a cost. We, as developers, should understand the deep, deep cost of complexity.

As a business owner that respects privacy a great deal, GDPR and regulation like it are still an immense hurdle - the cost in understanding and in doing things to the letter of the law is, I think, hard to grasp from the outside. Regulations occupy a huge amount of space in my head that was previously filled with making a better product.

PDF available here:

https://commission.europa.eu/topics/strengthening-european-c...

lynx23

6 hours ago

> ... you instantly know they're up to no good.

You're referring to the EU here, right?

guappa

6 hours ago

No.

Not that everything they do is good. But something being illegal in the EU is a bad indicator for sure.

ben_w

5 hours ago

On privacy, I agree.

On AI it's harder to say, especially with the attitudes of the people making it being a split-personality mix of "wow this is incredible" and "it might kill everyone". That's enough of a surprise to legislators that I don't know what to predict, nor who has the right approach: USA (e/acc), EU (conservative), China (not sure; I suspect 'harmonious', but I acknowledge I'm projecting a national stereotype).

M4v3R

7 hours ago

It's definitely legal issues, probably the same reason Apple is not rolling out some of the new features like Apple Intelligence in iOS 18 to the EU. [0]

[0] https://www.macworld.com/article/2374452/apple-intelligence-...

nindalf

6 hours ago

> Apple Intelligence was only ever going to support American English this year anyway, with other languages coming in 2025 and beyond. That would somewhat limit its effectiveness in the EU to begin with ...

Their AI product wasn't ready for other languages, not even British English where the DMA definitely doesn't apply.

A person would have to be very naive to believe Apple directly here. They just want to generate bad press for the DMA.

layer8

5 hours ago

Apple will be rolling out AI in Switzerland but not in the EU, despite shared sets of languages.

jacooper

6 hours ago

I don't believe this; GPT-4o and ChatGPT are already available in the EU. Adding advanced voice mode doesn't change anything.

mewpmewp2

6 hours ago

The main change compared to the old voice mode is that your speech, instead of being fed to a speech-to-text model, is now fed to the LLM directly. I'm not sure if that would be enough to trigger a need for external review, though.

If they were using your voice recording for training purposes, then of course, absolutely, and then for good reason.

Or it could also be about the way the voice recording is transported, where it's stored, and for how long. Because speech-to-text could be done even offline, not in the cloud, and the audio kept for a shorter period of time: you can delete the recording and just use the text.

If the voice recording is tokenized in an ongoing conversation then it would be stored indefinitely.
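The storage distinction being speculated about here can be sketched as a toy pipeline comparison. This is purely illustrative; the function names are hypothetical stand-ins, not OpenAI's actual architecture:

```python
# Toy sketch of the two voice pipelines discussed above (all names hypothetical).

def speech_to_text(audio: bytes) -> str:
    # Stand-in ASR step; here we just pretend the audio decodes to its transcript.
    return audio.decode("utf-8")

def audio_to_tokens(audio: bytes) -> list:
    # Stand-in audio tokenizer: the voice itself becomes model input.
    return list(audio)

def legacy_pipeline(audio: bytes):
    # Old voice mode: only text reaches the model, so the raw recording
    # can be deleted as soon as transcription finishes.
    text = speech_to_text(audio)
    retained_audio = None  # nothing voice-derived needs to persist
    return text, retained_audio

def advanced_pipeline(audio: bytes):
    # Advanced voice mode: audio-derived tokens live in the conversation
    # context, so voice data persists for as long as the context does.
    tokens = audio_to_tokens(audio)
    retained_context = tokens  # the retained context includes the voice tokens
    return tokens, retained_context
```

In the first sketch the raw audio is disposable once transcription finishes; in the second, audio-derived tokens are part of the conversation state, which is the retention concern raised above.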

walthamstow

5 hours ago

It doesn't have to change anything. OpenAI are doing it for political reasons

m3kw9

2 hours ago

Some review bullet points:

1. It's a bit too agreeable. Example: "that's an excellent point" etc. every single time.

2. It understands surprisingly well. Example: from experience, when I explain something vaguely, my expectation is that it won't understand, but it does most of the time. It removes the frustration of needing to spell everything out in much more detail.

3. It feels like talking to a real person, but the AI talks in a sort of monotonic way. Example: it responds with a similar tone/excitement every time.

4. Very useful if you need to chat but don't want to chat with humans about some subjects, like ideas and explanations.

user

9 minutes ago

[deleted]

ionwake

6 hours ago

[flagged]

user

8 minutes ago

[deleted]

jsheard

6 hours ago

A lot of EU laws were copy pasted into UK law to expedite Brexit, so it may be that whatever they're blocked by still applies.

nindalf

6 hours ago

Brexit was finalised in 2021. AI legislation came into vogue 2023 onwards.

If there's UK legislation blocking this, then it's entirely on the UK.

ben_w

5 hours ago

I vaguely recall people saying that Ireland has a similar issue with the UK, despite gaining independence starting around a century ago.

Not being a lawyer, I couldn't possibly comment.

crimsoneer

5 hours ago

I think it's highly likely this is just good old GDPR. Specifically, any personal data used to train the model (which would include any person recorded in any audio clip or present in any picture) would require that person's consent for their data being used to train OpenAI's model, which they pretty emphatically have not given.

ionwake

5 hours ago

Well then, surely OpenAI just need to make a consent popup for the UK; they should do that ASAP.

crimsoneer

5 hours ago

The blocker definitely isn't the current users of the application, but all the people involved in the creation of the training data.

ionwake

4 hours ago

I don't understand what this statement has to do with the conversation about releasing a new voice mode on top of an existing model that's allowed in the UK.

crimsoneer

an hour ago

Honestly, it's all speculation, but advanced voice mode is probably trained on significantly more audio recordings than the previous voice mode, so maybe it's more likely to be trained on audio clips that include non-consenting participants.

But yes, the fact is I can't think of any meaningful reason why UK legislation would impact Advanced Voice but not the previous model releases.

ionwake

6 hours ago

Not sure if there are any policy makers reading this in the UK, but if this isn't fixed, people like me will move to the States. I just don't understand why there can't be an opt-in.

crimsoneer

5 hours ago

But nobody knows what "this" is, which is kind of a problem. Anthropic are releasing models in the UK but not the EU, and GPT-4 is available in both, so it's not really clear if it's a legal decision or a comms one.

starfezzy

6 hours ago

I was hoping to find a voice like Microsoft's "Guy" (a name, not referring to the gender) or Google Assistant's "Pink". An unambiguously white, masculine, American "radio voice" or "audiobook narrator" voice.

ChatGPT describes this as "A rich, deep, and smooth tone that is pleasant to listen to for extended periods. This often comes from good control over pitch and timbre, creating a voice that resonates well."

If you watch YouTube, voices in this vein are the Pirate Software guy and the voice of The Infographics Show.

There are similar voices for every gender, race, and nationality. As an American, Morgan Freeman comes to mind as a comfy black, masculine narrator voice.

All this is to lead up to my point that companies engage in meticulous science when deciding who should voice roles, especially when the product itself is literally just a synthetic voice and they have near-limitless capacity to shape it.

With that in mind, here are the voices that OpenAI wants us to hear:

Breeze: ambiguous gender, white, feminine

Juniper: female, black

Maple: female, white

Spruce: male, black, masculine

Arbor: male, Australian, masculine

Sol: female, white

Ember: male, black, less masculine

Cove: male, Sal Khan, less masculine

Vale: female, British

The only one that could be considered a narrator/radio voice is unambiguously black (great if that's your preference). It just seems weird that they would intentionally exclude a masculine white male voice, and that sucks because those are always my preferred voices when I'm looking for audiobooks or choosing a computer voice. It sucks in particular because OpenAI is not staffed by dumb people; this exclusion was intentional, and that's obnoxious.

My last note on the Advanced Voice feature is that it makes my phone HOT within a few seconds, which will limit its usefulness on sunny days, when I need hands-free use the most with the phone mounted to my dash. That is when the device is already liable to overheat (display forced to dim, lagging due to CPU cores shutting down, and in the worst case the phone shutting off and refusing to work until it cools down).

replwoacause

4 hours ago

What could be the reason for excluding a white male voice, do you think?

user

4 hours ago

[deleted]

pbronez

5 hours ago

Overheating issue is interesting. For me, driving is the key use case for voice bots. I want to have a free flowing conversation with an AI assistant while I’m driving alone. I want it to tie into my work systems so I can dig through the CRM, draft documents, and handle email.