OpenAI rolls out Advanced Voice Mode with more voices and a new look

56 points, posted 7 hours ago
by XavierShaw

66 Comments

modeless

9 minutes ago

I tried asking it to practice Chinese with me. It claimed to be able to identify tones. I tested it by using the wrong tones on purpose and it said my pronunciation was "really great". Seems like it just praises you no matter what you do.

sashank_1509

7 hours ago

I have played with it for 20 minutes and here’s my review:

1. The low latency responses do make a difference. It feels miles better than any other voice chat out there.

2. Its pronunciation is excellent and very human-like, but it is not quite there. Somehow I can tell instantly that it's a chatbot; it feels firmly in the uncanny valley.

3. On the same note, if I were on a call with a chatbot on the other side, I could tell instantly. It's a mix of the voice and the way it responds; it just does not sound like a human talking to you. I tried a bit to make it sound more human-like, asking it to stop trying so hard in conversation, to be briefer, etc., but I wouldn't say it made things better.

And so my final review is: it is a big achievement, nothing else out there comes close, but it is like video game console graphics. You can instantly tell it's not the real thing, and because of that I find it harder to use than just typing to it.

achrono

6 hours ago

>Somehow I can tell instantly that it’s a chatbot [...] because of that I find it harder to use than just typing to it.

That to me is precisely the reason to still use it without hesitation, because once it starts getting very-much-human, I don't know if I want to use it unless I really have to.

I think there's a lot of merit in keeping it sounding just a little artificial so that it is easier to have some psychological distance from what is already an overly anthropomorphic experience.

In religion/religious studies, there is the occasional debate of whether or not deities are/ought to be anthropomorphic, and atheism of course finds the whole notion ridiculous. Considering that our hopes and dreams with AGI can often feel religious -- maybe it's time to take that same lens towards AI.

threeseed

6 hours ago

> I find it harder to use than just typing to it

Systems like this have existed since the 90s, e.g. Dragon, albeit in far more rudimentary form.

And the issues are exactly the same: (a) discoverability, (b) efficiency and (c) recoverability.

It is so much easier to have a screen with fixed options that you interact with, where you can easily see your journey and go back to correct any mistakes. Versus our voice, which is the clunkiest, slowest, and least precise input method we have.

noahjk

4 hours ago

> Versus with our voice which is the clunkiest, slowest and least precise input method we have.

Some related issues I have:

- my thoughts always seem to be jumbled when talking to AI

- I rush to talk quickly because any pause seems to trigger a response

- I worry words or DSL I use won’t be interpreted properly

This all leads to a pretty poor voice experience for me, and I usually forget half of what I want to talk about.

zurfer

6 hours ago

I understand how (a) discoverability and (c) recoverability are a problem, but what do you mean with (b) efficiency?

Most people talk faster than they can type.

Karunamon

2 hours ago

That assumes perfect accuracy. If a command is misheard, then you probably need to correct whatever is now in the wrong state, and then definitely reissue the original command. If it's text input, then you have to do some select/correct dance. Both of these things take a lot of time.

threeseed

6 hours ago

But that assumes that the speech recognition is perfect.

Which at least for those of us speaking non-US English is never the case.

And you only have to ring your bank and try to transfer money between accounts, with it reading out every account number and asking for confirmation at every step, versus a few clicks with a mouse, to see that for almost all operational tasks voice is cumbersome and inefficient.

danielbln

5 hours ago

Whisper is incredibly robust, with vast language coverage. I use it in German as well as English and it's incredibly reliable. Modern, transformer-based ASR is a different ballgame.

threeseed

5 hours ago

It needs to be perfect. Every single time.

Because voice interfaces don't have the equivalent of a delete key, or a way for the user to quickly select a different option.

infecto

6 hours ago

Voice is the future for certain interfaces. It's only clunky, slow, and imprecise because of the systems the voice is interacting with.

elif

5 hours ago

That's merely because our conversational capability has become diminished.

I can't keep a conversation going with AI as easily as a person because of my poor skills, no fault of the AI.

I will improve over time, and there is no reason I won't be able to become as natural as Jean-Luc Picard telling his starship what to do.

mewpmewp2

6 hours ago

But would you want it to feel exactly like a real person in the first place? I think for that it would have to make itself far less articulate, etc., as well.

empath75

4 hours ago

I generally like it, but bumping up against the guidelines, which you do if you want to do basically anything fun at all (singing!), is the most obnoxious experience, because it feels 10x worse coming from a fake smiling personality that sounds almost human than it does over text.

corobo

5 hours ago

I think the uncanny valley feeling is going to be there no matter what they come up with. I, and therefore my brain, know the voice is coming from a soulless machine[1], so it'll always feel a little off.

My perfect voice assistant would sound like Auto from WALL-E, which is supposedly a blend of macOS's Ralph and Zarvox voices. Along the lines of (bear in mind I just wrote this directly into the terminal and didn't spend any time actually blending them lol):

  say -v ralph -r 180 "I'm sorry Dave. I'm afraid I can't do that." &
  say -v zarvox -r 180 "I'm sorry Dave. I'm afraid I can't do that."

And yeah I'm almost convinced that the whole voice interaction thing came about because they interact with the computer in Star Trek using voice commands.. which is probably just because watching someone type everything into a keyboard would be some boring telly.

I assume there are folks that do use it and do like it, but do they like it more than just pressing buttons to do things? No worries of being misinterpreted or having to speak like a robot at Alexa because it's failed to turn the lights off 3 voice commands in a row now. It's awesome for accessibility, don't get me wrong, I'm talking in the sense of the primary and most commonly used interface.

[1] Not a criticism, fellow soulless machines.

throwaway13337

5 hours ago

I'm in Europe and was able to access the feature with a VPN.

Surprising that there isn't a 'hey Siri' for ChatGPT yet. Obviously, that would make this sort of feature infinitely more useful. This is what monopoly gatekeeping looks like.

The limitations in this feature show the problems with both EU proactive regulation and US underregulation.

Bad regulation has become the biggest issue standing in the way of useful software for humans.

terhechte

4 hours ago

How are you using it? I'm trying Mullvad as an iOS VPN and I didn't get it to work. Did you buy a separate subscription with a US credit card?

Nevermind, I deleted and re-installed the app on iOS while on VPN and now it works!

CubsFan1060

5 hours ago

I dunno, it's a single button press on my phone to access it. That seems plenty fine to me.

tkgally

6 hours ago

I got access to the Advanced Voice mode a couple of hours ago and have started testing it. (I had to delete and reinstall the ChatGPT app on my iPhone and iPad to get it to work. I am a ChatGPT Plus subscriber.)

In my tests so far it has worked as promised. It can distinguish and produce different accents and tones of voice. I am able to speak with it in both Japanese and English, going back and forth between the languages, without any problem. When I interrupt it, it stops talking and correctly hears what I said. I played it a recording of a one-minute news report in Japanese and asked it to summarize it in English, and it did so perfectly. When I asked it to summarize a continuous live audio stream, though, it refused.

I played the role of a learner of either English or Japanese and asked it for conversation practice, to explain the meanings of words and sentences, etc. It seemed to work quite well for that, too, though the results might be different for genuine language learners. (I am already fluent in both languages.) Because of tokenization issues, it might have difficulty explaining granular details of language—spellings, conjugations, written characters, etc.—and confuse learners as a result.

Among the many other things I want to know is how well it can be used for interpreting conversations between people who don’t share a common language. Previous interpreting apps I tested failed pretty quickly in real-life situations. This seems to have the potential, at least, to be much more useful.

(reposted from earlier item that sank quickly)

mnicky

4 hours ago

Interesting that the release comes a day after Google's new models [1]. Seems a bit like strategic timing :) Maybe they waited until a competitor released something so that they could upstage that release with theirs?

____

[1] Which, btw, I think deserve better sentiment. On benchmarks, the new Gemini Pro seems to be better than GPT-4o. It's just not so hyped...

kanwisher

6 hours ago

It's pretty amazing. If you are curious, just try asking it to do a live translation with a friend who speaks another language; it's realtime and very seamless.

martypitt

7 hours ago

> Advanced Voice is not yet available in the EU, the UK, Switzerland, Iceland, Norway, and Liechtenstein.

That's disappointing. I wonder if it's related to legal issues, technical issues, or just a phased rollout?

sunaookami

6 hours ago

Sam Altman only said this: https://x.com/sama/status/1838864011321872407

>except for jurisdictions that require additional external review

crimsoneer

4 hours ago

...I'm not at all clear what this external review would be for either the UK or Switzerland.

og_kalu

3 hours ago

Under a strict reading of the AI Act, advanced voice mode would be illegal because it is the "use of AI systems to infer emotions of a natural person in the areas of workplace and education institutions, except where the use of AI system is intended to be put in place or into the market for medical or safety reasons"

crimsoneer

an hour ago

But the AI Act doesn't apply in the UK, and as far as I know doesn't in Switzerland either...?

og_kalu

43 minutes ago

Sorry, I don't think it does either. I kind of glossed over the fact that you were talking about the UK specifically.

It's just speculation on my part that that's the actual reason for some of those markets.

It might have to do with (older) EU-specific regulations. I know the UK adopted many EU laws to expedite Brexit.

isodev

5 hours ago

Or perhaps it’s just not that good with other languages and accents.

terhechte

6 hours ago

Did someone figure out if it works when I use a VPN?

coreyh14444

6 hours ago

In Denmark with a Finnish Teams account. I've tried NordVPN, uninstalled, reinstalled, from my iPad and iPhone, and no luck.

crimsoneer

5 hours ago

Given it includes the UK, I assume it's GDPR (and probably linked to the provenance of training data?) rather than any new AI act stuff.

raverbashing

5 hours ago

I initially thought this had to do with regulation as well, but people trying it over a VPN are complaining about latency, which makes me believe there's also a technical (deployment regions) reason for it.

(Possibly language reasons too, though it seems to work with other languages.)

How well does it work in the UK (and how well does it understand regional accents)?

IncreasePosts

8 minutes ago

I doubt they would bother blocking the 1 user from Liechtenstein if it was about keeping the usage numbers low for performance reasons.

guappa

6 hours ago

When something is not available in EU, you instantly know they're up to no good.

edit: Sorry if privacy hurts your feelings, people who downvote me.

mewpmewp2

6 hours ago

To give the benefit of the doubt, it could also just be about a review process; it might become available after a review.

E.g. one valid concern would be storing your voice recording in the cloud in a tokenized format.

Because the model now takes your voice directly, presumably as tokens, the recording can't be immediately deleted, as opposed to speech-to-text, where the audio can be quickly converted to text and then deleted.

throwaway13337

5 hours ago

Mario Draghi, the former European Central Bank president, created a huge report for the EU that was just published. He singled out overregulation as a key issue in European progress.

Specifically, the report called out GDPR as costing small businesses 'more than' 15% of their profits.

This is indeed quite a hurdle. Privacy isn't really the issue - it's regulation that understands that complexity has a cost. We, as developers, should understand the deep, deep cost of complexity.

As a business owner that respects privacy a great deal, GDPR and regulation like it are still an immense hurdle - the cost in understanding and in doing things to the letter of the law is, I think, hard to grasp from the outside. Regulations occupy a huge amount of space in my head that was previously filled with making a better product.

PDF available here:

https://commission.europa.eu/topics/strengthening-european-c...

lynx23

6 hours ago

> ... you instantly know they're up to no good.

You're referring to the EU here, right?

guappa

6 hours ago

No.

Not that everything they do is good. But something being illegal in the EU is a bad indicator for sure.

ben_w

5 hours ago

On privacy, I agree.

On AI it's harder to say, especially with the attitudes of the people making it being a split-personality mix of "wow this is incredible" and "it might kill everyone". That's enough of a surprise to legislators that I don't know what to predict, nor who has the right approach: USA (e/acc), EU (conservative), China (not sure; I suspect 'harmonious', but I acknowledge I'm projecting a national stereotype).

M4v3R

7 hours ago

It's definitely legal issues, probably the same reason Apple is not rolling out some of the new features like Apple Intelligence in iOS 18 to the EU. [0]

[0] https://www.macworld.com/article/2374452/apple-intelligence-...

nindalf

6 hours ago

> Apple Intelligence was only ever going to support American English this year anyway, with other languages coming in 2025 and beyond. That would somewhat limit its effectiveness in the EU to begin with ...

Their AI product wasn't ready for other languages, not even British English where the DMA definitely doesn't apply.

A person would have to be very naive to believe Apple directly here. They just want to generate bad press for the DMA.

layer8

5 hours ago

Apple will be rolling out AI in Switzerland but not in the EU, despite shared sets of languages.

jacooper

6 hours ago

I don't believe this; GPT-4o and ChatGPT are already available in the EU. Adding advanced voice mode doesn't change anything.

mewpmewp2

6 hours ago

The main change compared to the old voice mode is that your speech, instead of being fed to a speech-to-text model, is now fed to the LLM directly. I'm not sure if that would be enough to trigger a need for external review, though.

If they were using your voice recording for training purposes, then of course, absolutely, and then for good reason.

Or it could also be about the way the voice recording is transported, where it's stored, and for how long. Because speech-to-text could be done even offline, not in the cloud, and the audio kept for a shorter period of time: you can delete the recording and just use the text.

If the voice recording is tokenized in an ongoing conversation then it would be stored indefinitely.
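The storage distinction being speculated about here can be sketched as a toy pipeline comparison. This is purely illustrative; the function names are hypothetical stand-ins, not OpenAI's actual architecture:

```python
# Toy sketch of the two voice pipelines discussed above (all names hypothetical).

def speech_to_text(audio: bytes) -> str:
    # Stand-in ASR step; here we just pretend the audio decodes to its transcript.
    return audio.decode("utf-8")

def audio_to_tokens(audio: bytes) -> list:
    # Stand-in audio tokenizer: the voice itself becomes model input.
    return list(audio)

def legacy_pipeline(audio: bytes):
    # Old voice mode: only text reaches the model, so the raw recording
    # can be deleted as soon as transcription finishes.
    text = speech_to_text(audio)
    retained_audio = None  # nothing voice-derived needs to persist
    return text, retained_audio

def advanced_pipeline(audio: bytes):
    # Advanced voice mode: audio-derived tokens live in the conversation
    # context, so voice data persists for as long as the context does.
    tokens = audio_to_tokens(audio)
    retained_context = tokens  # the retained context includes the voice tokens
    return tokens, retained_context
```

In the first sketch the raw audio is disposable once transcription finishes; in the second, audio-derived tokens are part of the conversation state, which is the retention concern raised above.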

walthamstow

5 hours ago

It doesn't have to change anything. OpenAI are doing it for political reasons

m3kw9

2 hours ago

Some review bullet points:

1. It's a bit too agreeable. Example: "that's an excellent point" etc. every single time.

2. It understands surprisingly well. Example: from experience, when I explain something vaguely, my expectation is that it won't understand, but it does most of the time. It removes the frustration of needing to spell everything out in much more detail.

3. It feels like talking to a real person, but the AI talks in a sort of monotonic way. Example: it responds with a similar tone/excitement every time.

4. Very useful if you need to chat but don't want to chat with humans about some subjects, like ideas and explanations.

user

9 minutes ago

[deleted]

ionwake

6 hours ago

[flagged]

user

8 minutes ago

[deleted]

jsheard

6 hours ago

A lot of EU laws were copy pasted into UK law to expedite Brexit, so it may be that whatever they're blocked by still applies.

nindalf

6 hours ago

Brexit was finalised in 2021. AI legislation came into vogue 2023 onwards.

If there's UK legislation blocking this, then it's entirely on the UK.

ben_w

5 hours ago

I vaguely recall people saying that Ireland has a similar issue with the UK, despite gaining independence starting around a century ago.

Not being a lawyer, I couldn't possibly comment.

crimsoneer

5 hours ago

I think it's highly likely this is just good old GDPR. Specifically, any personal data used to train the model (which would include any person recorded in any audio clip or present in any picture) would require that person's consent for their data being used to train OpenAI's model, which they pretty emphatically have not given.

ionwake

5 hours ago

Well then, surely OpenAI just need to make a consent popup for the UK; they should do that ASAP.

crimsoneer

5 hours ago

The blocker definitely isn't the current users of the application, but all the people involved in the creation of the training data.

ionwake

4 hours ago

I don't understand what this statement has to do with the conversation about releasing a new voice mode on top of an existing model that's allowed in the UK.

crimsoneer

an hour ago

Honestly, it's all speculation, but advanced voice mode is probably trained on significantly more audio recordings than the previous voice mode, so maybe it's more likely to be trained on audio clips that include non-consenting participants.

But yes, the fact is I can't think of any meaningful reason why UK legislation would impact Advanced Voice but not the previous model releases.

ionwake

6 hours ago

Not sure if there are any policy makers reading this in the UK, but if this isn't fixed, people like me will move to the States. I just don't understand why there can't be an opt-in.

crimsoneer

5 hours ago

But nobody knows what "this" is, which is kind of a problem. Anthropic are releasing models in the UK but not the EU, and GPT-4 is available in both, so it's not really clear if it's a legal decision or a comms one.

starfezzy

6 hours ago

I was hoping to find a voice like Microsoft's "Guy" (a name, not referring to the gender) or Google Assistant's "Pink". An unambiguously white, masculine, American "radio voice" or "audiobook narrator" voice.

ChatGPT describes this as "A rich, deep, and smooth tone that is pleasant to listen to for extended periods. This often comes from good control over pitch and timbre, creating a voice that resonates well."

If you watch YouTube, voices in this vein are the Pirate Software guy and the voice of The Infographics Show.

There are similar voices for every gender, race, and nationality. As an American, Morgan Freeman comes to mind as a comfy black, masculine narrator voice.

All this is to lead up to my point that companies engage in meticulous science when deciding who should voice roles, especially when the product itself is literally just a synthetic voice and they have near-limitless capacity to shape it.

With that in mind, here are the voices that OpenAI wants us to hear:

Breeze: ambiguous gender, white, feminine

Juniper: female, black

Maple: female, white

Spruce: male, black, masculine

Arbor: male, Australian, masculine

Sol: female, white

Ember: male, black, less masculine

Cove: male, Sal Khan, less masculine

Vale: female, British

The only one that could be considered a narrator/radio voice is unambiguously black (great if that's your preference). It just seems weird that they would intentionally exclude a masculine white male voice, and that sucks because those are always my preferred voices when I'm looking for audiobooks or choosing a computer voice. It sucks in particular because OpenAI is not staffed by dumb people; this exclusion was intentional, and that's obnoxious.

My last note on the Advanced Voice feature is that it makes my phone HOT within a few seconds, which will limit its usefulness on sunny days, when I need hands-free use the most with the phone mounted to my dash. That is when the device is already liable to overheat (display forced to dim, lagging due to CPU cores shutting down, and in the worst case the phone shutting off and refusing to work until it cools down).

replwoacause

4 hours ago

What could be the reason for excluding a white male voice, do you think?

user

4 hours ago

[deleted]

pbronez

5 hours ago

Overheating issue is interesting. For me, driving is the key use case for voice bots. I want to have a free flowing conversation with an AI assistant while I’m driving alone. I want it to tie into my work systems so I can dig through the CRM, draft documents, and handle email.