Show HN: Utter, a local-first dictation app for Mac and iPhone

6 points, posted a day ago
by hubab

6 Comments

Leftium

a day ago

There are literally two new dictation apps on Show HN every week: https://hn.algolia.com/?dateRange=pastWeek&page=0&prefix=fal...

This one is unique in that it supports iPhone. I haven't seen mobile support very often.

Despite all these apps, there are two things holding me back from using a dictation app on a regular basis:

- streaming transcription: see words in realtime

- multimodal input: mix voice with keyboard

So I started prototyping this type of realtime multimodal dictation UX: https://rift-transcription.vercel.app

This HN comment captures why streaming is important for transcription: https://hw.leftium.com/#/item/47149479

hubab

a day ago

Streaming transcription is something I’m working on. The main challenge so far has been accuracy. Streaming models, especially cloud ones, often drop enough quality that the tradeoff isn’t always worth it. Local models look more promising, so streaming will likely land there first.

On multimodal input, the UX you’re prototyping where you switch between dictating and typing while composing is interesting. I haven’t really seen that approach before.

The direction I took is a bit different. Instead of mixing modalities mid-composition, dictation becomes context-aware during post-processing. Selected or copied text, or the surrounding field content, can be inserted into the post-processing prompt so the spoken input is interpreted relative to what's already on screen.
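A minimal sketch of what that context-aware post-processing prompt might look like. The function name and prompt wording are illustrative assumptions, not Utter's actual implementation:

```python
def build_postprocess_prompt(transcript: str,
                             selected_text: str = "",
                             field_content: str = "") -> str:
    """Assemble an LLM prompt that interprets dictated text relative
    to what's already on screen (hypothetical, for illustration)."""
    parts = ["Rewrite the dictated text below into clean prose."]
    if selected_text:
        # Text the user had selected/copied when dictation started.
        parts.append("The user had this text selected:\n" + selected_text)
    if field_content:
        # Content of the text field the transcript will be inserted into.
        parts.append("The surrounding field contains:\n" + field_content)
    parts.append("Dictated text:\n" + transcript)
    return "\n\n".join(parts)
```

The resulting prompt is then sent to the post-processing model along with the raw transcript, so "replace that with X" style utterances can resolve against the selected text.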

Leftium

a day ago

Yeah, I will add post-processing to my prototype, too. I already prepared a detailed spec (and I'm prototyping new approaches to this as well): https://github.com/Leftium/rift-transcription/blob/main/spec...

One idea I was tossing around was streaming transcription + batch re-transcription:

- Use streaming transcription, which works most of the time (for example, I've found the Web Speech API pretty good, as well as moonshine)

- If the streaming transcription was poor, select the bad part and re-transcribe with a more accurate batch transcription model.
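The streaming + batch re-transcription idea above could be sketched roughly like this. The function and its interface (word-level timestamps from the streaming pass, a `batch_model` callable over an audio slice) are assumptions for illustration, not any particular API:

```python
def retranscribe_span(words, audio, bad_range, batch_model, sr=16000):
    """Replace a poorly transcribed word span with batch-model output.

    words:      [(word, start_s, end_s)] from the streaming pass
    audio:      raw samples at `sr` Hz
    bad_range:  (i, j) word indices the user selected as bad
    batch_model: callable mapping an audio slice to text (assumed interface)
    """
    i, j = bad_range
    # Use the streaming word timestamps to locate the matching audio slice.
    t0, t1 = words[i][1], words[j - 1][2]
    fixed = batch_model(audio[int(t0 * sr):int(t1 * sr)])
    # Splice the more accurate batch output into the streaming transcript.
    return [w for w, *_ in words[:i]] + fixed.split() + [w for w, *_ in words[j:]]
```

The key point is that only the selected span is re-run through the slower, more accurate model, so the cost stays proportional to how often streaming gets it wrong.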

helro

a day ago

I tested something similar, and continuous re-transcription was the only way I could get close to batch-level accuracy.

In my current implementation I’m fairly aggressive with it. I don’t rely much on streaming word confidence. Instead I continuously reprocess audio using a sliding window. As new audio comes in, it’s retranscribed together with the previous segment so the model always sees a longer context.

That recovers a lot of the accuracy lost with streaming, but the amount of retranscription makes it hard to justify economically with cloud APIs. That’s why I’m focusing on a local-first approach for now.
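The sliding-window approach described above could be sketched as follows. This is a simplified illustration of the windowing idea only (chunk sizes, the `transcribe` callable, and the lack of prefix stitching are all assumptions):

```python
def stream_with_rewindow(chunks, transcribe, window=4):
    """Yield progressively refined transcripts: each new audio chunk is
    re-transcribed together with up to `window - 1` previous chunks, so
    the model always sees a longer context (sketch, not production code)."""
    buf = []
    for chunk in chunks:
        buf.append(chunk)
        # Re-run the model over the whole recent window on every new chunk.
        yield transcribe(buf[-window:])
    # A real system would also stitch the committed (pre-window) prefix
    # back onto each result and reconcile overlapping words.
```

This makes the retranscription cost clear: every chunk is transcribed up to `window` times, which is why it is hard to justify against per-minute cloud pricing but cheap enough locally.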

r0fl

a day ago

I ask this each time I see one of these apps

How is this different from using the speech-to-text feature that's built into my iPhone or Mac, for free? I can talk into Voice Memos as well and get a full transcript, even from crazy long files

Thanks

hubab

a day ago

The main differences are transcription quality and what happens after the transcript is generated.

Utter uses GPT-4o Transcribe by default for cloud transcription, and in my experience it’s best in class. The gap is most obvious on names, niche terminology, and technical vocabulary. I use it a lot for prompting coding agents, and I've found Apple’s built-in dictation and most other apps don't come close in terms of accuracy.

It also adds a custom post-processing step. So instead of ending up with a raw transcript, you can record a long, messy voice note and have it turned into clean, structured markdown notes.

If you want to test the accuracy difference yourself, try dictating this with both Apple dictation and ChatGPT's web voice input (which uses the same model) and compare the output:

“My FastAPI service uses Pydantic, Celery, Redis, and SQLAlchemy, but the async worker is deadlocking when a background task retries after a Postgres connection pool timeout.”