jiehong
6 hours ago
For those who haven't tried it: this lets DeepSeek understand a picture (instead of just extracting text from it) and describe what's in it. It is not an image generation system, though, so you can't ask it to modify an image.
Personally, I'm a bit surprised the DS chat app still doesn't offer its own text to speech and speech to text features (I know DS doesn't have any ASR model for example, but there are quite a few in the open).
testbjjl
an hour ago
DeepSeek interpreting screenshots and images I send it at fractions of what I pay Claude and ChatGPT, for me, is of far higher priority than supporting dictation. There are workarounds for dictation but not image processing.
paulluuk
6 hours ago
Can you explain what the benefits are of actually "talking" with the bot instead of typing and reading?
As someone who would rather send a Slack message to a coworker than actually walk over and talk to them, the idea of having to talk with my laptop is not appealing at all, haha.
WhitneyLand
41 minutes ago
It’s crucial to use for driving/walking.
One problem has been that the ChatGPT/Claude apps don’t really do this well. They use weak and/or non-reasoning models for voice interaction and the UX is not optimized for hands-free use.
I wrote an iOS chatbot app mainly for this purpose for myself and family/friends. Allows starting/sending voice prompts with the action button so I never have to look at the screen. Supports any model at any reasoning level so conversations are not dumbed down. Added a video transcription tool so any model can “read” YouTube/Tiktok videos and chat about them. Great to discuss lectures on tech topics.
It takes slightly longer to use a reasoning model for voice interaction, but I prefer the intelligence. The latency can be minimized in a few ways; bidirectional streaming helps. It’s TTS agnostic: I’ve got a few selectable providers, and the output can be styled by prompt, e.g. “use a chill tone that’s not too eager”.
cicko
4 hours ago
If you spend your life sitting in a chair, that's fine. I tend to get all kinds of ideas, questions, and research needs while I'm walking around. Typing a paragraph or two of context takes too much time and is very risky. Especially when driving. But also just walking, cooking, cleaning, etc. Sometimes it's just not practical - winter, carrying stuff... I mostly feel privileged if I can just sit at a computer and type my question and have the time to read the answer.
itake
5 hours ago
I am someone who prefers sending a Slack message to a coworker over talking to them, and I use AI.
My current flow is: Google Eloquent to capture 127 WPM (my typing, best case, is 65 WPM). This lets me get the thoughts out without thinking too much about structure or flow, the same way I would brain-dump when typing.
Next I use AI to compress, summarize, and restructure to create a clear coherent message for my peer to read (which is way faster for them).
When communicating with AI, it's the same thing, except I skip the second step since AI does a good job at understanding my ramblings.
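For a rough sense of what those two rates mean in practice, here is a back-of-the-envelope calculation. The 127 and 65 WPM figures are the ones quoted above; the 300-word message length is a made-up example, and this ignores any time spent correcting transcription errors.

```python
def minutes_to_convey(words: int, wpm: float) -> float:
    """Minutes needed to produce `words` words at a steady `wpm`."""
    return words / wpm

DICTATION_WPM = 127   # figure quoted in the comment
TYPING_WPM = 65       # figure quoted in the comment
MESSAGE_WORDS = 300   # hypothetical message length, for illustration only

speedup = DICTATION_WPM / TYPING_WPM
saved = (minutes_to_convey(MESSAGE_WORDS, TYPING_WPM)
         - minutes_to_convey(MESSAGE_WORDS, DICTATION_WPM))

print(f"speedup: {speedup:.2f}x")
print(f"minutes saved on {MESSAGE_WORDS} words: {saved:.1f}")
```

So dictation at those rates is a bit under twice as fast as typing, before counting the later compress/summarize step.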
----
It drives me crazy that some cultures only send voice messages to each other. It drives me crazy they can't be respectful of my time and use STT+AI to convert their 90 second monologue to a few written sentences.
jnovek
2 hours ago
I would find this behavior extremely aggravating from a co-worker. If you can’t be bothered to edit down your ramblings by hand, just don’t send me anything at all.
garblegarble
2 hours ago
Slightly off-topic but: does it concern you that you're letting atrophy a very important skill for human communication (organising your thoughts and ideas, and then clearly communicating them to others)?
limflick
38 minutes ago
As someone who's still learning English, this is one thing I'd never use AI for, at least not in the near future, simply because thinking and structuring my thoughts before typing is the same as it is before speaking, and actually talking to other people can't be outsourced to AI.
But I imagine if I'd been a native speaker I wouldn't mind using AI like OC does since it's a convenience. Same way I use a calculator for two digit multiplications in real life but spent years learning to do it manually in school.
KronisLV
an hour ago
> It drives me crazy they can't be respectful of my time and use STT+AI to convert their 90 second monologue to a few written sentences.
I have used Whisper to transcribe audio into text in the past. You could probably build a pipeline for that, whether running locally or in the cloud - and then run the transcription through the same summarization agent.
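A minimal sketch of what such a pipeline could look like, not a tested setup. The transcriber is passed in as a plain function so the shape of the pipeline is visible without any model installed. `summarize_naive` is a hypothetical placeholder that just keeps the first sentences; a real setup would call a summarization model there instead. The commented lines show how the open-source `openai-whisper` package could be plugged in, assuming it is installed.

```python
import re
from typing import Callable

def summarize_naive(text: str, max_sentences: int = 2) -> str:
    """Placeholder summarizer: keep only the first few sentences.
    Stand-in for a real summarization model or agent."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text.strip())
                 if s.strip()]
    return " ".join(sentences[:max_sentences])

def voice_message_to_text(
    audio_path: str,
    transcribe: Callable[[str], str],
    summarize: Callable[[str], str] = summarize_naive,
) -> str:
    """Speech-to-text first, then condense the transcript."""
    transcript = transcribe(audio_path)
    return summarize(transcript)

# Possible wiring with Whisper (assumes `pip install openai-whisper`):
#
#   import whisper
#   model = whisper.load_model("base")
#   def whisper_transcribe(path: str) -> str:
#       return model.transcribe(path)["text"]
#
#   print(voice_message_to_text("voice_note.ogg", whisper_transcribe))
```

Keeping the transcriber and summarizer as swappable functions means the same pipeline works whether either step runs locally or in the cloud.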
jamwil
an hour ago
What did you do prior to 2023?
rob
an hour ago
I hardly type at all now. I use Handy (free) with Parakeet and use its post-LLM processing feature with a custom prompt tailored towards coding, so I can say things like "Have it go to slash remote dash control" and it'll output "/remote-control". Converts brackets, etc.
Everything is almost instant, it's insanely fast, and lets me work on multiple different agents/windows at the same time fast with cmux.
I use the same thing to talk to people on Slack, iMessage, etc now when I'm working from home instead of typing.
I can also articulate my thoughts better when I'm literally thinking them out loud instead of just sitting silent and typing them on a computer for hours.
It's just something that you need to try and get used to because I also thought it was something I wouldn't like at first.
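To illustrate the kind of spoken-symbol conversion described above ("slash remote dash control" becoming "/remote-control"), here is a toy rule-based version. To be clear, this is not how Handy or its LLM post-processing works; it is a swapped-in, much simpler technique, and the word lists and function name are made up for the example.

```python
# Hypothetical word lists, for illustration only.
PREFIX = {"slash": "/"}                               # glued to the next word
INFIX = {"dash": "-", "dot": ".", "underscore": "_"}  # glued to both neighbours

def spoken_to_symbols(phrase: str) -> str:
    """Replace spoken symbol names with the symbols themselves."""
    out = []         # output pieces, joined with spaces at the end
    attach = False   # True when the next word must be glued to out[-1]
    for word in phrase.split():
        key = word.lower()
        if key in INFIX and out:
            out[-1] += INFIX[key]
            attach = True
        elif key in PREFIX:
            if attach:
                out[-1] += PREFIX[key]
            else:
                out.append(PREFIX[key])
            attach = True
        elif attach:
            out[-1] += word
            attach = False
        else:
            out.append(word)
    return " ".join(out)

print(spoken_to_symbols("Have it go to slash remote dash control"))
```

Fixed rules like these break quickly: "src slash main" comes out as "src /main", and a sentence that really means the word "dash" gets mangled. That ambiguity is presumably why an LLM with a custom prompt does the job better.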
weitendorf
4 hours ago
I thought this way until I tried it, and the main difference is that when I'm managing tons of agents at once or just reviewing some plan / approving next steps, or need to give quick feedback/ask a simple followup, the voice interface makes me much faster and more likely to continue because it's lower friction (and in many cases that's good, though not all) and can be hands-free.
Actually, my thoughts on this matter changed so much that it inspired me to get much more into voice controls because I realized how this same problem was basically why some people sucked at remote work or weren't able to properly use tools like claude code, because it was essentially the same problem but worse (typing / messaging feeling too high-friction or raising the barrier for participation). I have a way to let Claude call me now to tell me stuff when I have a bunch of instances out doing stuff and then leave to go home.
I'm trying to get that better integrated in my devloop because I think it makes managing >4 agents simultaneously much more feasible and natural for some people (I used to play Starcraft a lot so I'm used to the multitasking, but it still takes sustained willpower to be constantly "driving" or monitoring things, or to field questions), especially ones who have never served as TLs or people managers before. IMO it's a big performance roadblock for a lot of developers to treat directing multiple agents simultaneously as some kind of high-stakes/high-cost thing. The kind of developer who would not say anything in a team meeting unless prompted or who thinks everything is stupid by default (because they are afraid of making decisions / being wrong even if only briefly) is both very common and reluctant to work this way, but also really probably needs it to be as productive as more skilled developers.
noduerme
2 hours ago
I don't know about you, but I force myself to read the whole spaghetti thought process of any AI that's actually working on code, and make sure I understand what the hell it just said before I ask questions or give it a green light. Even or especially when whatever it said is full of fluffy stuff about having understood the problem space. That's usually where a well-placed question can bring the entire structure crashing down.
"You're right to push back" has become the gold standard phrase I'm looking for from these things to assure myself that I'm covering all the bases and understanding what it's building (not that that's enough, and not that it isn't still going to build some ungodly blob anyway).
I kinda like using voice to jot down my next questions or iterate on things, but there's a clear danger to it, which is that you may inadvertently be signing off on stuff you haven't thoroughly read. If there's one thing about LLM-written code, it's that the devil is in the details.
justech
an hour ago
Faster, and that's it. If you don't need precision (like with prompting LLMs) the speed gain is massive (*for most people)
pid-1
2 hours ago
I've been using ChatGPT by voice for things like cooking and house repair stuff. It's quite convenient for situations in which your hands are busy.
The other week I fixed a water valve. After planning the thing with ChatGPT I bought the new valve. Then I described what I was seeing as I swapped the old valve for the new one to make sure everything was right. Really cool experience!
NikolaNovak
2 hours ago
I type as fast as I talk, so for the majority of my LLM usage I don't need text to speech.
But I love the chatgpt voice interface e.g. on a long drive when I can use it to learn about random stuff (btw, turn advanced voice off for such usage).
The other part, though, is Hacker News vs. the regular population, the majority of which would much, much rather talk and listen than type and read.
kingkongjaffa
an hour ago
I like to talk (stt) but I don't want tts to talk back to me I just want to read the response. voice synthesis is a waste for me personally.
QuantumNomad_
4 hours ago
When I was still using OpenAI, I used it among other things to translate from English to Spanish while talking to Spanish-speaking people in person.
I understand a bit of Spanish but I don’t speak Spanish yet, and they don’t speak English.
I speak English to the AI and end with “translate to Spanish, translation only”, and then the AI says the thing I was saying in Spanish (not perfect but good enough, and also it has a slightly weird accent that might be it using English or English influenced text to speech even when speaking Spanish sentences?).
hidelooktropic
an hour ago
I can talk faster than I can type.
noduerme
3 hours ago
This may sound strange and even callous, but I think it's appealing to people who are used to having employees. It's not about speech being a better interface, it's that thinking hard enough to sit down and compose a prompt is too much work if you're used to just yelling at someone.
Pity the managers with no one left to boss around besides the machines coming for their own jobs.
I was asked just yesterday if I could wire up [redacted] so that [redacted profession] could have a realtime voice interface while in the middle of performing [redacted]. My basic answer was yes, but it would be a bit slower than you want if something is going wrong, and it would probably be unethical for a whole lot of reasons.
stranded22
5 hours ago
Accessibility.
arcanemachiner
5 hours ago
Much faster and better flow. Don't knock it til you've tried it.
throawayonthe
5 hours ago
it's very confusing. maaaybe if the stt is good and fast enough, speaking may be faster? english speakers can probably hit 150-180 wpm but seems like a hassle
perching_aix
4 hours ago
It's easier, faster, and more natural to talk than to type for the vast, vast majority of people.
This trivial fact of life is observed every day by e.g.:
- students taking notes and finding it necessary to only jot down key facts so that they can keep up,
- stenographers who require special training and equipment to keep up verbatim with live speech in the courtroom,
- annoying colleagues who insist on "hopping on a quick call" or arranging big, wasteful, and disruptive meetings instead of just writing down their problem / sending a message or email,
- friends who insist on sending short voice messages in DMs instead of typing, because it's more "personal" that way (which to be fair it is, but not to the extent proclaimed).
greenavocado
22 minutes ago
Also vision can be used for "compaction" https://blog.can.ac/2026/06/10/snapcompact/