EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

71 points posted 14 hours ago
by blacktechnology

8 Comments

maxglute

9 hours ago

>A man talking as water splashes and gurgles and a motor engine hums in the background.

This is the first time I've heard AI Simlish. I wonder what the training data was. Seems like the work was done by Johns Hopkins and Tencent, but the fake AI language sounds... Indic? Are there other examples of AI generating speech in... hallucinated languages?

ben_w

7 hours ago

> Are there other examples of AI generating speech in... hallucinated languages?

Sure: https://suno.com/song/0c05e4bd-5879-4e1d-9bdd-555d76569501

No chance that it's getting ancient Sumerian correct.

Y_Y

6 hours ago

Of course, you can't just mathematically derive a language that isn't in the training set.

Except Sanskrit, naturally.

alex_duf

8 hours ago

Simlish is the first thing that came to my mind too.

owenpalmer

20 minutes ago

"A man yells, slams a door and then speaks."

These are hilarious.

tigermafia

6 hours ago

ElevenLabs started rolling out a generator for very basic sound effects. Using it made me wonder what the application for things like this would be. If it were real-time it could be used for games, but then there is the lack of predictable quality control.

For (cinematic) sound design the quality is not nearly good enough yet. For simple home-style videos, dozens of (more fun) options exist: foley, free sound libraries, freesound.org, going out with a phone and recording stuff.

earthnail

an hour ago

Same as image generation. When it gets to a certain quality level, it's much faster to describe what you want than to search for it.