Video-Guided Foley Sound Generation with Multimodal Controls

46 points, posted a year ago
by surprisetalk

19 Comments

KaiserPro

a year ago

It would be interesting to play with this model directly (if it ever gets released...)

One interesting thing is that foley is not real life, so depending on where they get the dataset from (or how they "generated" it) they might be learning how things actually sound, or how foley artists make them sound.

It's probably most noticeable in either car noises or eagles: https://www.youtube.com/watch?v=jQI-ddEPTx4 (which are often overdubbed with hawk noises)

echelon

a year ago

Adobe intern research almost never sees code / weights releases, sadly. It's an academic peg for the students and marketing for Adobe. Maybe future interns will start taking this into account and work somewhere they can reuse their research.

We need a broader open research / open source conversation about ML models. Meta calling their approach "open" and what the OSI deems "open" are both flawed and/or misleading.

China is really leading the "open" AI research game. They release so many goodies - papers, code, weights, training scripts, notebooks, and sometimes even training data. It's gotten to the point that whenever I see a Western publication, I groan.

More recently, their work seems to be done on H800s and export-controlled GPUs. This is also why they're starting to experiment with wildly innovative architectures. Despite being resource-constrained, they're really kicking ass.

fngjdflmdflg

a year ago

>Meta calling their approach "open" and what the OSI deems "open" are both flawed and/or misleading.[...] It's gotten to the point that whenever I see a Western publication, I groan.

llama weights are open, and Qwen 1.0 was based on llama,[0] so clearly those papers are useful to some people, even if some people want to be angry at Meta for not being open enough for them. In fact, I would say Qwen is less useful than llama, as their technical reports are not nearly as detailed as Meta's. Also, why groan just from seeing the origin of a project? Just read it to see whether it is open or not. And there are also lots of popular Chinese models that are completely closed, with no code or paper, e.g. Hailuo, Kling, etc.

[0] https://arxiv.org/pdf/2309.16609 p. 6

m3kw9

a year ago

That's the first time I've seen a demo like that. I have no doubt Hollywood is gonna be changed big time, especially with the cost and speed reduction. I think in 3 years it could reach Hollywood quality/speed/cost of generation and tools.

swatcoder

a year ago

> Hollywood is gonna be changed big time especially with cost and speed reduction

Over 100 years of history suggest that Hollywood experiences an ongoing, strong pressure to make productions more expensive and slower. Productions have been much cheaper and quicker in the past, and there's no technical impediment to making them that way again already (nothing has been lost), but studios and audiences generally want to see the limits of spectacle.

But generative AI does not deliver on "the limits of spectacle" and has no clear path to doing so. It makes average-ish digital content, by definition, and has unsolved challenges with maintaining coherency and consistency across and within sessions/segments. It does do that pretty cheaply and quickly though.

We can expect it to see the most use in already budget-constrained projects, where its compromises are a tolerable backdrop against some other focus (writing, humor, romance, etc), not the blockbusters that have huge budgets and polish demands and that mark the signature of "Hollywood". There, it'll expedite some creative utility tasks as people get the hang of using and improving it, but we can expect that the money and time saved there will just get routed over to other artisan, limit-pushing tasks.

famouswaffles

a year ago

>Productions have been much cheaper and quicker in the past, and there's no technical impediment to making them that way again already (nothing has been lost)

Productions also looked a lot worse in the past. Some productions are more expensive today because that's the budget that kind of production requires. You act as if a movie like Guardians of the Galaxy looking as good as it does was even remotely possible decades ago, and as if studios just want to spend more money on vain spectacle. Star Wars '77 was great for its time, but that's exactly it: 'for its time'.

'77's budget adjusts to about $60M today, and while you certainly can't recreate a modern, 'will hold up' Star Wars with that budget, $60M today gets you a lot farther than it did decades ago, with much better-looking movies.

swatcoder

a year ago

"Looks a lot worse" or "looks better" is an aesthetic judgement, so it confuses the discussion here.

But what you're actually doing here is agreeing with me. Movies whose intent is specifically to deliver on spectacle -- i.e. the big blockbusters like GotG or Star Wars -- are specifically trying to push the limits of what their financing can deliver for the aesthetic tastes of their current audience. That means competitively throwing piles of money at creative talent and asking them to produce their best ever work.

While using generative AI to trim costs on some non-spectacular supplementary stuff or noisy, brief background stuff can make more budget available for the spectacle, the goal is to spend as much as possible on the spectacle, not cut its costs.

famouswaffles

a year ago

Not every part of these blockbusters is pushing the limits of what's possible, even for spectacle. Studios are quite happy to leave a lot of things as they've previously managed because it's good enough.

liontwist

a year ago

When I look at the budget of modern movie effects compared to quality, I suspect there is money laundering going on.

liontwist

a year ago

Isn't this kind of a universal problem in media now? Movies compete with every other medium. The only voices that can be heard in that competitive marketing environment are big-budget, winner-take-all projects and direct access to niche segments (YouTubers).

dagmx

a year ago

You can do foley today for free already.

If you’re primarily talking about the time commitment as cost, anyone who doesn’t have the time won’t care about foley to begin with. It’s an attention to detail that someone turning to automated tools won’t have to begin with.

Beyond that, foley is a very creative process. It's not just placing footsteps at the right moment, it's making them sound right contextually. It's knowing that smacking a spring is a great blaster sound, not just putting a gunshot in.

dylan604

a year ago

Of all the crafts involved in making a movie, audio is one of my favorites (even though, as someone from the camera department, I'd never admit that on set). One of the post houses I worked in had a foley studio, and it has always been one of those things I would love to do. It just looks like so much fun. It's as close to child-like playing as an adult can get. What would it sound like if we did....cool. What would it sound like if we....meh, but we can blend that with that other thing...cool.

rob74

a year ago

Do I understand correctly that the videos themselves are also AI-generated? That would explain the lack of spacebars on both typewriters (the one in the slider at the beginning and the one in the example video).

0_____0

a year ago

The spacebar is there and gets used, it's just dark. Looks like an old Russian typewriter and nothing suggests it's AI generated.

jeff_vader

a year ago

Yup. And the other (green/teal?) typewriter has a spacebar on the edge, same colour as the typewriter body. Ends of it sometimes visible between fingers, it's also used.

whywhywhywhy

a year ago

No code, no model, Adobe involved.

Speaking as a creative: AI tools only really get interesting or genuinely useful in a professional setting if you can fine-tune them; otherwise everything has the same base aesthetic/phonaesthetic brush.

This is before we get into the ethics of Adobe taking artists work without compensation to build closed and paywalled machines to make their work worth less.

Kye

a year ago

The ML-based similar sound search in Ableton Live is very useful. It's the only example I can think of where it's completely uncontroversial.

AI music people: "We have removed the parts of making music that are actually enjoyable. You're welcome!"

Ableton: "The computer will sort and rank all 50,000 of your kick samples so you can actually get somewhere when you spend an hour swapping them out"

HelloUsername

a year ago

Why is the sound of the dog barking not also reversed, like the video is?