It's technically impressive, but I think the plot and story details are pretty bad.
The gimmick just doesn't work for me. A mega-popular white female jazz singer in black and white doesn't really make sense. Maybe a Judy Garland type singer would work, but she's singing in a style that I don't think makes sense: like someone imitating what they think jazz vocals should sound like without really listening to much jazz. Billie Holiday wasn't even that popular.
The black and white doesn't work for me either, because you can tell it's just the same color clips desaturated, whereas real black and white would have been shot on film and look like it.
I think the AI stuff is actually pretty good; it's the direction and human creativity here that fall short. The sound design and music are pretty bad.
I'm waiting to see what Aronofsky can do with these tools, since the studios won't let him set $30 million on fire again like he did with The Fountain.
I wouldn't be surprised if the video models were vastly undertrained compared to our text models. There are probably millions of hours of video we haven't used to train them yet.
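Back-of-the-envelope, the gap looks big even with conservative assumptions. Here's a quick sketch in Python; every number in it (total text tokens, hours of usable video, fps, tokens per frame) is an assumed order of magnitude, not a measured figure:

    # All figures are rough assumptions for illustration only.
    text_tokens = 15e12          # ~15T tokens, ballpark for a large LLM run

    video_hours = 10e6           # assume 10M hours of usable video exists
    fps = 24
    tokens_per_frame = 256       # assumed tokens per frame after compression
    video_tokens = video_hours * 3600 * fps * tokens_per_frame

    print(f"video: {video_tokens:.2e} tokens vs text: {text_tokens:.2e}")
    print(f"ratio: {video_tokens / text_tokens:.1f}x")

Even with these guesses, raw video comes out to an order of magnitude more tokens than the biggest text corpora, most of it untouched.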
Still seems like early days on this tech. We're nowhere near the limits.
Just a year ago, the best we could do was that distorted video of Will Smith eating spaghetti. A year from now this is going to be even closer to flawless.
But what does flawless mean? How is this not flawless already? I see very few “flaws” in it. The coverage of the video training space, though, is probably minuscule compared to photo and text.
I don't think Wes Anderson has anything to worry about either; his work isn't only panning shots in pastel colors.
The reason it looks like many clips joined together is that long-form video generation currently isn't possible. Most SOTA models can only generate a few seconds at a time. Beyond that it becomes much harder for the model to maintain consistency: objects pop in and out of existence, physics errors become more likely, and so on.
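The usual workaround is to stitch clips autoregressively: generate a few seconds, then condition the next chunk on the last frame of the previous one. A minimal sketch of the idea, with a noise-based stand-in for the model (the function names and the drift mechanism are illustrative assumptions, not any real model's API):

    import numpy as np

    FRAMES_PER_CHUNK = 48   # ~2 s at 24 fps, a typical per-call limit
    HEIGHT, WIDTH = 64, 64  # toy resolution

    def generate_chunk(cond_frame, rng):
        """Stand-in for one video-model call: produce a short clip
        starting from cond_frame. Per-frame noise mimics how small
        errors accumulate within a chunk."""
        frames = [cond_frame]
        for _ in range(FRAMES_PER_CHUNK - 1):
            drift = rng.normal(0.0, 0.01, size=cond_frame.shape)
            frames.append(np.clip(frames[-1] + drift, 0.0, 1.0))
        return np.stack(frames)

    def generate_long_video(first_frame, n_chunks, seed=0):
        """Chain chunks by re-conditioning on each clip's last frame.
        That single frame is the only state carried across the seam,
        which is why consistency degrades the longer the video runs."""
        rng = np.random.default_rng(seed)
        chunks, cond = [], first_frame
        for _ in range(n_chunks):
            clip = generate_chunk(cond, rng)
            chunks.append(clip)
            cond = clip[-1]
        return np.concatenate(chunks)

    video = generate_long_video(np.full((HEIGHT, WIDTH, 3), 0.5), n_chunks=5)
    # Mean drift from the opening frame, sampled at each chunk boundary;
    # it keeps growing, i.e. the stitched clips slowly lose the plot:
    print(np.abs(video - video[0]).mean(axis=(1, 2, 3))[::FRAMES_PER_CHUNK])

Real systems pass richer state across the seam (several tail frames, latents, the text prompt), but the failure mode is the same: whatever isn't carried forward is free to change.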
I think these are all limitations that can be improved with scale and iteration. Image and video generation models aren't affected as much by the problems that plague LLMs, so it should be possible to improve them through brute force alone.
I'm frankly impressed with this short film. They managed to keep the characters' appearance consistent across scenes, the sound and lip-syncing are well done, and the music is great. I could see myself enjoying this kind of content in a few years, especially if I can generate it myself.
> I’ll just say I don’t think Christopher Nolan has anything to worry about just yet.
The transition will happen gradually. We'll start seeing more and more generated video in mainstream movies, and traditional directors will get more comfortable using it. I think we'll still need human directors for a long time to come; they'll just use their talents in very different ways.
This is probably the worst it's ever going to be?
That’s not incompatible with a ceiling. I’m not sure what point you’re trying to make.