AIPedant
11 days ago
This seems to ignore the mixed record of video generation models:
> For visual reasoning practice, we can do supervised fine-tuning on sequences similar to the marble example above. For instance, to understand more about the physical world, we can show the model sequential pictures of Slinkys going down stairs, or basketball players shooting 3-pointers, or people hammering birdhouses together....

> But where will we get all this training data? For spatial and physical reasoning tasks, we can leverage computer graphics to generate synthetic data. This approach is particularly valuable because simulations provide a controlled environment where we can create scenarios with known outcomes, making it easy to verify the model's predictions. But we'll also need real-world examples. Fortunately, there's an abundance of video content online that we can tap into. While initial datasets might require human annotation, soon models themselves will be able to process videos and their transcripts to extract training examples automatically.
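To make the quoted pipeline concrete, here is a toy sketch of the kind of simulation-based example generation the article describes, where the ground truth is known by construction. Everything in it (the function names, the coordinate output) is hypothetical; a real pipeline would render frames with a graphics engine rather than emit raw coordinates.

```python
# Toy sketch: generate a synthetic "falling marble" example with a
# known, verifiable outcome. All names here are hypothetical.
import json
import random

GRAVITY = 9.81  # m/s^2
FPS = 30

def simulate_drop(height_m: float, n_frames: int) -> list[dict]:
    """Sample the marble's height at each frame until it hits the floor."""
    frames = []
    for i in range(n_frames):
        t = i / FPS
        y = max(height_m - 0.5 * GRAVITY * t * t, 0.0)
        frames.append({"t": round(t, 3), "y": round(y, 3)})
        if y == 0.0:
            break
    return frames

def make_example() -> dict:
    """One training example: a frame sequence plus a ground-truth label
    the model's prediction can be checked against."""
    height = random.uniform(0.5, 3.0)
    return {
        "frames": simulate_drop(height, n_frames=120),
        # Known outcome, exact by construction: t = sqrt(2h/g).
        "label": {"impact_time": round((2 * height / GRAVITY) ** 0.5, 3)},
    }

if __name__ == "__main__":
    dataset = [make_example() for _ in range(1000)]
    print(json.dumps(dataset[0], indent=2))
```

The appeal of this setup is exactly what the article claims: because the outcome is computed rather than observed, verifying a model's prediction is trivial.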
Almost every video generator makes constant "folk physics" errors and doesn't understand object permanence. DeepMind's Veo 2 is very impressive but still struggles with object permanence and qualitatively nonsensical physics: https://x.com/Norod78/status/1894438169061269750

Humans do not learn these things by pure observation (newborns understand object permanence; I suspect this is the case for all vertebrates). I doubt transformers are capable of learning it as robustly, even if trained on all of YouTube. There will always be "out of distribution" physical nonsense involving mistakes humans (or lizards) would never make, even if they've never seen the specific objects.
throwanem
11 days ago
> newborns understand object permanence
Is that why the peekaboo game is funny for babies? The violated expectation at the soul of the comedy?
broof
11 days ago
Yeah, I had thought that newborns famously didn't understand object permanence and that it developed sometime during their first year. And that was why peekaboo is fun: you're essentially popping in and out of existence.
AIPedant
11 days ago
This is a case where early 20th century psychology is wrong, yet still propagates as false folk knowledge:
https://en.wikipedia.org/wiki/Object_permanence#Contradictin...
namaria
10 days ago
I don't think it's so clear cut. The Wikipedia article cites a 1971 book and an article from 1991 as sources.
A paper from 2014 is not so sure this is a settled matter:
"Infant object permanence is still an enigma after four decades of research. Is it an innate endowment, a developmental attainment, or an abstract idea not attributable to non-verbal infants?"
and argues that object permanence is not in fact innate but developed:
"It is argued that object permanence is not innately specified, but develops. The theory proposed here is that object permanence is an attainment that grows from a developmentally prior understanding of object identity"
"In sum, it is posited that permanence is initially dependent on the nature of the occlusion; with development, it becomes a property of objects. Even for 10–12-month-olds, object permanence is still a work-in-progress, manifested on one disappearance transformation but not another."
smusamashah
11 days ago
My kid used to drop things and look down at them from his chair. I didn't understand what he was trying to do. I learned that it was his way of trying to understand how the world works: whether, if he dropped something, it would remain there or disappear.
That contradicts this contradiction, unless there is another explanation.
sepositus
10 days ago
My (much) older kid still does ridiculous things that defy reason (and usually end up with something broken). I don't think it's fair to say that every action they take has some deeper meaning behind it.
"Why were you throwing the baseball at the running fan?" "I don't know...I was bored."
tim333
10 days ago
I'm not sure that stuff dies out fully with adulthood. I imagine part of Musk doing iffy salutes or Trump doing weird tariffs is curiosity - what if I do this odd thing, I wonder what happens. I can think of examples with myself too - I'm kind of learning short-term trading and the behaviors can be counterintuitive.
simplify
11 days ago
A child doesn't always have a mental "why" behind their actions. Sometimes kids just behave in coarse, playful ways, and those ways happen to be very useful for mental development.
wongarsu
10 days ago
If it's a round thing it might well disappear. Even with full understanding that things don't cease to exist when they leave your field of view, "where do things end up if I drop them" is a pretty big field to experiment with. YouTube is full of videos of adults doing the same, just from more extreme heights or other unusual scenarios (since as adults we have a solid grasp of the simpler scenarios).
AIPedant
10 days ago
> I learned that it was his way of trying to understand how the world works: whether, if he dropped something, it would remain there or disappear.
I don’t understand how you learned this. Who told you that’s what was going on in his head?
crispycas12
10 days ago
That's odd. This was content that was still on the MCAT when I took it last year. I even remember having the formation of object permanence occurring at ~0-2 years of age on my flashcards.
AIPedant
10 days ago
It’s far from perfect in newborns (containers take some time to understand, and in general infants have weak short-term memory). I also wonder how much effort goes into updating MCAT questions with new scientific developments, especially when there is limited clinical significance - an infant who struggles with object permanence likely has serious neurological/cognitive problems across the board.
throwanem
10 days ago
Have you checked lately, though?
andoando
11 days ago
Babies pretty much laugh if you're laughing and being silly.
moi2388
10 days ago
No, they understand object permanence just fine.
Peekaboo is fun because fun is fun. When doing peekaboo the other person is paying attention to you, and often smiling and being relaxed.
They laugh just as much if you play ‘peekaboo’ without actually covering your face ;)
dinfinity
11 days ago
You provide no actual arguments as to why LLMs are fundamentally unable to learn this. Your doubt is as valuable as my confidence.
viccis
10 days ago
Because the nature of their operation (learning a probability distribution over a corpus of observed data) is not the same as creating synthetic a priori knowledge (object permanence is a case of cause and effect, which is synthetic a priori knowledge). All LLM knowledge is by definition a posteriori.
AstralStorm
10 days ago
That LLMs cannot synthesize it into a priori knowledge, including other rules of logic and mathematics, is a major failure of the technology...
AIPedant
11 days ago
Well, it's a good thing I didn't say "fundamentally unable to learn this"!
I said that learning visual reasoning from video is probably not enough: if you claim it is enough, you have to reconcile that with failures in Sora, Veo 2, etc. Veo 2's problems are especially serious since it was trained on an all-DeepMind-can-eat diet of YouTube videos. It seems like they need a stronger algorithm, not more Red Dead Redemption 2 footage.
dinfinity
11 days ago
> I said that learning visual reasoning from video is probably not enough
Fair enough; you did indeed say that.
> if you claim it is enough, you have to reconcile that with failures in Sora, Veo 2, etc.
This is flawed reasoning, though. The current state of video generating AI and the completeness of the training set does not reliably prove that the network used to perform the generation is incapable of physical modeling and/or object permanence. Those things are ultimately (the modeling of) relations between past and present tokens, so the transformer architecture does fit.
It might just be a matter of compute/network size (modeling four-dimensional physical relations in high resolution is pretty hard, yo). If you look at the scaling results from the early Sora blogs, the natural increase of physical accuracy with more compute is visible: https://openai.com/index/video-generation-models-as-world-si...
It also might be a matter of fine-tuning training on (and optimizing for) four-dimensional/physical accuracy rather than on "does this generated frame look like the actual frame?"
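To illustrate that last distinction, here is a hypothetical sketch: a standard per-frame reconstruction loss plus an extra term that penalizes tracked objects for teleporting between adjacent frames. The tensor shapes, the tracking step, and the weighting are all assumptions for illustration, not anything from Sora's or Veo's actual training.

```python
# Hypothetical sketch: per-frame reconstruction loss vs. an added
# "physical consistency" term. Nothing here reflects a real system.
import torch

def reconstruction_loss(pred_frames, true_frames):
    # "Does this generated frame look like the actual frame?"
    return torch.mean((pred_frames - true_frames) ** 2)

def consistency_loss(object_positions, max_step=0.05):
    # object_positions: (time, n_objects, 2) tracked centroids in [0, 1].
    # Penalize frame-to-frame jumps larger than plausible motion -
    # a crude stand-in for object permanence / smooth trajectories.
    deltas = object_positions[1:] - object_positions[:-1]
    jump = torch.linalg.norm(deltas, dim=-1)  # (time - 1, n_objects)
    return torch.mean(torch.clamp(jump - max_step, min=0.0) ** 2)

def total_loss(pred_frames, true_frames, object_positions, weight=0.1):
    return (reconstruction_loss(pred_frames, true_frames)
            + weight * consistency_loss(object_positions))
```

The weight on the consistency term would be a tuning knob: set it too high and the model could learn to keep everything static, which is its own kind of physics failure.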
zveyaeyv3sfye
11 days ago
[flagged]
nonameiguess
11 days ago
"All of YouTube" brings the same problem as training on all of the text on the Internet. Much of that text is not factual, which is why RLHF and various other fine-tuning efforts need to happen in addition to just reading all the text on the Internet. All videos on YouTube are not unedited footage of the real world faithfully reproducing the same physics you'd get by watching the real world instead of YouTube.
As for object permanence, I don't know jack about animal cognitive development, but it seems important that all animals are themselves also objects. Whether or not they can see at all, they can feel their bodies and sense, in some way or other, their relation to the larger world of other objects. They know they don't blink in and out of existence or teleport, which seems like it would create a strong bias toward believing nothing else can do that, either. The same holds true with physics. As physical objects existing in the physical world, we are ourselves subject to physics and learn a model that is largely correct within the realm of energy densities and speeds we can directly experience. If we had to learn physics entirely from watching videos, I'm afraid Roadrunner cartoons and Fast and the Furious movies would muddy the waters a bit.