godelski
a month ago
As a machine learning researcher, I don't get why these are called world models.
Visually, they are stunning. But they're nowhere near physically consistent. I mean, look at that video with the girl and the lion. The tail teleports between legs and then becomes attached to the girl instead of the lion.
Just because the visuals are high quality doesn't mean it's a world model or has learned physics. I feel like we're conflating these things. I'd be much happier to call something a world model if its visual quality were dogshit but it were consistent with its world. And I say "its world" because it doesn't need to be consistent with ours.
KaiserPro
a month ago
I think the issue is that "world models" are poorly defined.
With this kind of image gen you can sort of plan robot interactions, but it's super slow. I need to find the paper that DeepMind produced, but basically they took the current camera input, used a text prompt like "robot arm picks up the ball", had the model generate a video of the arm motion, and then the robot arm moved as it did in the video.
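Roughly, the loop looked like this (a sketch from memory; every name below is a hypothetical stand-in, since I don't have the paper's actual interface in front of me):

    # Hedged sketch of the video-gen-as-planner loop. Every object here
    # (camera, video_model, pose_estimator, robot_arm) is a hypothetical
    # stand-in; the paper's real interface isn't public.
    def plan_and_execute(camera, video_model, pose_estimator, robot_arm,
                         prompt="robot arm picks up the ball"):
        frame = camera.capture()                    # current camera input
        video = video_model.generate(image=frame, prompt=prompt)
        trajectory = pose_estimator(video)          # arm poses read off the generated frames
        for pose in trajectory:                     # slow: one full video per action
            robot_arm.move_to(pose)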
The problem is that it's not really a world model, it's just image gen. It's not like the model outputs a simulation that you can interact with (without generating more video). It's not like it creates a bunch of rough geo that you can then run physics on (i.e., you imagine a setup, draw it out, and then run calcs on it).
There is lots of work on making splats editable and semantically labeled, but again, it's not like you can run physics on them, so simulation is still very expensive. Also, the properties depend on running the "world model" rather than querying the output at a point in time.
godelski
a month ago
> poorly defined.
Poorly defined is not the same as undefined. There are bounds, and we have a decent understanding of what the term means. Not having all the details worked out is not the same thing, though that lack of precision is being used to get away with more slop.

> I need to find the paper that DeepMind produced
I've seen that paper and followed the results pretty closely. I've even personally talked with people who worked on it. It very frequently "forgets" what is outside its view, and it very frequently performs physically inconsistent actions. When you evaluate those models, don't just try standard things; do weird things. For example, keep trying to extend the grabber arm: it shouldn't jump to other parts of the screen.

> The problem is that it's not really a world model, it's just image gen.
Yes, that was my point. Since you agree, I'm not sure why you're disagreeing.
KaiserPro
a month ago
I don't think I'm disagreeing, just adding more colour.
> It very frequently "forgets" what is outside its view
This matches the observations we made when we were testing it. My former lab was late to pivot to robotics, so we were surveying the current state of play to see what machine perception stuff was out there for robotics.
mycall
a month ago
Have you looked at Titans and MIRAS, where they use an online/updating associative memory that happens to be read out via next-token prediction?
https://research.google/blog/titans-miras-helping-ai-have-lo...
https://arxiv.org/abs/2501.00663
https://arxiv.org/pdf/2504.13173
Much research is going in these directions, but I'm more interested in mind-wandering tangents involving both attentional control and additional mechanisms (memory retrieval, self-referential processing).
KaiserPro
a month ago
Memory in world models is interesting. But I think the main issue is that it's holding everything in pixel space (it's not, exactly, but it feels like that) rather than concept space. That's why it's hard for it to synthesise consistently.
However I am not qualified really to make that assertion.
godelski
a month ago
Ah, thanks for the clarification. It can be hard to interpret on these forums sometimes.
nurettin
a month ago
> Visually, they are stunning.
The input images are stunning; the model's result is another disappointing trip to the uncanny valley. But we feel OK as long as the sequence doesn't horribly contradict the original image or sound. That is the world model.
godelski
a month ago
> But we feel OK as long as the sequence doesn't horribly contradict the original image or sound.
Is the error I pointed out not "horribly contradicting"?

> That is the world model.
I would say that if it is non-physical[0] then it's hard to call it a /world/ model. A world is consistent and has a set of rules that must be followed. I've yet to see a claimed world model that actually captures this behavior, yet it's something every game engine[1] gets right. We'd call a physics engine bad if it made the same mistakes we see even the most advanced "world models" make.
This is part of why I keep saying that visual quality is orthogonal. Even old Atari games have consistent world models despite being pixelated. Or think about Mario on the original NES: even the physics breaks in that game are edge cases, not the norm. But here, things like the lion's tail aren't consistent even in a 2D world. I've never bought the explanation that teleporting in front of and behind the leg is an artifact of embedding 3D into 2D[2], because the issue is actually that the model doesn't understand collision and occlusion. It does not understand how the sections of the image relate to one another. A toy of my own is sketched just below.
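To make "consistent" concrete (this is my toy, not taken from any of these systems): even the crudest world enforces its one rule identically on every step, which is exactly what the teleporting tail violates.

    # My own toy, not from any of these systems: a crude 1D "world"
    # whose single rule (floor collision) is enforced identically each step.
    GRAVITY, FLOOR = -9.8, 0.0

    def step(pos, vel, dt=0.016):
        vel += GRAVITY * dt
        pos += vel * dt
        if pos < FLOOR:           # collision: same rule, every frame, no exceptions
            pos, vel = FLOOR, 0.0
        return pos, vel

    pos, vel = 10.0, 0.0
    for _ in range(600):
        pos, vel = step(pos, vel)
    assert pos >= FLOOR           # the ball can never teleport through the floor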
The major problem with these systems is that they just hope the physics is recovered through enough examples of videos. If you've studied physics (beyond the basic college courses), you understand the naïveté of that. It took a long time to develop physics precisely because of these limitations. These models don't even have the advantage of being able to interact with the environment: they have no mechanisms to form beliefs and certainly no means to test them. It's essentially impossible to develop physics through observation alone.
[0] with respect to the physics of the world being simulated. I want to distinguish real-world physics from /a physics/.
[1] a game physics engine is a world model, which, as I'm stressing in [0], does not necessarily need to follow real-world physics. Mistakes happen, of course, but things are generally consistent.
[2] no video and almost no game is purely 2D; they tend to have backgrounds, which adds some layering, but we'll say 2D for convenience and because we have a shared understanding.
throwup238
a month ago
> The major problem with these systems is that they just hope the physics is recovered through enough examples of videos. If you've studied physics (beyond the basic college courses), you understand the naïveté of that. It took a long time to develop physics precisely because of these limitations. These models don't even have the advantage of being able to interact with the environment: they have no mechanisms to form beliefs and certainly no means to test them. It's essentially impossible to develop physics through observation alone.
Sounds like these world models are speedrunning from Platonic ideals to empiricism.
kgeist
a month ago
>A world is consistent and has a set of rules that must be followed.
Large language models are mostly consistent, but they make mistakes, even in grammar, from time to time, and it's usually called a "hallucination". Can't we say physics errors are a kind of "hallucination" too, in a world model? I guess the question is what hallucination rate we're willing to tolerate.
godelski
a month ago
It's not about making no mistakes; it's about the category of mistake.
Let's consider language as a world, in some abstract sense. Lies may (or may not) be consistent here: do they make sense linguistically? But then think about the category of errors where a model starts mixing languages and sounds entirely nonsensical. That's rare with current LLMs in standard usage, but you can still get them to have full-on meltdowns.
This is the class of mistakes these models are making, not the failing-to-recite-the-truth class.
(Not a perfect analogy, but I hope this explanation helps.)
nurettin
a month ago
> If you've studied physics (beyond the basic college courses), you understand the naïveté of that.
I studied enough physics to get a mechanical engineering diploma, and I do understand the naïveté. Observational physics can be derived with ML, and I have derived it, but not with neural nets. Or if you do it with neural nets, you can't AlphaZero it; you have to cheat.
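As a toy illustration of the idea (a sketch of mine, not my actual work): plain least squares recovers g from noisy free-fall observations, no neural net required.

    # Toy example (mine): recover g from noisy free-fall
    # observations with plain least squares.
    import numpy as np

    rng = np.random.default_rng(0)
    t = np.linspace(0, 2, 50)
    y = 100 - 0.5 * 9.81 * t**2 + rng.normal(0, 0.05, t.size)  # "observed" heights

    # Fit y = a*t^2 + b*t + c; physics says a = -g/2.
    a, b, c = np.polyfit(t, y, 2)
    print(f"estimated g = {-2 * a:.2f} m/s^2")  # ~9.81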
andy12_
a month ago
You just have to extrapolate the improvements in consistency in image models over the last couple of years and apply them to these kinds of video models. When, in a couple of years, they can generate consistent videos of many physical phenomena, nearly indistinguishable from reality, you'll see why they are called "world models".
IAmGraydon
a month ago
>As a machine learning researcher, I don't get why these are called world models.
It's called "world models" because it's a grift. An out-in-the-open, shameless grift. Investors, pile on.
godelski
a month ago
I'm just trying to be a bit more diplomatic, as it can be hard to communicate the issues. My first degree is actually in physics, and I'll just say... over there, "world model" implies something very different.
Edit: I said a bit more in the reply to the sibling comment. But we're probably on a similar page.
maplethorpe
a month ago
The tail teleports and reattaches because that is the sort of thing that happens in this special AI world. Even though it looks like a bug, it's actually a physical process being modelled accurately.
godelski
a month ago
I'll remind you that I am an ML researcher.
So you need to say more, or at least give me some reason to believe you, rather than stating something as objective truth with a "just trust me". In the long response to a sibling comment, I state more precisely why I have never bought this common conjecture. Because that's what it is: conjecture.
So give me at least some reason to believe you, because right now you have neither logos nor ethos. Your answer is in the form of ethos, but without the requisite credibility.