I Tried to Give AI "Imagination" to Solve Physics Problems

1 point | posted 7 hours ago by a1j9o94 | 3 comments

a1j9o94

7 hours ago

Hey HN,

  I spent the last few weeks exploring whether AI systems could benefit from generating video predictions before making decisions—like how humans mentally simulate "what happens if I pour this coffee?" before acting.

  The idea: Show an AI an image, ask "what happens if I push this?", have it generate a video prediction, then compare that prediction to reality. If the prediction looks wrong, maybe the AI could catch its own mistakes.
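
  In code, the loop is roughly the following (a minimal sketch: predict_video is a hypothetical stand-in for the actual Qwen2.5-VL + LTX-Video pipeline in the repo, and the threshold is illustrative, not a tuned value):

    import lpips  # pip install lpips (learned perceptual similarity)
    import torch

    def predict_video(image: torch.Tensor, action: str) -> torch.Tensor:
        """Hypothetical stand-in: generate a predicted future clip
        conditioned on an image and an action description."""
        raise NotImplementedError

    # LPIPS distance: lower = more perceptually similar.
    # Inputs are (N, 3, H, W) tensors scaled to [-1, 1].
    loss_fn = lpips.LPIPS(net="alex")

    def imagination_check(image, action, actual_clip, threshold=0.3):
        """Predict what happens, then compare the prediction to reality."""
        predicted_clip = predict_video(image, action)
        # Mean frame-wise perceptual distance across the clip.
        dist = loss_fn(predicted_clip, actual_clip).mean().item()
        return dist, dist > threshold  # flag if the prediction "looks wrong"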

  The result: Current models can't do this. But I learned some interesting things along the way.

  What I tested:
  - 7 different architectures for predicting future video frames from VLM latent space
  - Whether perceptual similarity (LPIPS) between predicted and actual video correlates with correctness (a toy version of this check is sketched after this list)
  - Self-correction loops where the model gets feedback on its predictions
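
  For the second item, the correlation check looks roughly like this. The data below is placeholder; in the real experiments the LPIPS distances and correctness labels come from the benchmark runs, and scipy's point-biserial correlation is one reasonable choice for a binary-vs-continuous pair (not necessarily the exact statistic used in the paper):

    import numpy as np
    from scipy.stats import pointbiserialr

    # Placeholder data: per-example LPIPS distance between the predicted
    # and actual clips, plus a binary "was the prediction semantically
    # correct" label. Real values come from the benchmark annotations.
    lpips_dist = np.array([0.21, 0.35, 0.18, 0.42, 0.30])
    correct = np.array([1, 0, 1, 0, 1])

    # Point-biserial correlation for a binary vs. continuous variable.
    # A value near zero (the post reports ~0.106) means perceptual
    # similarity tells you almost nothing about correctness.
    r, p = pointbiserialr(correct, lpips_dist)
    print(f"r = {r:.3f}, p = {p:.3f}")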

  Key findings:

  1. VLMs can't predict the future – Every architecture I tried performed worse than just copying the current frame as the "prediction." The model understands what's in an image but can't predict what will change.
  2. Visual similarity ≠ semantic correctness – This one surprised me. Wrong predictions often looked MORE similar to reality than correct ones (correlation between LPIPS similarity and correctness: 0.106). You can't use "does it look right?" to catch mistakes.
  3. Some things worked – Hybrid encoders (DINOv2 + VLM) preserve spatial information that VLMs lose. VLMs understand generated video well (93% semantic retention). Small adapters (10M params) work better than large ones (100M); a rough sketch of the hybrid-encoder shape follows this list.
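
  For the third finding, the hybrid encoder is shaped roughly like this. Dimensions are my illustrative guesses (384-d DINOv2 ViT-S/14 patch features, a 3584-d VLM hidden size); the real architecture lives in the foresight repo:

    import torch
    import torch.nn as nn

    class HybridAdapter(nn.Module):
        """Fuse spatial DINOv2 patch features with semantic VLM latents,
        then project into the video model's latent space via a small
        adapter (~9M params at these sizes, i.e. the ~10M scale that
        worked better than 100M)."""

        def __init__(self, dino_dim=384, vlm_dim=3584, latent_dim=512):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(dino_dim + vlm_dim, 2048),
                nn.GELU(),
                nn.Linear(2048, latent_dim),
            )

        def forward(self, dino_patches, vlm_latents):
            # dino_patches: (B, N, dino_dim) spatial patch features
            # vlm_latents:  (B, N, vlm_dim) VLM features on the same grid
            fused = torch.cat([dino_patches, vlm_latents], dim=-1)
            return self.proj(fused)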

  I'm releasing this as a benchmark proposal. Video generation is improving fast—capabilities that don't exist today might emerge in future models. Seems worth tracking.

  Links:
  - Demo video: https://youtu.be/YJxDt_zCrUI
  - Code + paper: https://github.com/a1j9o94/foresight
  - Live demo: https://foresight-demo-kappa.vercel.app

  Built with Qwen2.5-VL, LTX-Video, Modal (GPUs), and the Something-Something v2 dataset.

  Happy to answer questions about the experiments or methodology.

seg_lol

7 hours ago

Why is the demo video not in your readme?

a1j9o94

6 hours ago

Honestly just didn't think about it. Added it.