AI video you can both watch and interact with in real-time

We’re launching our first interactive video experience

The emergence of a new medium of storytelling

Odyssey is an AI lab on a mission to empower creatives to tell never-before-told stories. We began this journey building world models to accelerate film and game production—but through our research, we’re now seeing the earliest glimpses of an entirely new medium of entertainment.

We call this interactive video—video you can both watch and interact with, imagined entirely by AI in real-time. It looks like the video you watch every day, but you can interact and engage with it in compelling ways (with your keyboard, phone, controller, and eventually audio). Consider it an early version of the Holodeck.

A research preview of interactive video

Today marks the beginning of our journey to bring this to life, with the public launch of our first interactive video experience. Powering this is a new world model, demonstrating capabilities like generating pixels that feel realistic, maintaining spatial consistency, learning actions from video, and outputting coherent video streams for 5 minutes or more. What’s particularly remarkable is its ability to generate and stream new, realistic video frames every 40ms.

The experience today feels like exploring a glitchy dream—raw, unstable, but undeniably new. While its utility is limited for now, improvements won’t be driven by hand-built game engines, but rather by models and data. We believe this shift will rapidly unlock lifelike visuals, deeper interactivity, richer physics, and entirely new experiences that just aren’t possible within traditional film and gaming.

On a long enough time horizon, this becomes the world simulator, where pixels and actions look and feel indistinguishable from reality, enabling thousands of never-before-possible experiences.

Powered by a real-time world model

A world model is, at its core, an action-conditioned dynamics model. Given the current state of the world, an incoming action, and a history of states and actions, the model attempts to predict the next state of the world in the form of a video frame. It's this architecture that's unlocking interactive video, along with other profound applications.

[Figure: The architecture of a world model]
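For readers who think in code, here is a minimal sketch of that interface. The names (WorldModel, predict_next_frame) are hypothetical and the learned network is replaced by a stub; this is an illustration of the idea, not Odyssey's implementation.

```python
# Minimal sketch of an action-conditioned dynamics interface.
# Names are illustrative, not Odyssey's API; the learned network is a stub
# so the example stays runnable.
from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class WorldModel:
    """Predicts the next video frame from the current frame, an incoming action,
    and a rolling history of past frames and actions."""
    history_frames: List[np.ndarray] = field(default_factory=list)
    history_actions: List[np.ndarray] = field(default_factory=list)

    def predict_next_frame(self, frame: np.ndarray, action: np.ndarray) -> np.ndarray:
        # A real model would condition a learned network on (frame, action, history);
        # here we just record the history and echo the current frame.
        self.history_frames.append(frame)
        self.history_actions.append(action)
        return frame.copy()
```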

Compared to language, image, or video models, world models are still nascent—especially those that run in real-time. One of the biggest challenges is that world models require autoregressive modeling, predicting future state based on previous state. This means the generated outputs are fed back into the context of the model. In language, this is less of an issue due to its more bounded state space. But in world models—with a far higher-dimensional state—it can lead to instability, as the model drifts outside the support of its training distribution. This is particularly true of real-time models, which have less capacity to model complex latent dynamics. Improving this is an area of research we're deeply invested in.
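To make that feedback loop concrete, the sketch below shows an illustrative autoregressive rollout, reusing the hypothetical WorldModel stub from above: each predicted frame is appended to the model's context and becomes the input for the next step, which is exactly where small errors can compound and push the model off-distribution.

```python
# Illustrative autoregressive rollout using the hypothetical WorldModel stub above.
# Each generated frame is fed back in as context for the next prediction.
import numpy as np


def rollout(model, first_frame: np.ndarray, actions: list) -> list:
    frames = [first_frame]
    for action in actions:
        next_frame = model.predict_next_frame(frames[-1], action)
        frames.append(next_frame)  # generated output becomes future input
    return frames


# Example: 25 fps for 10 seconds = 250 autoregressive steps.
model = WorldModel()  # stub defined in the earlier sketch
frames = rollout(model, np.zeros((720, 1280, 3)), [np.zeros(4)] * 250)
```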

To improve autoregressive stability for this research preview, what we’re sharing today can be considered a narrow distribution model: it's pre-trained on video of the world, and post-trained on video from a smaller set of places with dense coverage. The tradeoff of this post-training is that we lose some generality, but gain more stable, long-running autoregressive generation.

To broaden generalization, we’re already making fast progress on our next-generation world model. That model—shown in raw outputs below—is already demonstrating a richer range of pixels, dynamics, and actions, with noticeably stronger generalization.

The earliest version of the world simulator

Looking ahead, we’re researching richer world representations that capture dynamics far more faithfully, while increasing temporal stability and persistent state. In parallel, we’re expanding the action space from motion to world interaction, learning open actions from large-scale video.

Learning not just video, but the actions that shape it

Early research on interactive video has focused on learning pixels and actions from game worlds like Minecraft or Quake, where the pixels are constrained, the motion is basic, the possible actions are limited, and the physics are simplified. These limitations and this lack of diversity make it easier to model how actions affect pixels, but game worlds enforce a low, known ceiling on what’s possible with these models.

It's our belief that learning both pixels and actions from decades of real-life video—like what you see below—has the potential to lift that ceiling, unlocking models that learn life-like visuals and the full, unbounded range of actions we take in the world—beyond the traditional game logic of walk here, run there, shoot that.

Learning the real world

Learning from open-ended real-life video is an incredibly hard problem. The visuals are noisy and diverse, actions are continuous and unpredictable, and the physics are—well—real. But, it’s what will ultimately unlock models to generate unprecedented realism.

A world model, not a video model

At first glance, interactive video seems like a perfect application of video models. However, the architecture, parameter count, and datasets of typical video models aren’t conducive to generating video in real-time that’s influenced by user actions.

World Model | Video Model
Predicts one frame at a time, reacting to what happens. | Generates a full video in one go.
Every future is possible. | The model knows the end from the start.
Fully interactive—responds instantly to user input at any time. | No interactivity—the clip plays out the same every time.

As one example of the difference, video models generate a fixed set of frames in one go. They do this by building a structured embedding that represents a whole clip—which works great for clip generation, where nothing needs to change mid-stream—but is a non-starter for interactivity. Once the video embedding is set, you’re locked in, meaning you can only adjust the video at fixed intervals.

A world model, however, works very differently: it predicts the next world state given the current state and an action, and it can do so at a flexible interval. Because new inputs from the user can happen at any moment, that interval can be as short as a single frame of video—allowing the user to guide video generation in real-time with their actions. For interactive video, this is essential.

[Figure: A world model works differently to a video model]
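Here is a hedged sketch of what that frame-by-frame loop looks like, again reusing the hypothetical WorldModel stub from earlier; read_latest_input is a stand-in for whatever keyboard, phone, or controller polling the client performs.

```python
# Sketch of a frame-by-frame interactive loop: the latest user input is read
# before every prediction, so an action taken at any moment shows up in the
# very next frame. Uses the hypothetical WorldModel stub from the earlier sketch.
import numpy as np


def read_latest_input() -> np.ndarray:
    # Placeholder: e.g. [forward, back, left, right] as a continuous action vector.
    return np.zeros(4)


def interactive_stream(model, frame: np.ndarray, num_frames: int = 250):
    for _ in range(num_frames):       # ~10 seconds at 25 fps
        action = read_latest_input()  # input is re-read before every single frame
        frame = model.predict_next_frame(frame, action)
        yield frame                   # streamed back to the viewer in real time
```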

Served by real-time infrastructure

The model in our research preview is capable of streaming video at up to 30 FPS from clusters of H100 GPUs in the US and EU. Behind the scenes, the moment you press a key, tap a screen, or move a joystick, that input is sent over the wire to the model. Using that input and frame history, the model then generates what it thinks the next frame should be, streaming it back to you in real-time.

This series of steps can take as little as 40 ms, meaning the actions you take feel like they’re instantaneously reflected in the video you see. The cost of the infrastructure enabling this experience is currently $1-$2 per user-hour, depending on the quality of video we serve. This cost is decreasing fast, driven by model optimization, infrastructure investments, and tailwinds from language models.
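As a rough sanity check on those figures, the back-of-envelope arithmetic below converts the quoted per-hour cost into a per-frame cost, assuming one frame roughly every 40 ms; the numbers are illustrative only.

```python
# Back-of-envelope arithmetic using the figures quoted above: $1-$2 per
# user-hour and a new frame roughly every 40 ms. Illustrative only.
frames_per_hour = 3600 / 0.040  # = 90,000 frames per user-hour
for cost_per_hour in (1.0, 2.0):
    print(f"${cost_per_hour:.0f}/user-hour ~ ${cost_per_hour / frames_per_hour:.6f} per frame")
# Output: roughly $0.000011 to $0.000022 per generated frame.
```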

Zooming out, we believe it’s going to be difficult to ignore the ramifications of how interactive video is “produced,” where unique, interactive experiences can be imagined instantly by AI at an extremely low relative cost.

A new form of video is emerging

New ways to tell stories have always emerged from new technologies: paintings, books, photography, film, radio, video games, visual effects, social media, streaming. It’s a story as old as time.

Interactive video—enabled by real-time world models—is next, and opens the door to entirely new forms of entertainment, where stories can be generated and explored on demand, free from the constraints and costs of traditional production. Over time, we believe everything that is video today—entertainment, ads, education, training, travel, and more—will evolve into interactive video, all powered by Odyssey.

The research preview we’re sharing today is a humble beginning toward this incredibly exciting future, and we can’t wait for you to try it and to hear what you think!

The team that brought this to life

This research preview was made possible by the incredible Odyssey team.

Technical Staff

Ben Graham, Boyu Liu, Gareth Cross, James Grieve, Jeff Hawke, Jon Sadeghi, Oliver Cameron, Philip Petrakian, Richard Shen, Robin Tweedie, Ryan Burgoyne, Sarah King, Sirish Srinivasan, Vinh-Dieu Lam, Zygmunt Łenyk.

Operational Staff

Andy Kolkhorst, Jessica Inman.

This isn't a solved problem

This research preview is by no means perfect, nor is this a solved research problem. If challenges at the frontier of AI appeal to you, we’re actively hiring across multiple roles—research scientists, research engineers, ML performance and systems engineers, data engineers, and more—in Silicon Valley, London, and remotely.

To get a sense of the kinds of challenges you'd work on, below are some fun failure modes we've observed with our next-generation world model. We hope you enjoy these weird and wonderful generations as much as we did.