Also, ongoing research into “world models” or “world simulators,” commonly associated with AI video synthesis models like Runway’s Gen-3 Alpha and OpenAI’s Sora, points in a similar direction. For example, during the debut of Sora, OpenAI showed demo videos of the AI generator simulating Minecraft.
Diffusion is key
In a preprint research paper titled “Diffusion Models Are Real-Time Game Engines,” authors Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter explain how GameNGen works. Their system uses a modified version of Stable Diffusion 1.4, an image synthesis diffusion model released in 2022 that people use to produce AI-generated images.
“Turns out the answer to ‘can it run DOOM?’ is yes for diffusion models,” wrote Stability AI Research Director Tanishq Mathew Abraham, who was not involved with the research project.
Credit: Google Research
Guided by player input, the diffusion model predicts the next game state from previous ones, having been trained on extensive footage of Doom in action.
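To make that loop concrete, here is a toy sketch of action-conditioned next-frame prediction. This is not GameNGen's actual architecture: the function names and the stand-in "denoiser" are hypothetical, and a real system would use a trained diffusion U-Net rather than the placeholder math below.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(noisy_frame, past_frames, action_id, t):
    # Placeholder denoiser: nudges the noisy frame toward the mean of
    # the past frames, slightly shifted by the action. A trained model
    # would instead predict the noise to remove at timestep t.
    context = past_frames.mean(axis=0)
    return noisy_frame + 0.5 * (context - noisy_frame) + 0.01 * action_id

def predict_next_frame(past_frames, action_id, steps=10):
    # Start from pure noise and iteratively denoise, conditioned on
    # the recent frames and the player's current input action.
    frame = rng.standard_normal(past_frames.shape[1:])
    for t in reversed(range(steps)):
        frame = denoise_step(frame, past_frames, action_id, t)
    return frame

past = rng.standard_normal((4, 8, 8))   # last 4 frames, 8x8 "pixels"
next_frame = predict_next_frame(past, action_id=2)
print(next_frame.shape)  # (8, 8)
```

The essential point the sketch captures is that the player's action enters the generation loop as conditioning on every denoising step, not as an afterthought.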
The development of GameNGen involved a two-phase training process. Initially, the researchers trained a reinforcement learning agent to play Doom, with its gameplay sessions recorded to create an automatically generated training dataset—that footage we mentioned. They then used that data to train the custom Stable Diffusion model.
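The two-phase pipeline can be sketched in a few lines. Everything here is illustrative: the policy and game-step functions are stand-ins, not the paper's code, and the actual dataset consists of rendered frames rather than integers.

```python
import random

random.seed(0)

def agent_policy(observation):
    # Stand-in for the trained RL agent; a real policy would choose
    # actions to maximize in-game reward.
    return random.choice(["forward", "turn_left", "turn_right", "shoot"])

def game_step(observation, action):
    # Stand-in for the real game engine producing the next frame.
    return observation + 1

# Phase 1: the agent plays, and (frame, action) pairs are recorded.
dataset = []
obs = 0
for _ in range(100):
    action = agent_policy(obs)
    dataset.append((obs, action))
    obs = game_step(obs, action)

# Phase 2: the recorded pairs would then be fed to the diffusion
# model's training loop (omitted here).
print(len(dataset))  # 100
```

The design choice worth noting is that the training data is generated automatically: no humans had to play Doom for hours to build the dataset.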
However, using Stable Diffusion introduces some graphical glitches, as the researchers note in their abstract: “The pre-trained auto-encoder of Stable Diffusion v1.4, which compresses 8×8 pixel patches into 4 latent channels, results in meaningful artifacts when predicting game frames, which affect small details and particularly the bottom bar HUD.”
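The quoted autoencoder detail implies heavy compression, which helps explain why small details like the HUD degrade. Assuming a Doom-like 320×240 RGB frame (the frame size is our assumption, not stated in the quote), the shape arithmetic works out as follows:

```python
# Each 8x8 pixel patch (3 color channels) is squeezed into 4 latent
# values, so fine detail smaller than a patch is easily lost.
width, height, channels = 320, 240, 3   # assumed frame size
patch = 8                               # 8x8 pixel patches
latent_channels = 4

latent_w, latent_h = width // patch, height // patch
pixel_values = width * height * channels
latent_values = latent_w * latent_h * latent_channels

print(latent_w, latent_h)             # 40 30
print(pixel_values // latent_values)  # 48 (i.e., 48x compression)
```

A 48× reduction in values per frame is a lot of information to throw away, so thin text and the status bar at the bottom of the screen are exactly where artifacts show up first.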
An example of GameNGen in action, interactively simulating Doom using an image synthesis model.
And that’s not the only challenge. Keeping the images visually clear and consistent over time (often called “temporal coherency” in the AI video space) is difficult too. As the GameNGen researchers write in their paper, “interactive world simulation is more than just very fast video generation.” They add that “the requirement to condition on a stream of input actions that is only available throughout the generation breaks some assumptions of existing diffusion model architectures,” including repeatedly generating new frames based on previous ones (called “autoregression”), which can lead to instability and a rapid decline in the quality of the generated world over time.
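The autoregression problem can be illustrated with a toy simulation. This is not the paper's experiment; it just models the mechanism: each generated frame becomes input for the next, so small per-frame errors compound into large drift over a rollout.

```python
import numpy as np

rng = np.random.default_rng(0)

true_frame = np.zeros(16)   # stand-in for the "correct" frame
frame = true_frame.copy()
errors = []
for step in range(50):
    # Pretend each prediction adds a little noise on top of its input,
    # as an imperfect generator would.
    frame = frame + 0.1 * rng.standard_normal(16)
    errors.append(float(np.abs(frame - true_frame).mean()))

# Error at the end of the rollout dwarfs the error after one step.
print(errors[-1] > errors[0])
```

This compounding is why the GameNGen authors had to go beyond off-the-shelf diffusion architectures to keep the simulated world stable over long play sessions.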