Google's Genie Creates Playable Worlds from Sketches

👋 Hi, I am Mark. I am a strategic futurist and innovation keynote speaker. I advise governments and enterprises on emerging technologies such as AI and the metaverse. My subscribers receive a free weekly newsletter on cutting-edge technology.

The field of generative AI has seen tremendous advances in recent years, with models capable of generating remarkably realistic images, videos, and text. However, most of these models focus on passive generation from a prompt. In their new paper, researchers from DeepMind introduce an exciting new paradigm - generative interactive environments.

Imagine sketching a whimsical landscape on a napkin over lunch and, by evening, stepping into it as a playable 2D world. That's not a page from a sci-fi novel; it's the premise behind Google's latest AI marvel, Genie.

Unlike the magical beings of lore, this Genie doesn't grant three wishes but offers endless possibilities to creators, transforming mere images into interactive experiences. Trained on a vast trove of gameplay footage, Genie crafts worlds more aligned with classic platformers than VR, but its implications ripple far beyond gaming.

The model can take a text or image prompt and generate an entire playable, game-like environment. What's more, Genie is trained without any action labels or supervision, using only raw internet videos of people playing games. This allows it to learn in a completely unsupervised manner, opening up the possibility of internet-scale training.

Under the hood, Genie consists of three core components: a video tokenizer, a latent action model, and a dynamics model. The tokenizer compresses the raw video frames into discrete tokens. The latent action model then infers a discrete set of "actions" between frames, despite no ground truth being available. Finally, the dynamics model takes the frame tokens and latent actions as input, and predicts the next frame in an autoregressive manner.
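To make the flow between those three components concrete, here is a toy, shapes-only sketch in Python. This is not the paper's implementation: the class, the codebook sizes, and the stand-in arithmetic in `predict_next` are all illustrative assumptions; the real components are learned transformer models.

```python
import numpy as np

class ToyGenie:
    """Toy sketch of Genie's three-stage pipeline (shapes only, no learning)."""

    def __init__(self, n_tokens=64, n_actions=8, seed=0):
        self.n_tokens = n_tokens    # video-tokenizer codebook size (assumed)
        self.n_actions = n_actions  # small discrete latent-action vocabulary
        self.rng = np.random.default_rng(seed)

    def tokenize(self, frame):
        # Video tokenizer: compress a raw frame (H, W) into a coarser
        # grid of discrete tokens (here: random stand-in values).
        h, w = frame.shape[0] // 4, frame.shape[1] // 4
        return self.rng.integers(0, self.n_tokens, size=(h, w))

    def infer_latent_action(self, tokens_t, tokens_t1):
        # Latent action model: infer a discrete "action" that explains the
        # change between consecutive frames -- no ground-truth labels needed.
        return int(self.rng.integers(0, self.n_actions))

    def predict_next(self, token_history, action):
        # Dynamics model: autoregressively predict the next frame's tokens
        # from past tokens and the chosen latent action.
        prev = token_history[-1]
        return (prev + action) % self.n_tokens  # stand-in for a transformer

    def rollout(self, first_frame, actions):
        # At inference time the user supplies the actions, frame by frame.
        tokens = [self.tokenize(first_frame)]
        for a in actions:
            tokens.append(self.predict_next(tokens, a))
        return tokens

genie = ToyGenie()
frames = genie.rollout(np.zeros((32, 32)), actions=[1, 3, 0])
```

The key design point this illustrates: because actions are *inferred* between frames during training, raw unlabeled gameplay video is enough, and at play time the same small action vocabulary becomes the user's controller.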

A key innovation is Genie's use of a spatiotemporal transformer architecture. By factorizing self-attention into separate spatial and temporal layers, Genie can efficiently model long video sequences. Experiments confirm that Genie scales well as more parameters and data are added, culminating in an 11 billion parameter model trained on over 200,000 hours of gaming videos.
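The efficiency gain from that factorization can be sketched in a few lines of NumPy. This is a minimal single-head attention toy, not Genie's architecture (which adds projections, feed-forward layers, and causal masking); the shapes and function names are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    # Plain single-head self-attention over the second-to-last axis,
    # batched over all leading axes.
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def st_block(x):
    """x: (T frames, S tokens per frame, D dims).
    Spatial layer: each frame's S tokens attend to each other.
    Temporal layer: each token position attends across the T frames."""
    x = attend(x)                                 # spatial, batched over T
    x = attend(x.swapaxes(0, 1)).swapaxes(0, 1)   # temporal, batched over S
    return x

T, S, D = 16, 400, 32
x = np.random.default_rng(0).normal(size=(T, S, D))
y = st_block(x)
```

The payoff: full attention over a video of T frames with S tokens each scores all (T·S)² token pairs, while the factorized block only computes T·S² spatial pairs plus S·T² temporal pairs, which is what makes long sequences tractable.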

The results are seriously impressive. Genie can take sketches, text descriptions, and even photorealistic images as prompts to generate interactive game worlds. The latent actions provide smooth control, moving characters and objects accordingly. One remarkable demonstration is Genie's ability to emulate parallax - foreground objects moving faster than distant background ones.

While limitations remain in consistency and speed, the authors argue that Genie opens up many exciting avenues for future work. It could be a general simulation engine for training reinforcement learning agents or robots. More broadly, by unlocking creative interactive experiences from any user's imagination, Genie points the way towards more humanistic generative AI.

As we marvel at Genie's potential to democratize game design, we might also ponder: How will this technology influence our perception of creativity and authorship? Are we edging closer to a world where our imaginations are the only limits, or will these tools reshape our very concept of creativity?

Read the full article on Tom's Guide.


đź’ˇ If you enjoyed this content, be sure to download my new app for a unique experience beyond your traditional newsletter.

This is one of many short posts I share daily on my app, where you can have real-time insights, recommendations and conversations with my digital twin via text, audio or video in 28 languages! Go to my PWA and sign up to take our connection to the next level! 🚀


If you are interested in hiring me as your futurist and innovation speaker, feel free to complete the below form.

Dr Mark van Rijmenam

Dr. Mark van Rijmenam is a strategic futurist known as The Digital Speaker. He stands at the forefront of the digital age and lives and breathes cutting-edge technologies to inspire Fortune 500 companies and governments worldwide. As an optimistic dystopian, he has a deep understanding of AI, blockchain, the metaverse, and other emerging technologies, and he blends academic rigour with technological innovation.

His pioneering efforts include the world’s first TEDx Talk in VR in 2020. In 2023, he further pushed boundaries when he delivered a TEDx talk in Athens with his digital twin, delving into the complex interplay of AI and our perception of reality. In 2024, he launched a digital twin of himself offering interactive, on-demand conversations via text, audio or video in 29 languages, thereby bridging the gap between the digital and physical worlds – another world’s first.

As a distinguished 5-time author and corporate educator, Dr Van Rijmenam is celebrated for his candid, independent, and balanced insights. He is also the founder of Futurwise, which focuses on elevating global digital awareness for a responsible and thriving digital future.
