Google’s Vision for AI: Merging Gemini and Veo Models to Enhance Multimodal Understanding

In a recent episode of the Possible podcast, co-hosted by LinkedIn co-founder Reid Hoffman, Google DeepMind CEO Demis Hassabis said Google plans to combine its Gemini AI models with its Veo video-generating models. The goal is to improve Gemini’s understanding of the physical world, a step toward a universal digital assistant that can help users in real-world situations.

Hassabis emphasized that Gemini was designed from the outset to be multimodal, able to process and generate several kinds of media, including text, images, and audio. Combining it with Veo is expected to extend that capability to video, giving the model a fuller picture of the physical environment. The move fits a broader industry trend toward omni models, AI systems that can understand and synthesize multiple forms of media.

Multimodal models of this kind depend on large and varied training data. Hassabis indicated that Veo’s video training data comes predominantly from YouTube, a platform owned by Google. By analyzing a very large volume of YouTube videos, Veo can work out the physics of the world, which should help the AI interpret and act in real-world scenarios.

The integration of Gemini and Veo is part of Google’s broader strategy to develop AI systems that can perform complex tasks autonomously. Hassabis highlighted that the next iteration of AI will focus on agent-based systems with traits such as planning, acting, reasoning, better memory, personalization, and the ability to use tools. These systems are envisioned to think ahead, plan trips, book tickets, and take actions in the real world, moving beyond passive information retrieval to active problem-solving.

This vision is rooted in DeepMind’s history. Hassabis noted that from the beginning, DeepMind’s systems were designed to plan, carry out actions, and achieve goals, which laid the groundwork for more advanced AI agents. Combining Gemini and Veo is a natural next step toward AI systems that can reason, break down problems, and act in the world.

Building such systems is not without challenges. Hassabis acknowledged that the industry is seeing diminishing returns from simply increasing the size of large language models. He said that reaching artificial general intelligence (AGI) will require two or three more significant breakthroughs, comparable to past innovations such as deep reinforcement learning and transformers, before systems can reason, plan, and interact with the world in a manner akin to human intelligence.

Google’s commitment to advancing AI is evident in its recent initiatives. At the Google Cloud Next 2025 event, the company announced advances in AI, cloud computing, and enterprise solutions. Highlights included the Ironwood TPU, which Google says delivers 42.5 exaflops per pod at full scale, and plans for $75 billion in capital investment in 2025, focused on data centers and infrastructure. Google also announced upgrades to the Gemini family, including Gemini 2.5 and Gemini 2.5 Flash, along with AI agents for a range of industries and supporting tools such as Agentspace and the Agent Development Kit. These developments underscore Google’s intent to keep its leading position in AI.

The planned combination of Gemini and Veo reflects Google’s aim of building AI that understands and interacts with the world across modalities. By combining the strengths of both models, Google hopes to produce systems that can carry out complex tasks autonomously, which it presents as a significant step toward artificial general intelligence.