Our Approach to Investing in Multimodal AI


Andy Sran + Paige Doherty

Introduction to Multimodal AI 

Multimodal AI (MMAI) combines, interprets, and outputs multiple data types – including text, images, audio, and video – in an attempt to emulate human-like perception. This integration allows MMAI systems to respond to complex, multi-faceted scenarios in ways that are more intuitive and interactive, handling real-world complexity much as a human would.

MMAI has its roots in both academia and the private sector. In academia, MMAI’s journey can be traced back to the exploration of fundamental neural networks and machine learning applications, which primarily focused on single-mode data processing. This work established the foundational building blocks used in later multimodal research, such as convolutional neural networks for image processing and recurrent neural networks for sequence data like text and audio. These technologies laid the groundwork for the more complex multimodal interactions that we’re beginning to see now.

The private sector played a crucial role in accelerating the development and application of MMAI. Companies like Google, IBM, and Microsoft invested heavily in research and development, leading to breakthroughs like Google's Multimodal Neural Machine Translation system, which integrated text and images for translation. This system was a significant advancement, as it demonstrated the practicality and effectiveness of combining different data types for complex AI tasks. Later, MMAI benefited from the advent of the transformer model, which further bolstered the capabilities and robustness of the technology.

The Difference between Text-Based AI and Multimodal AI

On November 30, 2022, OpenAI released ChatGPT, the text-based AI that gave most mainstream consumers their first introduction to the technology and that at the time primarily specialized in processing and generating text. ChatGPT and competitors such as Google’s Bard and Anthropic’s Claude are language-based models, meaning they primarily understand and produce text-based content: the input and output are both text.

What are text-based models uniquely good at? These models are highly effective for tasks like conversation, text content generation, rudimentary data interpretation, answering questions, and text completion. ChatGPT and similar models learn from vast libraries of text data, allowing them to generate contextually relevant responses.

By contrast, multimodal AI goes beyond the single input of text. MMAI integrates and interprets multiple types of data inputs simultaneously, including text, images, audio, and video. The output can also be multimodal; for instance, a text prompt can produce video output. This capability gives MMAI a more comprehensive understanding of complex scenarios, which again aims to mimic how humans perceive and interact with the world using multiple senses.

The Convergence 

At its core, MMAI is about one concept: convergence. It blends different forms of data inputs – text, images, audio, video, and more – to generate a more holistic picture of the world as output. What’s distinctive about this approach is that it comes closer to mimicking how we as humans process information, integrating sights, sounds, and other senses to form a complete picture of our surroundings.

Our own senses – sight, sound, touch, taste, and smell – work in harmony, not in isolation. This multisensory integration allows us to perceive the world in a rich, multi-dimensional way. Our brains do not just process these sensory inputs; they weave them into a cohesive whole of experience, memory, and emotion. MMAI seeks to emulate this human capability. It converges diverse data streams – visual images, auditory sounds, textual information – to understand and interact with the world and produce an output.

Just as humans use their senses to navigate their environment, MMAI uses its multiple modalities to process information more holistically and accurately.

The Challenge in Multimodal AI: Inherent Technical Complexity

Due to its inherent complexity, MMAI is much more technically involved than unimodal AI, and as a result, executing it properly and accurately is difficult. From a technical standpoint, it’s important to understand that MMAI produces output through a series of distinct, data-specific procedures rather than one single process.

This process begins with the acquisition of different data forms, be it text, images, audio, or video. Each data type is pre-processed into an analyzable format: images are normalized and resized, text is tokenized and vectorized, and audio is converted into analytical representations such as spectrograms. Each data stream is then processed by a model tailored to that specific modality. MMAI systems generally leverage neural networks, particularly deep convolutional neural networks for images and transformers for sequential data like text and speech. These networks, however, are notorious for requiring immense computational power, often necessitating advanced GPUs or TPUs for parallel processing and quick data handling.
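
To make the pre-processing steps above more concrete, here is a minimal Python sketch rather than a production pipeline. It assumes grayscale images stored as nested lists, whitespace tokenization with a toy vocabulary, and a naive discrete Fourier transform for the spectrogram. Every function name is hypothetical and chosen only for illustration; real systems typically rely on subword tokenizers and optimized FFT libraries.

```python
import math


def normalize_image(pixels):
    """Scale 8-bit pixel values (0-255) into the range [0.0, 1.0]."""
    return [[value / 255.0 for value in row] for row in pixels]


def resize_nearest(pixels, new_h, new_w):
    """Resize a 2D grid of pixels using nearest-neighbour sampling."""
    old_h, old_w = len(pixels), len(pixels[0])
    return [
        [pixels[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


def tokenize(text):
    """Lowercase the text and split it on whitespace (a simplification)."""
    return text.lower().split()


def vectorize(tokens, vocab):
    """Map each token to an integer id; unknown tokens map to 0."""
    return [vocab.get(token, 0) for token in tokens]


def spectrogram(samples, frame_size):
    """Magnitude spectrogram from a naive DFT over non-overlapping frames."""
    spec = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        magnitudes = []
        # Only the first frame_size // 2 + 1 bins are unique for real input.
        for k in range(frame_size // 2 + 1):
            angle = 2 * math.pi * k / frame_size
            re = sum(x * math.cos(angle * n) for n, x in enumerate(frame))
            im = -sum(x * math.sin(angle * n) for n, x in enumerate(frame))
            magnitudes.append(math.hypot(re, im))
        spec.append(magnitudes)
    return spec
```

Each function returns plain lists, so the three outputs (a normalized pixel grid, a list of token ids, and a list of per-frame frequency magnitudes) could each be handed to a modality-specific model in the next stage.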

The challenge for MMAI lies not just in managing diverse data types, but in synthesizing them to produce an accurate and meaningful output. This requires not only strong algorithms and substantial computational resources, but also intricately designed models that can capture the nuances of each data type while producing a single coherent output. It’s an art as much as it is a science.
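
As an illustration of what synthesizing modalities can mean in practice, the sketch below shows one of the simplest approaches: concatenation-based fusion. It assumes each modality has already been encoded into a fixed-length vector by its own model, joins those vectors in a fixed order, and applies a single linear scoring layer. The function names and the numbers in the usage note are hypothetical, and this is not a description of how any particular MMAI product works; real systems often learn far richer interactions between modalities.

```python
def concat_fusion(embeddings):
    """Concatenate per-modality vectors, ordered by modality name.

    `embeddings` maps a modality name (e.g. "text") to a list of floats.
    Sorting by name keeps the layout of the fused vector stable.
    """
    fused = []
    for name in sorted(embeddings):
        fused.extend(embeddings[name])
    return fused


def linear_score(vector, weights, bias=0.0):
    """Apply a single linear layer: the dot product of inputs and weights, plus a bias."""
    if len(vector) != len(weights):
        raise ValueError("vector and weights must have the same length")
    return sum(v * w for v, w in zip(vector, weights)) + bias
```

For example, `concat_fusion({"text": [1.0, 2.0], "image": [3.0]})` places the image vector first because the names are sorted, and the fused vector can then be passed to `linear_score` with one weight per element.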

Partly due to its technical complexity, MMAI still faces roadblocks to widespread, mainstream adoption. Beyond immense computing power, MMAI systems also require vast and diverse datasets to be trained effectively, and these datasets must be accurately labeled, a process that can be time-consuming and labor-intensive.

Another major obstacle is the complexity involved in integrating and synchronizing different data modalities. Each type of data – be it visual, textual, or auditory – has its unique characteristics, requiring specialized processing techniques. Developing models that can effectively handle this multimodal data integration, while maintaining accuracy and efficiency, is a significant technical challenge: it takes time.

The Challenge of Multimodal AI: Data Privacy

Additionally, challenges such as data privacy, ethical considerations, and the need for robust and unbiased datasets will also need to be addressed. As MMAI systems become more integrated into our everyday life, the way they handle and process our personal data becomes a critical concern. 

Unsurprisingly, data privacy issues arise from the extensive collection and analysis of personal information, which, if not properly managed, could lead to breaches or misuse of sensitive data. Ensuring the privacy and security of user data requires stringent protocols, and possibly even new regulatory frameworks.

The Good News

Obstacles aside, MMAI is poised to be transformative, with the potential to significantly shape numerous aspects of technology and daily life, and 2024 is shaping up to be its breakout year. As computational capabilities continue to advance, we can expect MMAI systems to become more sophisticated, enabling more seamless and intuitive interactions between humans and machines. Applications can be widespread: entertainment, healthcare, ed-tech, and more can all benefit from MMAI.

Below, we explore some predictions for MMAI in 2024, a year in which we expect the technology to make significant strides across various domains and the pace of AI progress to accelerate well beyond where we stand today. The following predictions build on our earlier piece, “Investing in Early Stage AI.”

2024 MMAI Trends We’re Watching

1. Incumbent Breakthroughs -> A New Startup Ecosystem

Whether it’s OpenAI’s GPT-4V or Google’s Gemini, incumbents are investing heavily in MMAI. Because startups can build atop these multimodal models through the incumbents’ APIs, they can leverage those research breakthroughs directly, which opens wide room for disruption. Expect a wave of companies innovating in verticals far and wide, from healthcare to ed-tech to entertainment and much more.

2. The Tangible Convergence: MMAI and Wearables

Meta launched its collaboration with Ray-Ban, Humane released its Ai Pin, and OpenAI is reportedly pursuing an entry into the wearable space. Wearables will become increasingly integrated with advanced MMAI, transforming how we interact with our digital and physical environments. We expect a greater symbiosis between MMAI and humans, facilitated by verticalized wearable technology.

3. Video Generation MMAI: MMAI-Generated Deep Fakes

Deep fakes have been a concern since the earliest days of the evolving AI landscape. While traditional deep fakes mostly involve manipulating images or videos, multimodal AI deep fakes take this a step further by integrating various data types, making them more sophisticated and extremely difficult to detect. Per James Landay of Stanford’s Institute for Human-Centered Artificial Intelligence: “I expect to see big new multimodal models, particularly in video generation. Therefore we’ll also have to be more vigilant to serious deep fakes – we’ll see the spread of videos in which people ‘say’ things that they never said. Consumers need to be aware of that, voters need to be aware of it.” Especially in an election year, we’re closely watching for AI-driven applications in media literacy and content moderation.

4. Spatial Sensors: The Inevitable Intersection of AR/VR x MMAI

At Behind Genius Ventures, we’ve been investing in consumer hardware-enabled software like Aviron, The Last Gameboard, Hearth Display, and Lotus Labs since 2021. Most VCs shy away from hardware, but we believe deeply in its ability to transform the world as we know it and produce meaningful returns. 

So what happens when hardware comes into the loop? Our prediction is that specialized vertical hardware companies will fill the gaps: companies that are really good at one modality or another (touch, audio, etc.). We’re on the lookout for innovative solutions in the space. We’re also curious to see if there will be a next-generation “Homebrew Computer Club,” where folks jam on innovative hardware ideas. It’s unlikely that the first solution announced will be a perfect one, as we’re still super early, so experimentation is important.

Spatial computing vs multimodal AI: which buzzword will win? 

Apple’s Vision Pro will likely be for AR/VR what ChatGPT was for AI. AR/VR and MMAI will interact by combining AR/VR’s immersive environments with MMAI’s ability to process and respond to various data types, creating more interactive, responsive, and realistic virtual experiences. The applications here are vast as well. With Apple’s design acumen, brand power, and immense distribution advantage, the Vision Pro has a chance to drive one of the more revolutionary technological shifts in recent memory, and it will be powered in large part by on-device MMAI.

Apple’s submission guidelines for the Vision Pro App Store specifically instruct developers not to describe applications as AR, VR, XR, or MR. Instead, they require the term “spatial computing.” Expect this term to be used far and wide in 2024, often directly alongside (or as a substitute for) multimodal AI.

In Conclusion: Remaining Cautiously Optimistic

Above, we provide an introduction to Multimodal AI (MMAI), explaining its roots in both academia and the private sector. Multimodal AI combines and interprets various data types, such as text, images, audio, and video, aiming to emulate human-like perception. We outline the difference between text-based AI models, like ChatGPT, and Multimodal AI, emphasizing how the latter integrates multiple data inputs simultaneously, providing a more comprehensive understanding of complex scenarios.

The core concept of MMAI is convergence, blending different forms of data to generate a holistic picture, similar to how humans process information through multiple senses. We also discuss the technical complexity and challenges of Multimodal AI, including the need for diverse datasets, computational power, and addressing data privacy concerns. Despite the obstacles, we anticipate transformative breakthroughs in 2024, with MMAI making significant strides in various domains, such as healthcare, ed-tech, and entertainment. Predictions for 2024 include breakthroughs from incumbent companies, the integration of MMAI with wearables, the emergence of MMAI-generated deep fakes, and the intersection of AR/VR with MMAI through spatial computing.  

We welcome thoughts, ideas, and discussions around this topic - we remain, as ever, cautiously optimistic about the future.