TwelveLabs Raises $100M Series B to Expand Enterprise Video AI
TwelveLabs, a San Francisco-based enterprise video AI company led by co-founder and CEO Jae Lee, raised a USD $100M Series B to expand its video-native multimodal foundation models. The round was co-led by NEA and NAVER Ventures, with participation from Amazon, Radical Ventures, Korea Investment Partners, Index Ventures, Quadrille Capital, and Red Bull Ventures.
The investment arrives as enterprises confront an uncomfortable reality: video has become one of the largest and least searchable sources of organizational data. Cameras record everything, but finding the moment that matters is still painfully hard, which is exactly the gap TwelveLabs is trying to turn into infrastructure.
For TwelveLabs, this is more than another venture round. It is a signal that video-native artificial intelligence is moving from experimental capability toward enterprise systems that can search, organize, and reason across the footage companies already own.
What Happened
Founded in 2021, TwelveLabs develops multimodal AI models built specifically for video understanding. Rather than adapting text-first large language models to visual content, the company's architecture starts with video as the primary intelligence layer.
The USD $100M Series B was announced on July 1, 2026. No valuation was disclosed. The announced investor group includes co-leads NEA and NAVER Ventures, along with Amazon, Radical Ventures, Korea Investment Partners, Index Ventures, Quadrille Capital, and Red Bull Ventures.
The company said the new capital will accelerate research and development across perception, knowledge, reasoning, and orchestration systems. It also plans to keep investing in San Francisco and Seoul while supporting expansion through New York and London.
Why TwelveLabs Is Different
The AI market is crowded with companies adding video features to existing language-model workflows. TwelveLabs has taken the harder route by building models around the complexity of video itself, where meaning depends on speech, sound, motion, visual context, and time.
Its flagship models include Marengo 3.0 and Pegasus 1.5. Marengo 3.0 focuses on video embedding and retrieval, while Pegasus 1.5 converts video into structured information that software can use for reasoning, summarization, and analysis.
That matters because searching a video archive is not the same as creating persistent video memory. Rather than processing each request as an isolated lookup, persistent memory means an AI system can retain structured understanding across entire video libraries and help organizations turn footage into institutional knowledge.
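For readers unfamiliar with embedding-based retrieval, the general idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not TwelveLabs' API or the way Marengo works internally: the clip names, the three-dimensional vectors, and the `search` helper are all hypothetical, and real video embeddings have far more dimensions. It shows only the basic technique of ranking stored clips by cosine similarity to a query vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy index: clip id -> embedding vector.
# In a real system these vectors would come from a video embedding model.
clip_index = {
    "clip_loading_dock": [0.9, 0.1, 0.0],
    "clip_press_conference": [0.1, 0.8, 0.3],
    "clip_goal_replay": [0.0, 0.2, 0.9],
}

def search(query_embedding, index, top_k=2):
    """Return the top_k (clip_id, score) pairs, most similar first."""
    scored = [
        (clip_id, cosine_similarity(query_embedding, embedding))
        for clip_id, embedding in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical query vector, standing in for an embedded text query.
results = search([0.85, 0.2, 0.05], clip_index)
for clip_id, score in results:
    print(f"{clip_id}: {score:.3f}")
```

Because the index is built once and queried many times, the stored vectors act as a simple form of the persistent memory described above: each new question reuses the same structured representation instead of reprocessing the footage.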
Rodeo, TwelveLabs' first application-layer product, points to that next layer. The company is not only building foundation models; it is also trying to reduce the work enterprises face when deploying video intelligence inside real operating environments.
Why Investors Are Paying Attention
Venture capital has become more selective as enterprise AI moves beyond demos toward measurable business value. TwelveLabs' investor syndicate combines established venture firms with strategic participants that understand infrastructure, distribution, and enterprise adoption.
Amazon's participation is especially important because TwelveLabs' relationship with AWS already extends beyond the investment. The company's models are available through Amazon Bedrock, which lets enterprises access foundation models through AWS-managed infrastructure, and TwelveLabs is working with AWS to optimize inference workloads on AWS Trainium.
Infrastructure partnerships like that can matter more than the headline funding figure. They suggest the company is being evaluated not only as a model developer, but as a potential video intelligence layer for enterprise AI systems that need scale, reliability, and deployment paths buyers already trust.
Market Context
Enterprise AI has spent the last several years focused heavily on text, but video is a more difficult and increasingly strategic category. Every second of video contains visual signals, audio, language, motion, timing, and relationships between events, which makes the data richer and harder to structure than a document.
At the same time, organizations keep generating video across security systems, manufacturing facilities, healthcare environments, transportation networks, media production, advertising workflows, sports operations, and connected devices. The gap between captured information and accessible knowledge keeps widening as more footage accumulates.
TwelveLabs is positioning itself inside that gap by treating video as a knowledge system waiting to be indexed, structured, and understood. That aligns with a broader enterprise AI shift from content generation toward operational intelligence, where buyers want systems that retrieve institutional knowledge and support decisions from existing data assets.
What This Signals
One of the easiest mistakes in technology is confusing noise with momentum. Funding announcements generate headlines, but durable infrastructure changes industries, and TwelveLabs appears to be building toward the second category.
The roadmap extends beyond foundation models into application-layer products like Rodeo while continuing investment across perception, knowledge, reasoning, and orchestration. Customers increasingly care less about the model in isolation and more about whether the technology integrates into workflows, scales reliably, and produces useful answers from proprietary data.
If enterprises keep prioritizing operational intelligence over isolated AI features, video-native reasoning platforms could become foundational components of modern information architecture. TwelveLabs' Series B suggests investors believe the category is large enough to deserve infrastructure-scale investment.
The Bigger Industry Shift
The most valuable AI companies may not be the ones generating the most content. They may be the ones helping organizations understand decades of content they already own, especially in formats that have historically been difficult to search or reason over.
Video is one of the world's richest information sources, yet it remains one of the least accessible. TwelveLabs' Series B does not prove the company will define the category, but it does show that enterprise AI is expanding beyond language into memory, context, and reasoning across every form of organizational knowledge.
Frequently Asked Questions
Why is video-native AI different from text-first AI for enterprises?
Video contains speech, sound, motion, visual context, and timing in the same asset, so it is harder to search and reason over than text. TwelveLabs is building models around video as the primary data type rather than treating video as an add-on to a language workflow.
Why does Amazon's participation matter in TwelveLabs' Series B?
Amazon is not only an investor in the round. TwelveLabs' models are available through Amazon Bedrock, and the company is working with AWS on Trainium-based inference optimization, which gives the startup a clearer enterprise distribution and infrastructure path.
What does this funding signal about enterprise AI infrastructure?
The round suggests investors see video intelligence as a serious infrastructure category, not a narrow media feature. As companies accumulate more video data, tools that can search, summarize, and reason across that data may become part of the enterprise AI stack.
What should operators watch next from TwelveLabs?
Operators should watch how TwelveLabs turns its foundation models into application-layer products such as Rodeo and how deeply its AWS relationship supports production deployments. The key question is whether the company can make large video archives usable inside everyday enterprise workflows.









