July 13, 2024



Today, promising generative AI video startup Runway ML unveiled its newest foundation model, Gen-3 Alpha, which allows users to generate high-quality, ultra-realistic 10-second scenes with many different camera movements, using only text prompts, still imagery, or pre-recorded video.

We had the chance to speak with Runway’s co-founder and chief technology officer (CTO) Anastasis Germanidis about Gen-3 Alpha and its place in the overall, fast-moving, increasingly competitive video AI software space, and how Runway views its place in the market.

We also learned the rollout plan for Gen-3 Alpha (paid users first in the coming days, free users after) and how Runway plans to meet the competition head on.

Read on for our exclusive interview (edited for length and clarity):




VentureBeat: I know you have a busy day, obviously, and congratulations on the launch of Gen-3 Alpha…tell me a little bit about how Gen-3 is different than Gens-1 and 2? What are the biggest differentiators?

Anastasis Germanidis: Since we released Gen-2, more than a year ago now, we learned a few different things: the number one thing was, when we first released Gen-2, you could only prompt it in a very simple manner with text, and we quickly added a lot of controls around camera motion, around object motion, and that proved to be really important for how people are using Gen-2 today.

So in Gen-3 Alpha, we basically invested a lot more time and resources in that, and spent a lot of time on the captioning of the data that goes into training.

Now you can prompt it with really complex interactions: how the camera moves, different styles, and the interaction of characters. So that’s a big focus, number one.

Number two is just the scale of compute. It’s been proven on the language domain how much scaling compute can lead to a much greater range of capabilities for models. And we’re seeing the same with video.

With increased compute we saw that the model was able to learn things like geometric consistency of objects and characters — not morphing as the video progresses over time, which has been an issue with previous video generation models. We also learned that prompt adherence improves quite a bit as you scale.

The third thing that we really paid attention to is this idea that Gen-3 is the base model, and we’re really building a modular framework where you can easily plug in things like image-to-video, things like camera control in a way that leverages the base model in all its capabilities, but also kind of adds those additional layers of control.

And this will allow us to essentially bring a lot more tools on top of Gen-3, faster than we could before. And that’s what you’re going to see in the coming months: a lot of tooling that’s on top of Gen-3, and different kinds of tools for different use cases and different kinds of generations.

In the last couple weeks, we had [rival AI video model] Kling come out from China, and then we had Dream Machine from Luma AI. You were already going to do some sort of Gen-3 regardless of what anybody else was doing, I presume. But how much does the competition and the fact that users are excited about these other models factor into your plans for what you’re developing and what you’re offering?

As you mentioned, Gen-3 was in the works way before those newer models came into play.

For us, it’s definitely important that we remain state of the art and that we’re able to provide really amazing results for users. It’s the foundation and the starting point, but the base model is not necessarily a long-term differentiator. It’s really what needs to be there in order for all the other controls and tools to come forward.

What’s different about the way we build things is that Gen-3 is going to be deployed on this existing, very rich set of tools that are being actively used, that have been fine-tuned with the input of the community over years, versus a model that doesn’t have this whole existing infrastructure to leverage.

I think that’s the most important piece: that the current product of Runway is kind of a distillation of so much input from artists over the past five years, and Gen-3 is going to arrive embedded into that.

I remember when Sora was first shown off [in February 2024]; it was this amazing tool. OpenAI still hasn’t actually released it yet, but I remember Cristóbal Valenzuela, your CEO and co-founder, tweeted something like “game on.” So surely, you guys are paying attention to what else is out there.

Where do you see the market going? Are users going to have different AI tools that they can use for different purposes? Or is it winner take all? How many AI video models do you think realistically can exist and be profitable or have a sustaining business?

Taking a step back and going into a bit of a long-term view: we can imagine, everyone’s going to have a photorealistic video generation model in two years. That’s going to be table stakes, at that point. So what makes a difference?

It’s really about having great uses for those models, building the community around those models, and creating.

The way we’re thinking about this is that a new genre is emerging around AI film, and it’s not just about releasing models, but really working with folks to make sure both that those models are usable for them, and also that we can showcase their work and support the community.

The promise and potential of video generation is vast. Video is the majority of the traffic on the internet, and there are so many different use cases for it that go beyond content creation and storytelling, even though that’s our focus.

I think it’s going to be a very, very large market with players in different narrow and domain-specific areas. I don’t think we’re going to be the only ones, but I do think our approach of building really controllable storytelling tools that address professional creators is going to remain the biggest differentiator for us. And it’s very difficult to build out this platform, build out this community, unless you come from the art world or have the mindset of building creative tools and not just research.

In the announcement for Gen-3 Alpha, it says you worked hand-in-hand with artists and filmmakers and creatives to train this model and develop it. Are you prepared to share the names of any of the filmmakers, or are they going to be able to come out and share? When are we going to learn more about who’s involved in this?

To clarify about the messaging of the blog post, we have an in-house creative team that’s been collaborating closely with researchers on the team to build this model. So it’s been primarily that in-house team of artists and video creators, VFX artists.

Oh, so like Nicolas Neubert?

Exactly. Yeah!

It also mentions in the blog post that you are working with other media and entertainment companies on custom models of Gen-3 Alpha. I think people were excited about that as well, to have a specific model for their needs or their characters, or storytelling or style. Are you prepared to name any of the media organizations that you are working with on that, and what are the big differences between the stock Gen-3 Alpha and custom versions?

We haven’t publicly announced a lot of those partnerships. One that we have announced in the past has been Getty Images. We work closely with them to build custom models for specific enterprise customers.

Gen-3 Alpha is much more versatile than Gen-2, but you still get some performance improvements from fine-tuning, in terms of being able to generate the same character consistently across shots, or capturing exactly a specific style or brand.

So I don’t imagine that going away, even though fine-tuning could be faster or easier for us with a better base model — it’s still very relevant.

You mentioned in the blog post that Gen-3 Alpha was trained on text and images and video. Is that different than prior models?

Generally, video generation models are trained on paired text and video data. The difference in this case has been how descriptive and detailed the captions were. And traditionally, the way those older models have been trained is you have a single short caption that describes a whole video.

In our case, we have really detailed captions and also multiple captions throughout the course of the video, so you get a much more precise definition of like, the details of the scene and also what changes during the scene.
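To make the contrast concrete, here is a minimal, hypothetical sketch of what a training record with multiple time-aligned captions might look like, versus the single whole-clip caption of older approaches. This is purely illustrative — the field names and structure are assumptions, not Runway’s actual data format:

```python
from dataclasses import dataclass


@dataclass
class CaptionSpan:
    """One detailed caption covering a time window (in seconds) of a video."""
    start: float
    end: float
    text: str


@dataclass
class TrainingRecord:
    """A video paired with several time-aligned captions, rather than a
    single short caption describing the whole clip."""
    video_path: str
    captions: list[CaptionSpan]


record = TrainingRecord(
    video_path="clips/0001.mp4",
    captions=[
        CaptionSpan(0.0, 4.0, "A slow dolly-in on a woman reading by a window."),
        CaptionSpan(4.0, 7.5, "She looks up; the camera pans to rain outside."),
        CaptionSpan(7.5, 10.0, "Close-up of raindrops streaking down the glass."),
    ],
)

# A full text description of the clip can be assembled from the spans in
# temporal order, capturing both the scene and how it changes over time.
full_caption = " ".join(
    span.text for span in sorted(record.captions, key=lambda s: s.start)
)
```

With captions anchored to time windows, the training signal can describe not just what the scene contains but what changes within it, which is the distinction Germanidis draws above.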

Are those captions added by your employees or some contracting firm that you work with to annotate them, or who’s supplying that kind of information so that you can train?

We do that captioning work ourselves.

Already, people are asking — as with all AI models — where the training data for Gen-3 Alpha comes from. You mentioned you work with Getty Images, so presumably, there would be some information share there, but correct me if that’s erroneous. Are there other training data sources that you’re prepared to disclose? You guys are being sued, OpenAI is too, for training on copyrighted data. A lot of these AI models are being sued for copyright infringement for the training aspect. I’m curious about what you can say about the training data in this case: what’s in the mix this time — is it licensed or publicly available or what?

The training data we use are proprietary, so I can’t share a lot more details on that. We have a lot of data partnerships that go beyond the Getty data partnership, but we haven’t publicly shared those. That might happen in the future. But we work with companies that provide highly curated video data.

So you’re not able to say beyond that, how much it is licensed versus non-licensed, or publicly available?

The terms of the training data that we use are proprietary information at this point. We might disclose more details in the future, but at this point we can’t share more.

Editor’s note: a Runway spokesperson emailed VentureBeat after this interview with the following information regarding Gen-3 Alpha training data:

We have an in-house research team that oversees all of our training and we use curated, internal datasets to train our models. 

The Gen-3 Alpha announcement post didn’t mention exactly when the model would be available to users, nor did your posts on X, other than “soon.” It’s not available immediately today. Do we have a sense of timing? Is it going to be days, weeks, or when can people actually expect to get their hands on it?

It’s definitely going to be more like days than months. Right now, we’re in the process where we want to make sure we’re going to have enough capacity to be able to serve that model, and that’s why we’re going to do a gradual rollout to paid users first, but then we hope to make it available to all free users, too, in the span of the coming days.

People already use Runway Gen-1 and Gen-2 for professional Hollywood films. And there’s this interest from the filmmaking industry in some of these AI models, with reports of OpenAI and Meta and others going to Hollywood. How much of Gen-3 Alpha do you view as a professional filmmaking tool versus for the indie creator and the individual? Which market do you see as the primary user, or maybe it’s both?

I think the perspective we’ve always had on this is that the usage of those models is a spectrum. We’ve had professional filmmakers use Gen-1 and Gen-2, but of course, they didn’t generate every single shot with those models, necessarily. It might have been some B roll footage, or it might have just been used like a storyboarding tool to get a preview or an 80% version.

With Gen-3 Alpha, we expect more and more of that will go into production. But again, I don’t think it’s like a kind of 0 to 1 thing, where we want to convert every single part of the filmmaking process to use those models. I think it’s going to be more of a gradual transition.

And why should people invest time and money in AI tools as opposed to going out and filming with more traditional means, just a camera or [Adobe] After Effects or something?

The speed of iteration is the number one thing. Just having an idea and being able to turn it into a version on film as quickly as possible is really valuable. You don’t have to travel somewhere to plan to take a shot in the real world. You can now, if you have an idea, bring it to life very quickly.

But I think it can always be a combination of traditional filmmaking techniques and those generative models.

For filmmakers, you might have a strange idea that you don’t know is going to work, and it’s not worth investing weeks or years to turn it into reality. Those strange ideas might end up becoming your best ideas, but you don’t know. You often won’t know, because it isn’t worth the risk/reward of turning them into a form — and now you can. And so we expect more of those ideas to finally see the light of day.

Do you see a future where fully generated AI film maybe dominates the form? Or how do you see it? How long will generative AI remain a component of the process versus the entire process?

I think it’s definitely going to grow and grow in terms of its place in the process. What we saw with our AI Film Festival that we did last month was that there was a mix of 100% generated and partially generated films. I think that mix is still going to remain.

It’s never going to be you have a single prompt and you generate the whole film. It’s always going to be a very iterative process where your intention as an artist is guiding the generation, and you are refining your idea as you’re creating with a tool.

Even if, technically, all the video outputs are generated with a model like Gen-3 Alpha, I still would not call it 100% AI generated, because it involved a lot of human input to actually get to the final result.


