OpenAI, an AI startup, has unveiled Sora, a text-to-video model that may push the boundaries of generative AI.
Like Lumiere, a text-to-video model from Google, Sora has restricted availability. Unlike Lumiere, however, Sora can produce videos up to one minute long.
Building on the Sora announcement, ElevenLabs, an AI voice company, said a few days later that it is developing text-generated sound effects for videos.
As OpenAI, Google, Microsoft, and others look beyond text and image generation and seek to solidify their position in a sector projected to reach $1.3 trillion in revenue by 2032, text-to-video has emerged as the latest front in the generative AI arms race. The goal is to win over consumers who have been intrigued by generative AI since ChatGPT arrived a little more than a year ago.
In a post published Thursday, OpenAI, the company behind ChatGPT and Dall-E, said Sora will be made available to "red teamers," specialists in fields like bias, hate speech, and disinformation who will be "adversarially testing the model," as well as to visual artists, designers, and filmmakers, from whom it is seeking feedback.
Adversarial testing will be especially important for addressing the problem of convincing deepfakes, a major concern when AI is used to produce images and video.
The AI startup stated that it hopes to “give the public a sense of what AI capabilities are on the horizon” by sharing its accomplishments and soliciting input from outside the organization.
Strengths
Sora's capacity to decipher lengthy prompts, with one example weighing in at 135 words, may be what makes it stand out. Sora can generate a wide range of characters and settings, from humans, animals, and fluffy monsters to cityscapes, landscapes, zen gardens, and even an underwater version of New York City, as seen in the sample videos OpenAI posted on Thursday.
This has been made possible in part by OpenAI's earlier work on its GPT and Dall-E models. Dall-E 3, a text-to-image generator, was introduced in September; Stephen Shankland of CNET called it "a big step up from Dall-E 2 from 2022." (GPT-4 Turbo, OpenAI's most recent AI model, followed in November.)
The sample videos OpenAI posted do look incredibly lifelike, with the possible exception of close-up shots of human faces and of swimming sea life. Otherwise, you might have a hard time telling what is and isn't real.
Like Lumiere, the model can also create videos from still images, extend existing videos, and fill in missing frames.
Weaknesses
OpenAI acknowledged that the model has weaknesses, such as struggling to accurately simulate the physics of a complex scene and to understand cause and effect. "For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark," the post said.
And rest assured that Sora also mixes up left and right, so anyone who still has to make an L with their hands to figure out which is left can take comfort in that.
It isn't known when Sora will be made publicly available, but OpenAI said it wants to take "several important safety steps" first. That includes adhering to OpenAI's existing safety standards, which prohibit graphic violence, pornographic material, hate speech, celebrity likenesses, and the use of others' intellectual property.
Sound Effects
In a blog post about AI sound effects published Monday, ElevenLabs detailed how it used prompts such as "waves crashing," "metal clanging," "birds chirping," and "racing car engine" to generate audio to overlay on some of Sora's AI-generated videos for added effect.
The post didn't disclose a release date for ElevenLabs' text-to-sound generation tool, but it did state, "We're thrilled by the excitement and support from the community and can't wait to get it into your hands."