Stable Video Diffusion (SVD) - the next cool AI toy in town to look out for!
- Shivendra Lal
- Sep 1, 2024
- 4 min read
There's another AI toy in the market, and it is supposed to have a fair amount of, in air quotes, Stability. Okay, that was a bad joke. Spare me this one. So, Stability AI announced Stable Video Diffusion a few weeks back, and it has been getting a fair bit of public attention.
There are so many of these AI-powered toys dropping that, to be honest, I find it exhausting to even keep track. This one caught my attention because it seems interesting and promising.
What's the fuss about Stable Video Diffusion (SVD)?
Stable Video Diffusion is a generative video AI model that can produce 25 video frames at a resolution of 576 x 1024 pixels. In a world that is gradually getting used to 4K video, this doesn't seem like much. But don't get distracted by resolution just yet.
SVD can adapt to a wide range of video applications. For instance, it can reason about 3D structure to tackle the problem of generating multi-view images from a single input image.
It is capable of generating videos of 14 or 25 frames at customizable frame rates between 3 and 30 frames per second. In simple terms, you can input an image and it will generate a short 576 x 1024 video based on that image. Now, this is no small feat by any standard!
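To put those frame counts in perspective, clip length is simply the number of frames divided by the frame rate. A quick back-of-the-envelope check (the 7 fps figure below is my own assumption of a typical output rate, not something stated in Stability AI's announcement):

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Duration of a generated clip: frame count divided by frame rate."""
    return num_frames / fps

# 25 frames at an assumed typical rate of 7 fps:
print(round(clip_duration_seconds(25, 7), 2))   # 3.57 — a few seconds at most
# 14 frames at the top supported rate of 30 fps:
print(round(clip_duration_seconds(14, 30), 2))  # 0.47 — under half a second
```

This is why, as noted further down, the generated clips top out at only a few seconds.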
What's under the hood?
While SVD is available for research purposes only, and there is a waitlist to get access to the tool, the underlying approach taken by Stability AI is quite impressive.
Instead of pre-training their model on a massive dataset, Stability focused on curating high-quality data. The comparatively smaller dataset of 2.5 million curated videos allowed them to focus on better prompt alignment, which users also preferred.
The dataset used for training the model comprised five key components:
The team focused on detecting and separating edited videos containing multiple scenes, to ensure that scene cuts do not mislead the AI model.
To generate more accurate captions for video clips, they used Google's Contrastive Captioner (CoCa) image-text foundation model. Captioning each clip this way helps the model achieve higher accuracy when generating a video from a text input.
While training on a complex dataset of subsampled videos, the Stability AI team made sure to separate videos with motion from static videos in the dataset.
They also cleaned up the dataset by identifying and removing clips that contained a lot of on-screen text, to make sure that the model was trained primarily on visual content.
And last but not least, the entire dataset was refined by assessing aesthetic appeal and text-image alignment using CLIP-based scoring.
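To make the scene-cut step above concrete, here is a minimal, purely illustrative sketch. Stability AI's actual pipeline uses a dedicated cut-detection method; the toy version below only captures the core idea of flagging large jumps in pixel intensity between consecutive frames (the flat-list frame representation and the threshold value are my own simplifications):

```python
def mean_abs_diff(frame_a, frame_b):
    """Average per-pixel intensity change between two frames (0-255 scale)."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def find_cuts(frames, threshold=80.0):
    """Indices i where the jump from frame i-1 to frame i suggests a hard cut."""
    return [i for i in range(1, len(frames))
            if mean_abs_diff(frames[i - 1], frames[i]) > threshold]

# A six-frame "video": three dark frames, then an abrupt jump to three bright ones.
video = [[10] * 16] * 3 + [[200] * 16] * 3
print(find_cuts(video))  # [3] — a cut detected at frame 3
```

A real curation pipeline would split the source video at each detected index, so that every training clip contains a single continuous scene.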
But, it's still buggy…
All this is heavy-duty jargon that can be hard to comprehend; even I find it taxing to explore every new AI toy that enters the market. Let me simplify it for you. Stability AI curated SVD's dataset so that editing artefacts like scene cuts would not be misinterpreted as video content. Each clip was given a text caption generated by the CoCa model, describing what is actually in the video. The team also segregated clips with motion from static ones, removed clips full of on-screen text, and filtered what remained for aesthetic appeal.
Like I said before, this is quite impressive. However, it is important to note that SVD is still a work in progress and is available primarily for research purposes. This AI-powered toy does not have all the bells and whistles yet, and has issues that need fixing:
At times it struggles to understand the complexity of 3D scenes in the sampled videos.
Video quality can degrade, with some regions appearing blurry.
The supported resolutions are currently limited, and the model struggles to scale videos to higher resolutions.
Generated videos are no longer than about four seconds.
The model does not generate photorealistic videos; they still appear synthetically generated.
Generated videos may lack motion, or show only limited motion, such as clouds drifting over a mountain range or the camera slowly panning left or right past a subject.
What promise does SVD hold, and why should marketers look forward to it?
SVD is a toy that needs more work. But, given the quality-focused approach to data curation and the current state of its capabilities, marketers should keep an eye on it. SVD could be the next cool toy in town, as it can help marketers with:
Generating videos that show products in action.
Tailoring video ads to users based on demographics, interests, and past behaviour.
Developing engaging and visually striking videos for various social media platforms like TikTok, Instagram, and LinkedIn.
Creating clear and concise explainer videos.
A/B testing with multiple brand storylines.
Adopting immersive and emotive storylines that resonate with the target audience.
And doing all of this at a fraction of the time and cost required to produce engaging video content.