The Future of Video Marketing: Inside ByteDance's Game-Changing Boximator
- Shivendra Lal
- Sep 3, 2024
- 4 min read
Have you seen James Gerde's AI video art channel on Instagram? If you haven't, you should check it out. He's an early adopter of AI-based video creation: he takes footage from other content creators and combines it with text prompts to generate new videos. His stuff is a great example of how AI can help humans create better content.
Social media is slowly being taken over by synthesized images and videos. My earlier episode covered the incredible marketing potential of AI-based content generation; I'll put a link in the description. Getting back to synthesized content: it's an exciting time to be an artist, a content creator or a marketer. And ByteDance, the company behind TikTok, has announced something even more exciting! We'll start by looking at some of the key challenges diffusion models face, then get into what ByteDance has achieved...
Challenges in synthesized video generation
There's something magical about the kind of video content being shared online; anybody who's worked in video production knows what I mean. Yet despite all the bells and whistles in synthetically generated videos, the underlying diffusion models still can't get the audience to fully suspend their disbelief, which is key to good storytelling.
Diffusion models can generate high-quality images from text prompts, but synthesized video is a whole new ballgame. A simple text prompt like "a squirrel on the roof of a house" will show a squirrel on top of a house, but may leave out details, such as nuts or seeds, that would make the frame richer and more engaging. It would take multiple iterations of the prompt to get the desired output.
Diffusion models struggle to compose scenes with multiple key objects, like characters in specified positional relationships. Taking the squirrel example further, let's imagine it's joined by another squirrel carrying nuts. The interaction between the squirrels, the nuts, the rooftop, the weather, the light, etc., might seem pretty obvious to a human; we've all seen similar references in movies or in real life. A diffusion model, though, has to keep track of all the objects throughout the sequence and introduce new ones when needed; manage their individual positions; maintain the interactions between objects; and make sure the transitions between frames are smooth.
Today, diffusion models struggle to bring all of these aspects together, especially with text-based prompts. Creating a synthetic storytelling video requires consistent tracking of multiple objects, explicit control over their positions, seamless and consistent frame-to-frame transitions, and a simple way to specify all of this. Text prompts alone make it difficult to generate story-based videos.
Text prompts also struggle to capture an object's shape or size, so modifying a subject's pose or the distance between objects is nearly impossible.
Enter Boximator and its features
The research team at ByteDance claims to have solved this problem with a new approach to object control using boxes. The approach uses two types of boxes, hard boxes and soft boxes, to enable fine-grained motion control. Users first select objects in a frame with hard boxes, defining a specific object such as a person or a mobile phone, and then use either type of box to roughly or rigorously constrain the object's position, shape, or motion path in future frames.
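The post doesn't show Boximator's actual data format, but as a rough mental model you can think of a hard box as a tight bounding box around an object and a soft box as a wider region the object merely has to stay inside. Here's a minimal Python sketch of that distinction; the class, function names, and tolerance value are all illustrative, not ByteDance's API:

```python
from dataclasses import dataclass


@dataclass
class Box:
    # Normalized corner coordinates in [0, 1], as in typical object detection.
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Box") -> bool:
        """True if `other` lies entirely inside this box."""
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and other.x1 <= self.x1 and other.y1 <= self.y1)


def satisfies_constraint(obj: Box, constraint: Box, kind: str) -> bool:
    """Check a per-frame box constraint on a generated object.

    'hard' demands a tight fit around the object; 'soft' only requires
    the object to stay somewhere inside the (wider) region.
    """
    if kind == "hard":
        tol = 0.02  # illustrative tolerance, not a published value
        return all(abs(a - b) <= tol for a, b in
                   zip((obj.x0, obj.y0, obj.x1, obj.y1),
                       (constraint.x0, constraint.y0, constraint.x1, constraint.y1)))
    return constraint.contains(obj)  # soft box: containment is enough


squirrel = Box(0.40, 0.10, 0.55, 0.30)
roof_region = Box(0.30, 0.05, 0.70, 0.40)
print(satisfies_constraint(squirrel, roof_region, "soft"))  # True
print(satisfies_constraint(squirrel, roof_region, "hard"))  # False
```

The point of the sketch is the asymmetry: a soft box leaves the model freedom within a region, while a hard box pins the object down almost exactly.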
Boximator functions as a plug-in for existing video diffusion models. Its training process preserves the base model's knowledge by freezing the base model and training only the control module. To address training problems that might crop up, it uses a novel self-tracking technique that simplifies learning the correlation between an object in the frame and the box assigned to it.
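"Freeze the base model, train only the control module" is a standard pattern in deep learning, and a minimal PyTorch sketch makes it concrete. The tiny `base_model` and `control_module` below are stand-ins, not Boximator's real architecture; the point is only how the freezing and optimizer setup work:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a "base" video diffusion model whose knowledge we
# want to preserve, and a small control module that injects box conditioning.
base_model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
control_module = nn.Linear(16, 64)  # e.g. maps box coordinates to conditioning

# Freeze every parameter of the base model so gradients never touch it.
for p in base_model.parameters():
    p.requires_grad_(False)

# Only the control module's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(control_module.parameters(), lr=1e-4)

trainable = sum(p.numel() for p in control_module.parameters() if p.requires_grad)
frozen_trainable = sum(p.numel() for p in base_model.parameters() if p.requires_grad)
print(trainable > 0, frozen_trainable == 0)  # True True
```

Because the base model's weights never change, whatever it already knows about generating video is preserved, and only the much smaller control module has to be learned.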
According to ByteDance, this approach improves video quality in several ways: it handles dynamic scenes better, the box constraints produce more realistic layouts, and it can satisfy multiple types of object alignment at the same time. Using hard and soft boxes significantly improves the precision of object positioning, which results in much better motion control.
3 key benefits of Boximator
With Boximator, a hard box tightly isolates an object within a bounding box, while a soft box defines a wider region within which the object must stay. ByteDance claims this gives it three key benefits over other diffusion models.
It's a flexible motion control tool. It can steer the motion of both foreground and background objects, and even the pose of larger objects (like humans) by adjusting their smaller components.
With Boximator, users can select objects by drawing boxes around them rather than relying on image-to-video and text-to-video prompts alone. It's more straightforward than language-based controls that require a verbal description of everything.
Using algorithm-generated soft boxes, Boximator can control approximate motion paths even when users haven't drawn a box around an object.
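To see why box-based motion paths are simpler than describing motion in words, consider the most basic version: the user draws a box in the first frame and another in the last frame, and intermediate boxes are interpolated. This sketch is my own simplification, not the paper's method, and uses plain linear interpolation:

```python
def lerp_box(b0, b1, t):
    """Linearly interpolate two (x0, y0, x1, y1) boxes at t in [0, 1]."""
    return tuple(a + (b - a) * t for a, b in zip(b0, b1))


def motion_path(start, end, n_frames):
    """One box per frame as the object moves from `start` to `end`."""
    return [lerp_box(start, end, i / (n_frames - 1)) for i in range(n_frames)]


# The squirrel moves left to right along the roof over 5 frames.
path = motion_path((0.10, 0.20, 0.25, 0.40), (0.70, 0.20, 0.85, 0.40), 5)
print(path[2])  # middle frame: roughly (0.40, 0.20, 0.55, 0.40)
```

Two clicks define the whole trajectory; expressing "move smoothly from the left edge of the roof to the right edge" in a text prompt is far less precise.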
Based on their qualitative testing, the ByteDance team found that the results looked more realistic, with objects following complex user-defined paths and interactions. Boximator manages composite elements like a man on a horse, and can control objects' count, size, proximity, and more.
Boximator could change the future of video marketing - something for marketers to look forward to
The features Boximator offers suggest a high level of versatility in generating videos that balance quality, diversity, and user control. Boximator could potentially save a lot of computing power by externalizing motion specifications.
The versatility of synthetic video generation is the future of video marketing and can make marketers' lives easier on many levels. Because Boximator plugs into existing video diffusion models, it could be integrated with tools marketers already use, saving the time and effort of learning a new platform.
At a conceptual level, marketers could use Boximator to automate certain aspects of video production, like storyboarding and object animation. They could also make videos with dynamic elements, such as moving products or changing their colour.
Also, they can make multiple versions of the same video customized to a particular demographic or other customer preferences.
With the capabilities ByteDance has talked about, marketers could edit an ongoing meme video quickly. It could help increase awareness of their brand, product, or both.
No AI product on the market is risk-free, and Boximator is not a full-fledged product yet. ByteDance has clearly cited ethical and social risks: Boximator could be used to create deepfakes; AI-generated content could carry biases that lead to unfair or discriminatory results; and it could affect the creative industries and intellectual property, undermining the value of human work.
Nevertheless, Boximator could address one of the major drawbacks of text-to-video diffusion models today: limited storytelling capabilities. Boximator seems to pack a punch from that perspective!