ByteDance, the parent company of TikTok, has dropped a family of joint image-video generation models called Goku. The models seem to be named after the popular anime character ‘Goku’ from the Dragon Ball series.
This comes right after the company teased a video AI model that generates videos from images, dubbed OmniHuman-1.
Researchers claim the Goku models can create product videos featuring AI-generated influencers, marketing avatar clips, landscape demos, visualisations of Chinese poetry, portrait videos, and more.

The research paper attributes the model’s ability to generate high-quality videos to several key factors. These include a rectified flow (RF) formulation for joint image and video generation, and a 3D joint image-video VAE that compresses inputs into a shared latent space.
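The article doesn’t spell out the training objective, but rectified flow in general trains a model to predict a constant velocity along the straight line between a noise sample and a data sample. A minimal sketch of that training target, using NumPy and toy latent shapes (the array sizes here are illustrative, not Goku’s actual latent dimensions):

```python
import numpy as np

def rectified_flow_pair(x0, x1, t):
    """Linear interpolation between noise x0 and data x1 at time t,
    plus the constant velocity target the model learns to predict."""
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight noise-to-data path
    v_target = x1 - x0              # velocity is constant along that path
    return x_t, v_target

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))   # noise sample (stand-in for a latent patch)
x1 = rng.standard_normal((4, 8))   # data sample in the shared latent space
x_t, v_target = rectified_flow_pair(x0, x1, 0.5)
```

In training, a network would be regressed onto `v_target` at random times `t`; at `t = 0.5` the interpolated point sits exactly halfway between noise and data.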
Moreover, the architecture features a Transformer network with full attention, enhanced with techniques like FlashAttention, sequence parallelism, Patch n’ Pack, 3D RoPE position embedding, and Q-K normalisation.
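Of the techniques listed, Q-K normalisation is simple enough to sketch: queries and keys are L2-normalised before the dot product, which bounds the attention logits and stabilises training. A minimal single-head version in NumPy (this is a generic illustration of the technique, not Goku’s implementation, which also uses FlashAttention, sequence parallelism, and 3D RoPE):

```python
import numpy as np

def qk_norm_attention(q, k, v, eps=1e-6):
    """Scaled dot-product attention with Q-K normalisation:
    queries and keys are L2-normalised along the feature axis,
    so every logit is bounded by 1/sqrt(d)."""
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    logits = q @ k.T / np.sqrt(q.shape[-1])
    # numerically stable softmax over the key axis
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(1)
q = rng.standard_normal((3, 4))    # 3 query tokens, head dim 4
k = rng.standard_normal((6, 4))    # 6 key tokens
v_in = rng.standard_normal((6, 5)) # value dim 5
out = qk_norm_attention(q, k, v_in)
```

Because the normalised dot products are cosine similarities, the logits stay bounded regardless of how large the raw activations grow.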
The paper also states that the Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting new benchmarks when compared to competitors like Luma, Open-Sora, Mira, and Pika.
Goku achieved 0.76 on GenEval, 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video tasks. You can see the benchmark results below.

“We believe that this work provides valuable insights and practical advancements for the research community in developing joint image-and-video generation models,” the researchers said.
The models’ ability to generate high-quality product videos featuring AI-generated influencers and other realistic visuals could be a significant boon for content creators, marketers, and advertisers.