About Step-Video-T2V
Step-Video-T2V is a state-of-the-art text-to-video pre-trained model with 30 billion parameters, capable of generating videos up to 204 frames in length. It employs a deep compression Variational Autoencoder (Video-VAE) for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese.
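To make the compression ratios concrete, here is a minimal sketch of the latent-grid arithmetic they imply; the example clip dimensions are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the latent-grid arithmetic implied by the stated
# compression ratios (16x16 spatial, 8x temporal). The clip dimensions
# below are illustrative assumptions, not values from the paper.

def latent_shape(frames: int, height: int, width: int) -> tuple[int, int, int]:
    """Latent (frames, height, width) after Video-VAE compression."""
    return (frames // 8, height // 16, width // 16)

# A hypothetical 64-frame, 512x512 clip compresses to an 8x32x32 latent
# grid -- the much shorter sequence the downstream DiT actually denoises.
# Exact boundary handling (e.g., for a 204-frame clip) depends on the
# VAE's causal design, which this sketch does not model.
print(latent_shape(64, 512, 512))  # -> (8, 32, 32)
```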
Introduction
We present Step-Video-T2V, a model whose design targets both training and inference efficiency. Direct Preference Optimization (DPO) is applied in the final training stage to further enhance the visual quality of the generated videos; a generic sketch of the underlying preference objective follows. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating state-of-the-art text-to-video quality compared to both open-source and commercial engines.
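For intuition, the following is a minimal sketch of the generic DPO preference loss (Rafailov et al., 2023) that Video-DPO-style methods build on. The paper's exact video formulation may differ, and the tensor names are hypothetical stand-ins.

```python
import torch.nn.functional as F

# Generic DPO preference loss -- a sketch of the kind of objective that
# Video-DPO builds on; the paper's exact video formulation may differ.
# Inputs are hypothetical stand-ins: log-probabilities (or proxies such
# as negative diffusion losses) for human-preferred ("w") and rejected
# ("l") samples, under the trained policy and a frozen reference model.

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Encourage the policy to rank preferred samples above rejected ones."""
    policy_margin = policy_logp_w - policy_logp_l
    ref_margin = ref_logp_w - ref_logp_l
    # Penalize cases where the policy's preference margin does not exceed
    # the reference model's margin.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```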
Model Summary
- Video-VAE: Achieves 16x16 spatial and 8x temporal compression ratios while maintaining video quality.
- DiT with 3D Full Attention: Trained using Flow Matching to denoise input noise into latent frames (a minimal training-step sketch follows this list).
- Video-DPO: Reduces artifacts and improves the visual quality of generated videos.
- Training Strategies: Incorporates human feedback to align generated content with human expectations.
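As referenced in the DiT bullet above, here is a minimal sketch of one Flow Matching training step under common rectified-flow conventions. `dit` is a hypothetical stand-in for the 3D-full-attention transformer; shapes and text conditioning are simplified.

```python
import torch

# Minimal Flow Matching training step (rectified-flow convention:
# pure noise at t=0, data at t=1). `dit` is a hypothetical stand-in
# for the 3D-full-attention transformer named above.

def flow_matching_step(dit, latents, text_emb):
    """One training step: regress the velocity from noise toward data."""
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], device=latents.device)  # uniform timesteps
    t_ = t.view(-1, *([1] * (latents.dim() - 1)))            # broadcastable shape
    # Linear interpolation between noise (t=0) and data (t=1).
    x_t = (1.0 - t_) * noise + t_ * latents
    target_velocity = latents - noise        # constant velocity along the path
    pred_velocity = dit(x_t, t, text_emb)    # model predicts the velocity field
    return torch.mean((pred_velocity - target_velocity) ** 2)
```

At inference time, the same velocity field is integrated from noise to data with an ODE solver, which is what turns sampled noise into latent frames for the Video-VAE decoder.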
Model Download and Usage
- Download the model weights from Hugging Face or ModelScope (see the download sketch after this list).
- Ensure your system meets the requirements for running Step-Video-T2V.
- Follow the installation and inference scripts to generate videos.
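As a sketch of the download step, the Hugging Face Hub client can fetch the weights. The repo id below is an assumption; verify it against the official model card before running, and use the official repository's scripts for inference itself.

```python
from huggingface_hub import snapshot_download

# Sketch of the model-download step using the Hugging Face Hub client.
# The repo id is an assumption -- check the official model card first.
snapshot_download(
    repo_id="stepfun-ai/stepvideo-t2v",  # assumed repo id
    local_dir="./stepvideo-t2v",
)
```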
Note: This is an unofficial about page for Step-Video-T2V. For the most accurate information, please refer to the official paper: https://arxiv.org/abs/2502.10248.