zsxkib / star

STAR Video Upscaler: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

  • Public
  • 241 runs
  • H100
  • GitHub
  • Weights
  • Paper
  • License

Input

file (required)

Input video file to enhance

string

Detailed text description of video content. Include: Main subjects, colors, motion details, quality aspects. Example: '4K close-up of a golden retriever running through autumn leaves, vibrant orange and yellow colors, sharp details'

Default: "Realistic high quality video with realistic details and vibrant colors"

integer
(minimum: 1, maximum: 4)

Super-resolution scaling factor.

Default: 4

solver_mode
string

Sampling strategy: 'fast' (fixed 15 steps) for quick results, 'normal' (custom steps) for quality tuning

Default: "normal"

integer
(minimum: 2, maximum: 50)

Number of diffusion steps (normal mode only). 2-5: balanced; 10-50: extreme detail (much slower)

Default: 5

number
(minimum: 1, maximum: 20)

Text-video alignment strength. Lower: creative interpretation (5.0-7.5), Higher: strict prompt adherence (8.0-15.0)

Default: 7.5

integer
(minimum: 1, maximum: 32)

Frame group size for temporal processing. Higher values improve motion consistency but increase VRAM usage (24 = ~8GB VRAM)

Default: 24

integer
(minimum: 1, maximum: 10)

Parallel processing batches.

Default: 3
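Taken together, the parameters above form the model's input payload. The sketch below clamps each value to its documented range before submission; note that every key except `solver_mode` is an illustrative assumption, since this page lists only types and descriptions, not parameter names.

```python
def clamp(value, lo, hi):
    """Clamp a numeric value to its documented [lo, hi] range."""
    return max(lo, min(hi, value))

def build_inputs(video, prompt=None, upscale=4, solver_mode="normal",
                 steps=5, guide_scale=7.5, max_chunk_len=24, batches=3):
    """Assemble an input payload for the STAR upscaler.

    All key names except "solver_mode" are assumptions; verify them
    against the model's API schema before use.
    """
    return {
        "video": video,                    # required input video file/URL
        "prompt": prompt or ("Realistic high quality video with "
                             "realistic details and vibrant colors"),
        "upscale": clamp(int(upscale), 1, 4),            # scaling factor
        "solver_mode": solver_mode,        # "fast" (15 steps) or "normal"
        "steps": clamp(int(steps), 2, 50),               # normal mode only
        "guide_scale": clamp(float(guide_scale), 1, 20), # prompt adherence
        "max_chunk_len": clamp(int(max_chunk_len), 1, 32),  # frame group size
        "batches": clamp(int(batches), 1, 10),           # parallel batches
    }
```

Out-of-range values are clamped rather than rejected, mirroring the min/max bounds listed above.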

Output

The enhanced, higher-resolution version of the input video.

Run time and cost

This model costs approximately $1.24 per run on Replicate (just under one run per $1), though cost varies with your inputs. It is also open source, and you can run it on your own computer with Docker.

This model runs on Nvidia H100 GPU hardware; predictions typically complete within 14 minutes, though predict time varies significantly with the inputs.

Readme

STAR: Spatial-Temporal Video Super-Resolution

STAR is a powerful text-guided video super-resolution model that can enhance low-quality videos while maintaining temporal consistency. It leverages text-to-video models to generate high-quality reference frames and combines them with spatial-temporal features for superior upscaling results.

More visual results can be found on our project page and video demo.

Usage

The model accepts:

  • A video file (supported formats: mp4, avi, mov)
  • An optional text prompt describing the video content
  • A target upscaling factor (default: 4x)

The model outputs an enhanced, higher-resolution version of the input video.
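A minimal invocation with Replicate's Python client might look like the sketch below. Only `solver_mode` is named on this page, so the other input keys are assumptions; check the model's API schema before relying on them.

```python
# Sketch: inputs for running the STAR upscaler via Replicate's Python
# client. Keys other than "solver_mode" are assumptions based on the
# parameter descriptions above.
STAR_MODEL = "zsxkib/star"

star_inputs = {
    "video": "https://example.com/input.mp4",  # or an open file handle
    "prompt": ("4K close-up of a golden retriever running through autumn "
               "leaves, vibrant orange and yellow colors, sharp details"),
    "upscale": 4,
    "solver_mode": "normal",
    "steps": 5,
}

# Requires REPLICATE_API_TOKEN in the environment:
# import replicate
# output = replicate.run(STAR_MODEL, input=star_inputs)
# print(output)  # URL of the enhanced video
```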

Limitations

  • For optimal results, input videos should be at least 240p resolution
  • Processing time increases with video length and resolution
  • Due to VRAM requirements, longer videos may need to be processed in segments
  • The CogVideoX-5B variant only supports 720x480 input resolution
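When a long video must be processed in segments, one simple approach is to split the frame range into groups matching the frame-group-size parameter (24 frames by default, ~8GB VRAM). This helper is an illustrative pre-processing sketch, not part of the model itself.

```python
def frame_chunks(total_frames, chunk_len=24):
    """Yield (start, end) frame index ranges of at most chunk_len frames.

    chunk_len mirrors the frame-group-size input above (default 24);
    smaller groups reduce VRAM usage at some cost to motion consistency.
    """
    for start in range(0, total_frames, chunk_len):
        yield (start, min(start + chunk_len, total_frames))

# A 100-frame clip with the default group size splits into 5 segments:
# (0, 24), (24, 48), (48, 72), (72, 96), (96, 100)
```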

Model Versions

Two variants are available:

  1. I2VGen-XL-based:
     • Light degradation model: best for mild quality enhancement
     • Heavy degradation model: optimized for severely degraded videos
  2. CogVideoX-5B-based:
     • Specialized for heavy degradation scenarios
     • Fixed input resolution of 720x480

Citation

@misc{xie2025starspatialtemporalaugmentationtexttovideo,
      title={STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution}, 
      author={Rui Xie and Yinhong Liu and Penghao Zhou and Chen Zhao and Jun Zhou and Kai Zhang and Zhenyu Zhang and Jian Yang and Zhenheng Yang and Ying Tai},
      year={2025},
      eprint={2501.02976},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License

  • I2VGen-XL-based models: MIT License
  • CogVideoX-5B-based model: CogVideoX License

Maintained by @zsxkib for Replicate integration