zsxkib/star

STAR Video Upscaler: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

STAR: Spatial-Temporal Video Super-Resolution

STAR is a text-guided video super-resolution model that enhances low-quality videos while maintaining temporal consistency. It leverages text-to-video models to generate high-quality reference frames and combines them with spatial-temporal features during upscaling.

More visual results can be found on our project page and video demo.

Usage

The model accepts:

  • A video file (supported formats: mp4, avi, mov)
  • An optional text prompt describing the video content
  • A target upscaling factor (default: 4x)

The model outputs an enhanced, higher-resolution version of the input video.
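As a rough illustration of the inputs listed above, the sketch below assembles and validates a request payload before sending it to Replicate. The input key names (`video`, `prompt`, `upscale`) and the helper names are assumptions made for this example, not the model's confirmed schema; check the model's API tab for the actual field names. The model reference may also need a pinned version suffix, depending on the client version.

```python
from pathlib import Path
from typing import Optional

# Formats listed above as supported inputs.
SUPPORTED_EXTENSIONS = {".mp4", ".avi", ".mov"}


def build_star_input(video_path: str, prompt: Optional[str] = None, upscale: int = 4) -> dict:
    """Assemble an input payload. The key names here are hypothetical."""
    ext = Path(video_path).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported video format: {ext or '(none)'}")
    if upscale < 1:
        raise ValueError("upscale factor must be a positive integer")
    payload = {"video": video_path, "upscale": upscale}
    if prompt:  # the text prompt is optional
        payload["prompt"] = prompt
    return payload


def run_star(payload: dict):
    """Send the payload to Replicate (needs the `replicate` package and an API token)."""
    import replicate  # imported lazily so build_star_input works without it

    with open(payload["video"], "rb") as video_file:
        # Replace the path with an open file handle for upload.
        return replicate.run("zsxkib/star", input={**payload, "video": video_file})
```

Calling `build_star_input("clip.mp4", prompt="a street at night")` returns a dictionary that can be passed to `run_star`; validating locally first avoids uploading a file in an unsupported format.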

Limitations

  • For optimal results, input videos should be at least 240p resolution
  • Processing time increases with video length and resolution
  • Due to VRAM requirements, longer videos may need to be processed in segments
  • The CogVideoX-5B variant only supports 720x480 input resolution
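One way to process a long video in segments is to plan frame ranges up front and upscale each range separately. The sketch below is a minimal planner under stated assumptions: the function name, the segment length, and the optional overlap between neighbouring segments (intended to ease blending at the seams) are illustrative choices, not part of STAR itself. A suitable segment length depends on available VRAM and input resolution.

```python
from typing import List, Tuple


def plan_segments(total_frames: int, segment_len: int, overlap: int = 0) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges that cover the whole video."""
    if total_frames < 0:
        raise ValueError("total_frames must be non-negative")
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_len")

    step = segment_len - overlap  # how far each new segment advances
    segments: List[Tuple[int, int]] = []
    start = 0
    while start < total_frames:
        end = min(start + segment_len, total_frames)
        segments.append((start, end))
        if end == total_frames:  # the last segment reached the final frame
            break
        start += step
    return segments


# 100 frames, 32-frame segments, 8 frames shared between neighbours
print(plan_segments(100, 32, overlap=8))
# [(0, 32), (24, 56), (48, 80), (72, 100)]
```

Each range can then be cut from the source video, upscaled independently, and the results concatenated, with overlapping frames blended or dropped.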

Model Versions

Two variants are available:

  1. I2VGen-XL-based:
     • Light degradation model: Best for mild quality enhancement
     • Heavy degradation model: Optimized for severely degraded videos

  2. CogVideoX-5B-based:
     • Specialized for heavy degradation scenarios
     • Fixed input resolution of 720x480
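The choice between variants can be expressed as a small lookup. The sketch below is only an illustration of the constraints described above: the function name and the variant labels are hypothetical, and the only rules encoded are that light degradation maps to the I2VGen-XL light model, and that the CogVideoX-5B model is a candidate only for heavily degraded 720x480 input.

```python
from typing import List

# Fixed input resolution of the CogVideoX-5B variant, as stated above.
COGVIDEOX_RESOLUTION = (720, 480)


def candidate_variants(degradation: str, width: int, height: int) -> List[str]:
    """List variants that fit the input. The labels are illustrative, not official identifiers."""
    if degradation == "light":
        return ["I2VGen-XL (light degradation)"]
    if degradation == "heavy":
        candidates = ["I2VGen-XL (heavy degradation)"]
        # CogVideoX-5B only accepts its fixed input resolution.
        if (width, height) == COGVIDEOX_RESOLUTION:
            candidates.append("CogVideoX-5B")
        return candidates
    raise ValueError("degradation must be 'light' or 'heavy'")
```

For example, a heavily degraded 1280x720 clip yields only the I2VGen-XL heavy model, since the CogVideoX-5B variant would first require resizing the input to 720x480.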

Citation

@misc{xie2025starspatialtemporalaugmentationtexttovideo,
      title={STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution}, 
      author={Rui Xie and Yinhong Liu and Penghao Zhou and Chen Zhao and Jun Zhou and Kai Zhang and Zhenyu Zhang and Jian Yang and Zhenheng Yang and Ying Tai},
      year={2025},
      eprint={2501.02976},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License

  • I2VGen-XL-based models: MIT License
  • CogVideoX-5B-based model: CogVideoX License

Maintained by @zsxkib for Replicate integration