zsxkib / star

STAR Video Upscaler: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

  • Public
  • 241 runs
  • H100
  • GitHub
  • Weights
  • Paper
  • License

Input

file (required)

Input video file to enhance

string

Detailed text description of video content. Include: Main subjects, colors, motion details, quality aspects. Example: '4K close-up of a golden retriever running through autumn leaves, vibrant orange and yellow colors, sharp details'

Default: "Realistic high quality video with realistic details and vibrant colors"

integer
(minimum: 1, maximum: 4)

Super-resolution scaling factor.

Default: 4

solver_mode
string

Sampling strategy: 'fast' (fixed 15 steps) for quick results, 'normal' (custom steps) for quality tuning

Default: "normal"

integer
(minimum: 2, maximum: 50)

Number of diffusion steps (normal mode only). 2-5: balanced; 10-50: extreme detail (much slower)

Default: 5

number
(minimum: 1, maximum: 20)

Text-video alignment strength. Lower: creative interpretation (5.0-7.5), Higher: strict prompt adherence (8.0-15.0)

Default: 7.5

integer
(minimum: 1, maximum: 32)

Frame group size for temporal processing. Higher values improve motion consistency but increase VRAM usage (24 = ~8GB VRAM)

Default: 24

integer
(minimum: 1, maximum: 10)

Parallel processing batches.

Default: 3
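Taken together, the parameters above form the model's input payload. The sketch below clamps each value to its documented range before submission; note that every key except `solver_mode` is an illustrative assumption, since this page lists only types and descriptions, not parameter names.

```python
def clamp(value, lo, hi):
    """Clamp a numeric value to its documented [lo, hi] range."""
    return max(lo, min(hi, value))

def build_inputs(video, prompt=None, upscale=4, solver_mode="normal",
                 steps=5, guide_scale=7.5, max_chunk_len=24, batches=3):
    """Assemble an input payload for the STAR upscaler.

    All key names except "solver_mode" are assumptions; verify them
    against the model's API schema before use.
    """
    return {
        "video": video,                    # required input video file/URL
        "prompt": prompt or ("Realistic high quality video with "
                             "realistic details and vibrant colors"),
        "upscale": clamp(int(upscale), 1, 4),            # scaling factor
        "solver_mode": solver_mode,        # "fast" (15 steps) or "normal"
        "steps": clamp(int(steps), 2, 50),               # normal mode only
        "guide_scale": clamp(float(guide_scale), 1, 20), # prompt adherence
        "max_chunk_len": clamp(int(max_chunk_len), 1, 32),  # frame group size
        "batches": clamp(int(batches), 1, 10),           # parallel batches
    }
```

Out-of-range values are clamped rather than rejected, mirroring the min/max bounds listed above.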

Output

The enhanced, higher-resolution version of the input video.

Run time and cost

This model costs approximately $1.24 per run on Replicate (just under one run per $1), though cost varies with your inputs. It is also open source, and you can run it on your own computer with Docker.

This model runs on Nvidia H100 GPU hardware; predictions typically complete within 14 minutes, though predict time varies significantly with the inputs.

Readme

STAR: Spatial-Temporal Video Super-Resolution

STAR is a powerful text-guided video super-resolution model that can enhance low-quality videos while maintaining temporal consistency. It leverages text-to-video models to generate high-quality reference frames and combines them with spatial-temporal features for superior upscaling results.

More visual results can be found on our project page and video demo.

Usage

The model accepts:

  • A video file (supported formats: mp4, avi, mov)
  • An optional text prompt describing the video content
  • A target upscaling factor (default: 4x)

The model outputs an enhanced, higher-resolution version of the input video.
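A minimal invocation with Replicate's Python client might look like the sketch below. Only `solver_mode` is named on this page, so the other input keys are assumptions; check the model's API schema before relying on them.

```python
# Sketch: inputs for running the STAR upscaler via Replicate's Python
# client. Keys other than "solver_mode" are assumptions based on the
# parameter descriptions above.
STAR_MODEL = "zsxkib/star"

star_inputs = {
    "video": "https://example.com/input.mp4",  # or an open file handle
    "prompt": ("4K close-up of a golden retriever running through autumn "
               "leaves, vibrant orange and yellow colors, sharp details"),
    "upscale": 4,
    "solver_mode": "normal",
    "steps": 5,
}

# Requires REPLICATE_API_TOKEN in the environment:
# import replicate
# output = replicate.run(STAR_MODEL, input=star_inputs)
# print(output)  # URL of the enhanced video
```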

Limitations

  • For optimal results, input videos should be at least 240p resolution
  • Processing time increases with video length and resolution
  • Due to VRAM requirements, longer videos may need to be processed in segments
  • The CogVideoX-5B variant only supports 720x480 input resolution
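When a long video must be processed in segments, one simple approach is to split the frame range into groups matching the frame-group-size parameter (24 frames by default, ~8GB VRAM). This helper is an illustrative pre-processing sketch, not part of the model itself.

```python
def frame_chunks(total_frames, chunk_len=24):
    """Yield (start, end) frame index ranges of at most chunk_len frames.

    chunk_len mirrors the frame-group-size input above (default 24);
    smaller groups reduce VRAM usage at some cost to motion consistency.
    """
    for start in range(0, total_frames, chunk_len):
        yield (start, min(start + chunk_len, total_frames))

# A 100-frame clip with the default group size splits into 5 segments:
# (0, 24), (24, 48), (48, 72), (72, 96), (96, 100)
```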

Model Versions

Two variants are available:

  1. I2VGen-XL-based:
     • Light degradation model: best for mild quality enhancement
     • Heavy degradation model: optimized for severely degraded videos
  2. CogVideoX-5B-based:
     • Specialized for heavy degradation scenarios
     • Fixed input resolution of 720x480

Citation

@misc{xie2025starspatialtemporalaugmentationtexttovideo,
      title={STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution}, 
      author={Rui Xie and Yinhong Liu and Penghao Zhou and Chen Zhao and Jun Zhou and Kai Zhang and Zhenyu Zhang and Jian Yang and Zhenheng Yang and Ying Tai},
      year={2025},
      eprint={2501.02976},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License

  • I2VGen-XL-based models: MIT License
  • CogVideoX-5B-based model: CogVideoX License

Maintained by @zsxkib for Replicate integration