Depth estimation with faster inference, fewer parameters, and higher depth accuracy.
Finer and Faster Text-to-Image Generation via Relay Diffusion
CogVLM2: Visual Language Models for Image and Video Understanding
Scalable Streaming Speech Synthesis with Large Language Models
Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Depth Any Video with Scalable Synthetic Data
Generating Consistent Long Depth Sequences for Open-world Videos
Extended video synthesis model that generates 128 frames
Efficient Visual Generation with Hybrid Autoregressive Transformer
A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
Diffusion-based Visual Foundation Model for High-quality Dense Prediction
DiT-based video generation model that produces high-quality videos in real time
Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis
Sharp Monocular Metric Depth in Less Than a Second
High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training
Autoregressive Image Generation without Vector Quantization
Autoregressive Video Generation without Vector Quantization
Minimal and Universal Control for Diffusion Transformer - demo for Spatially aligned control
Minimal and Universal Control for Diffusion Transformer - demo for Subject-driven generation
Converting an LLM's coding capability to image generation