Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Finer and Faster Text-to-Image Generation via Relay Diffusion
CogVLM2: Visual Language Models for Image and Video Understanding
Scalable Streaming Speech Synthesis with Large Language Models
Depth estimation with faster inference, fewer parameters, and higher accuracy
Depth Any Video with Scalable Synthetic Data
Generating Consistent Long Depth Sequences for Open-world Videos
Extended video synthesis model that generates 128 frames
Efficient Visual Generation with Hybrid Autoregressive Transformer
A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
Diffusion-based Visual Foundation Model for High-quality Dense Prediction
DiT-based video generation model that produces high-quality videos in real time
Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis
Sharp Monocular Metric Depth in Less Than a Second
High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training
Autoregressive Image Generation without Vector Quantization
Autoregressive Video Generation without Vector Quantization
Minimal and Universal Control for Diffusion Transformer - demo for spatially aligned control
Minimal and Universal Control for Diffusion Transformer - demo for subject-driven generation
Converting an LLM's coding capability to image generation