deepseek-ai / deepseek-vl2

DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL

  • Public
  • 52.3K runs
  • A100 (80GB)
  • GitHub
  • Weights
  • Paper
  • License

Input

  • image (file, required): the input image file
  • text prompt (string, required): guides the model's generation
  • additional sampling options, including temperature and 3 more
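For reference, here is a minimal sketch of calling this model through the Replicate Python client. Only `image` is named explicitly above, so the `prompt` key and the `temperature` option are assumptions based on the field descriptions:

```python
# pip install replicate   (requires REPLICATE_API_TOKEN in the environment)
import replicate

# "prompt" and "temperature" are assumed input names; only "image" is
# spelled out on this page.
output = replicate.run(
    "deepseek-ai/deepseek-vl2",
    input={
        "image": open("figure.png", "rb"),            # input image file
        "prompt": "Describe this figure in detail.",  # text prompt to guide the model
        "temperature": 0.7,                           # README suggests T <= 0.7
    },
)
print(output)
```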

Output

The figure illustrates a three-stage process for training and fine-tuning a Vision-Language (VL) model using the DeepSeek LLM framework. Here's a detailed description of each stage:

### Stage 1: Training VL Adaptor
- **Components Involved**:
  - **DeepSeek LLM**: This is the foundational language model used throughout the stages.
  - **Vision-Language Adaptor**: This component is responsible for adapting the language model to handle vision-language tasks.
  - **Hybrid Vision Encoder**: This encoder processes visual data and combines it with textual information.
  - **SAM-B and SigLIP-L**: These are specific models or components within the hybrid vision encoder that contribute to processing image-text pairs.
- **Process**:
  - The DeepSeek LLM is used as the base model.
  - The Vision-Language Adaptor is trained on image-text pairs, which involves aligning visual and textual information.
  - The Hybrid Vision Encoder processes these image-text pairs, integrating both visual and textual features.

### Stage 2: Joint VL Pre-training
- **Components Involved**:
  - **DeepSeek LLM**: Continues to be the core language model.
  - **Vision-Language Adaptor**: Further refined through joint pre-training.
  - **Hybrid Vision Encoder**: Enhanced to better handle interleaved vision and language sequences.
  - **SAM-B and SigLIP-L**: Continue to play roles in encoding visual and textual data.
- **Process**:
  - The model undergoes joint pre-training using interleaved vision and language sequences.
  - This step helps the model learn to effectively combine and process both types of data simultaneously.
  - The Vision-Language Adaptor and Hybrid Vision Encoder are further optimized during this phase.

### Stage 3: Supervised Finetuning
- **Components Involved**:
  - **DeepSeek LLM**: Now fully integrated into the VL system.
  - **Vision-Language Adaptor**: Fully adapted and ready for specific tasks.
  - **Hybrid Vision Encoder**: Finalized and capable of handling complex VL tasks.
  - **SAM-B and SigLIP-L**: Continue their roles in encoding and processing data.
- **Process**:
  - The model is fine-tuned using VL chat data and pure language chat data.
  - This supervised finetuning phase refines the model's performance on specific VL tasks, such as conversational understanding and generation.
  - The Vision-Language Adaptor and Hybrid Vision Encoder are fine-tuned to ensure they work seamlessly together for the desired outcomes.

Overall, this three-stage process leverages the strengths of the DeepSeek LLM and specialized adaptors to create a robust Vision-Language model capable of handling various tasks involving both visual and textual data.

Run time and cost

This model costs approximately $0.0048 to run on Replicate, or 208 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 4 seconds.
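As a quick sanity check on those figures, here is a small sketch using only the numbers quoted above; the implied per-second rate is derived, not an official price:

```python
# Back-of-the-envelope check of the pricing quoted above.
cost_per_run = 0.0048       # USD, approximate
typical_runtime_s = 4.0     # seconds on an A100 (80GB)

runs_per_dollar = 1 / cost_per_run                 # ~208 runs per $1
implied_rate = cost_per_run / typical_runtime_s    # ~$0.0012 per GPU-second (derived)

print(f"~{runs_per_dollar:.0f} runs per $1, implying ~${implied_rate:.4f} per second")
```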

Readme

1. Introduction

Introducing DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL. DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models.

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Github Repository

Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan (* Equal contribution, ** Project lead, *** Corresponding author)

2. Model Summary

DeepSeek-VL2 is built on DeepSeekMoE-27B.

3. Quick Start

Notifications

  1. We suggest using a temperature T <= 0.7 when sampling; we observe that larger temperatures degrade generation quality.
  2. To keep the number of tokens in the context window manageable, we apply the dynamic tiling strategy only when there are <= 2 images. When there are >= 3 images, we directly pad each image to 384×384 as input without tiling (see the sketch after this list).
  3. The main difference between DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2 is the base LLM.
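A minimal sketch of the preprocessing rule from point 2, assuming PIL images and a simple centre-pad onto a black 384×384 canvas; the dynamic tiling itself is left to the repository and only stubbed out here:

```python
from PIL import Image

TILE_SIZE = 384  # base resolution mentioned in point 2

def pad_to_square(img: Image.Image, size: int = TILE_SIZE) -> Image.Image:
    """Downscale if needed, then centre-pad onto a size x size black canvas."""
    img = img.convert("RGB")
    img.thumbnail((size, size))  # preserves aspect ratio
    canvas = Image.new("RGB", (size, size), (0, 0, 0))
    canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
    return canvas

def preprocess(images: list[Image.Image]) -> list[Image.Image]:
    """Tile when there are <= 2 images, pad every image when there are >= 3."""
    if len(images) <= 2:
        # Dynamic tiling path: the repository splits each image into several
        # 384x384 tiles here; that logic is omitted from this sketch.
        raise NotImplementedError("dynamic tiling is handled by the model code")
    return [pad_to_square(img) for img in images]
```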

Gradio Demo (TODO)
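While the Gradio demo is still marked TODO, the checkpoints behind the Weights link can be fetched locally. Here is a minimal sketch using huggingface_hub, assuming the repository id mirrors the model name on this page:

```python
# pip install huggingface_hub
from huggingface_hub import snapshot_download

# The repository id is an assumption based on the model name and Weights link above.
local_dir = snapshot_download(repo_id="deepseek-ai/deepseek-vl2")
print(f"Checkpoints downloaded to: {local_dir}")
```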

4. License

This code repository is licensed under the MIT License. Use of the DeepSeek-VL2 models is subject to the DeepSeek Model License. The DeepSeek-VL2 series supports commercial use.

5. Citation

@misc{wu2024deepseekvl2mixtureofexpertsvisionlanguagemodels,
      title={DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding}, 
      author={Zhiyu Wu and Xiaokang Chen and Zizheng Pan and Xingchao Liu and Wen Liu and Damai Dai and Huazuo Gao and Yiyang Ma and Chengyue Wu and Bingxuan Wang and Zhenda Xie and Yu Wu and Kai Hu and Jiawei Wang and Yaofeng Sun and Yukun Li and Yishi Piao and Kang Guan and Aixin Liu and Xin Xie and Yuxiang You and Kai Dong and Xingkai Yu and Haowei Zhang and Liang Zhao and Yisong Wang and Chong Ruan},
      year={2024},
      eprint={2412.10302},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.10302}, 
}

6. Contact

If you have any questions, please raise an issue or contact us at service@deepseek.com.