deepseek-ai / deepseek-vl2-small

DeepSeek-VL2-Small, a member of DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL


Input

  • image (*file): Input image file
  • *string: Text prompt to guide the model
  • Additional parameters include temperature and 3 more...

Output

The diagram illustrates a three-stage process for training and fine-tuning a model using a combination of vision-language (VL) adaptation, pre-training, and supervised finetuning. Here's a detailed breakdown of each stage:

### Stage 1: Training VL Adaptor

- **Components Involved:**
  - **DeepSeek LLM:** This is the large language model that will be adapted to work with visual data.
  - **Vision-Language Adaptor:** A module designed to bridge the gap between visual inputs and the language model.
  - **Hybrid Vision Encoder:** This encoder processes visual information from images and text pairs. It consists of two components:
    - **SAM-B:** Likely a component related to spatial attention mechanisms.
    - **SigLIP-L:** Possibly a component related to signal processing or lip reading.
- **Process:**
  - The hybrid vision encoder takes image-text pairs as input.
  - The vision-language adaptor then uses these processed visual features to interact with the DeepSeek LLM.

### Stage 2: Joint VL Pre-training

- **Components Involved:**
  - Same as Stage 1, but now the system is being trained on interleaved VL + pure language sequences.
- **Process:**
  - The hybrid vision encoder continues to process visual information.
  - The vision-language adaptor integrates these visual features with the DeepSeek LLM.
  - The model is pre-trained on a dataset that includes both visual-linguistic and purely linguistic sequences.

### Stage 3: Supervised Finetuning

- **Components Involved:**
  - Same as Stage 1 and Stage 2, but now the system is being fine-tuned on VL chat data + pure language chat data.
- **Process:**
  - The hybrid vision encoder processes the new dataset.
  - The vision-language adaptor combines the visual features with the DeepSeek LLM.
  - The model undergoes supervised finetuning to improve its performance on specific tasks, such as understanding and generating responses in VL chat contexts.

### Summary

The overall process involves:

1. **Training the VL Adaptor:** Using image-text pairs to train the vision-language adaptor and hybrid vision encoder.
2. **Joint Pre-training:** Integrating visual and linguistic information to pre-train the model on a mixed dataset.
3. **Supervised Finetuning:** Fine-tuning the model on specialized VL chat data to enhance its capabilities in handling conversational tasks.

Run time and cost

This model costs approximately $0.00098 to run on Replicate, or 1020 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 1 second.
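
For reference, here is a minimal sketch of calling this model from the Replicate Python client. It assumes `pip install replicate` and a `REPLICATE_API_TOKEN` set in the environment; the `prompt` and `temperature` key names and the unpinned model reference are assumptions based on the input schema above, so check the model's API page for the exact schema and version before use.

```python
# Minimal sketch: run deepseek-ai/deepseek-vl2-small via the Replicate Python client.
# Assumes `pip install replicate` and REPLICATE_API_TOKEN in the environment.
# Input key names below mirror the schema shown above; verify them on the model's API tab.
import replicate

output = replicate.run(
    "deepseek-ai/deepseek-vl2-small",
    input={
        "image": open("diagram.png", "rb"),            # required: input image file
        "prompt": "Describe this training pipeline.",  # required: text prompt (assumed key name)
        "temperature": 0.7,                            # keep T <= 0.7 (see Quick Start notes)
    },
)
print(output)
```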

Readme

1. Introduction

Introducing DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL. DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models.

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Github Repository

Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan (* Equal Contribution, ** Project Lead, *** Corresponding author)

2. Model Summary

DeepSeek-VL2-small is built on DeepSeekMoE-16B.

3. Quick Start

Notifications

  1. We suggest using a temperature T <= 0.7 when sampling; we observe that higher temperatures reduce generation quality.
  2. To keep the number of tokens in the context window manageable, we apply a dynamic tiling strategy when there are <= 2 images. When there are >= 3 images, we pad the images to 384×384 and use them directly as inputs, without tiling (see the sketch after this list).
  3. The main difference between DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2 is the base LLM.
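
As an illustration of the padding described in note 2, here is a minimal sketch (not the repository's actual preprocessing code) that letterboxes an image onto a 384×384 canvas with Pillow; the fill color and resampling filter are assumptions.

```python
# Illustrative sketch only: pad/letterbox an image to 384x384, as described in note 2.
# This is NOT the repository's preprocessing code; fill color and resampling are assumptions.
from PIL import Image

def pad_to_square(img: Image.Image, size: int = 384, fill=(127, 127, 127)) -> Image.Image:
    # Scale the longer side down to `size`, preserving aspect ratio.
    scale = size / max(img.width, img.height)
    resized = img.resize(
        (max(1, round(img.width * scale)), max(1, round(img.height * scale))),
        Image.BICUBIC,
    )
    # Paste the resized image onto a square canvas, centered.
    canvas = Image.new("RGB", (size, size), fill)
    canvas.paste(resized, ((size - resized.width) // 2, (size - resized.height) // 2))
    return canvas

padded = pad_to_square(Image.open("example.jpg").convert("RGB"))
padded.save("example_384.png")
```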

Gradio Demo (TODO)

4. License

This code repository is licensed under the MIT License. The use of DeepSeek-VL2 models is subject to the DeepSeek Model License. The DeepSeek-VL2 series supports commercial use.

5. Citation

@misc{wu2024deepseekvl2mixtureofexpertsvisionlanguagemodels,
      title={DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding}, 
      author={Zhiyu Wu and Xiaokang Chen and Zizheng Pan and Xingchao Liu and Wen Liu and Damai Dai and Huazuo Gao and Yiyang Ma and Chengyue Wu and Bingxuan Wang and Zhenda Xie and Yu Wu and Kai Hu and Jiawei Wang and Yaofeng Sun and Yukun Li and Yishi Piao and Kang Guan and Aixin Liu and Xin Xie and Yuxiang You and Kai Dong and Xingkai Yu and Haowei Zhang and Liang Zhao and Yisong Wang and Chong Ruan},
      year={2024},
      eprint={2412.10302},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.10302}, 
}

6. Contact

If you have any questions, please raise an issue or contact us at service@deepseek.com.