deepseek-ai / deepseek-vl-7b-base

DeepSeek-VL: An open-source Vision-Language Model designed for real-world vision and language understanding applications


Input

  • image (file, required): Input image.
  • string: Input prompt. Default: "Describe this image".
  • integer: Maximum number of tokens to generate. Default: 512.
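
For programmatic use, the snippet below is a minimal sketch built on the official replicate Python client. The version identifier is omitted, and the names of the prompt and token-limit inputs ("prompt", "max_new_tokens") are assumptions inferred from the labels above; check the model's API tab on Replicate for the exact parameter names and current version.

import replicate  # pip install replicate; requires REPLICATE_API_TOKEN in the environment

output = replicate.run(
    "deepseek-ai/deepseek-vl-7b-base",       # append ":<version-hash>" from the model page if needed
    input={
        "image": open("diagram.png", "rb"),  # Input image (required file)
        "prompt": "Describe this image",     # Input prompt (assumed name; default shown above)
        "max_new_tokens": 512,               # Maximum number of tokens to generate (assumed name)
    },
)
print(output)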

Output

The image depicts a three-stage process for training a vision-language model. 1. Stage 1: Training VL Adapter: In this stage, a vision-language adapter is trained using supervised fine-tuning. The adapter is trained on image-text pairs and pure language sequences. 2. Stage 2: Joint VL Pre-training: In this stage, a joint vision-language model is pre-trained using self-supervised learning. The model is trained on image-text pairs and pure language sequences. 3. Stage 3: Supervised Fine-tuning: In this stage, the model is fine-tuned on supervised tasks using image-text pairs and pure language sequences. The model is trained using a hybrid vision-language adapter, which combines a vision-language adapter with a language model. The model is trained on a variety of tasks, including image captioning, visual question answering, and visual reasoning. The model is able to understand the visual content of an image and generate a natural language description or answer.

Run time and cost

This model costs approximately $0.0076 per run on Replicate (about 131 runs per $1), though this varies depending on your inputs. It is also open source, and you can run it on your own computer with Docker; a sketch of calling a locally running copy appears below.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 8 seconds, although prediction time varies significantly based on the inputs.
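
The following is a rough sketch of calling a locally running copy through Cog's HTTP prediction endpoint, assuming the container has been started per Replicate's Docker instructions and is listening on localhost:5000. The input keys mirror the assumptions above, and passing the image as a base64 data URI follows Cog's usual convention for file inputs.

import base64
import requests

# Encode the local image as a data URI so it can be sent in the JSON request body.
with open("diagram.png", "rb") as f:
    image_uri = "data:image/png;base64," + base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:5000/predictions",      # Cog's prediction endpoint inside the container
    json={
        "input": {
            "image": image_uri,
            "prompt": "Describe this image",  # assumed parameter name
            "max_new_tokens": 512,            # assumed parameter name
        }
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["output"])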

Readme

DeepSeek-VL-7b-base

Introduction

Introducing DeepSeek-VL, an open-source Vision-Language (VL) model designed for real-world vision and language understanding applications. DeepSeek-VL possesses general multimodal understanding capabilities and can process logical diagrams, web pages, formulas, scientific literature, natural images, and embodied intelligence in complex scenarios.

DeepSeek-VL: Towards Real-World Vision-Language Understanding

GitHub Repository

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, Chong Ruan (*Equal Contribution, Project Leader)

Model Summary

DeepSeek-VL-7b-base uses SigLIP-L and SAM-B as a hybrid vision encoder supporting 1024 x 1024 image input, and is built on DeepSeek-LLM-7b-base, which is trained on a corpus of approximately 2T text tokens. The whole DeepSeek-VL-7b-base model is then trained on around 400B vision-language tokens.
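
To run the weights outside Replicate, the GitHub repository documents a transformers-based quick start. The sketch below follows that pattern; the deepseek_vl class and helper names (VLChatProcessor, load_pil_images), the <image_placeholder> token, and the conversation format are taken from the repository's examples and should be verified against the current code before use.

import torch
from transformers import AutoModelForCausalLM
from deepseek_vl.models import VLChatProcessor    # provided by the DeepSeek-VL GitHub repo
from deepseek_vl.utils.io import load_pil_images

model_path = "deepseek-ai/deepseek-vl-7b-base"
processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = processor.tokenizer

model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
model = model.to(torch.bfloat16).cuda().eval()

conversation = [
    {"role": "User", "content": "<image_placeholder>Describe this image.", "images": ["diagram.png"]},
    {"role": "Assistant", "content": ""},
]

# The hybrid vision encoder and the language model share one embedding sequence:
# image patches are embedded first, then generation runs on the joint sequence.
pil_images = load_pil_images(conversation)
inputs = processor(conversations=conversation, images=pil_images, force_batchify=True).to(model.device)
inputs_embeds = model.prepare_inputs_embeds(**inputs)

outputs = model.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
)
print(tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True))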

License

This code repository is licensed under the MIT License. Use of the DeepSeek-VL Base/Chat models is subject to the DeepSeek Model License. The DeepSeek-VL series (including Base and Chat) supports commercial use.

Citation

@misc{lu2024deepseekvl,
      title={DeepSeek-VL: Towards Real-World Vision-Language Understanding}, 
      author={Haoyu Lu and Wen Liu and Bo Zhang and Bingxuan Wang and Kai Dong and Bo Liu and Jingxiang Sun and Tongzheng Ren and Zhuoshu Li and Yaofeng Sun and Chengqi Deng and Hanwei Xu and Zhenda Xie and Chong Ruan},
      year={2024},
      eprint={2403.05525},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}

Contact

If you have any questions, please raise an issue or contact us at service@deepseek.com.