Report output

Machine learning models are powerful, but they're also unmoderated. Sometimes a model's output is incorrect, broken, or buggy; it might raise copyright issues or contain inappropriate content.

Whatever the reason, please use this form to report to us when something is wrong with the output of a model you've run.

We'll investigate the reported output and take appropriate action. We may flag this output to the model author if we think they should be aware.

Your report

Output

The figure illustrates the three-stage training process for a Vision-Language (VL) model using the DeepSeek LLM. Here's a detailed description of each stage:

### Stage 1: Training VL Adaptor

- **Components Involved:**
  - **DeepSeek LLM:** This is the large language model that serves as the backbone for the entire process.
  - **Vision-Language Adaptor:** This component is responsible for adapting the vision and language models to work together effectively.
  - **Hybrid Vision Encoder:** This encoder processes visual data and converts it into a format suitable for the model. It uses two specific models, SAM-B and SigLIP-L, which are pre-trained on image-text pairs.
  - **Image-Text Pairs:** These pairs are used to train the hybrid vision encoder. The encoder learns to understand and generate text based on visual inputs.
- **Process:**
  - The hybrid vision encoder is trained using image-text pairs. This training helps the encoder learn how to map images to corresponding text descriptions.

### Stage 2: Joint VL Pre-training

- **Components Involved:**
  - **DeepSeek LLM:** Continues to be the central model.
  - **Vision-Language Adaptor:** Continues to adapt the vision and language components.
  - **Hybrid Vision Encoder:** Continues to process visual data.
  - **Interleaved VL + Pure Language Sequences:** In this stage, both vision-language sequences and pure language sequences are interleaved. This means that the model is exposed to both types of data during training, allowing it to learn from a more diverse set of examples.
- **Process:**
  - The model is pre-trained using a combination of vision-language sequences and pure language sequences. This joint training helps the model develop a better understanding of both visual and textual information.

### Stage 3: Supervised Finetuning

- **Components Involved:**
  - **DeepSeek LLM:** Continues to be the central model.
  - **Vision-Language Adaptor:** Continues to refine the adaptation between vision and language.
  - **Hybrid Vision Encoder:** Continues to process visual data.
  - **VL Chat Data + Pure Language Chat Data:** In this final stage, the model is fine-tuned using both vision-language chat data and pure language chat data. This step focuses on improving the model's performance specifically in generating responses to chat queries.
- **Process:**
  - The model undergoes supervised finetuning with a focus on chat tasks. The finetuning process involves adjusting the model parameters to optimize its performance in generating coherent and contextually relevant responses to user queries.

### Summary

The figure outlines a comprehensive approach to training a Vision-Language model using the DeepSeek LLM. Starting with training the VL adaptor and hybrid vision encoder on image-text pairs, followed by joint pre-training with interleaved sequences, and finally, supervised finetuning with chat data, the model is progressively refined to enhance its ability to handle complex vision-language tasks.
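For context, here is a minimal sketch of the kind of staged freeze/train schedule the output above describes. The module names (`VLModel`, `configure_stage`, `HybridVisionEncoder`, etc.) and the choice of which components are frozen at each stage are illustrative assumptions, not DeepSeek-VL's actual code or the reported output's claims.

```python
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable


class VLModel(nn.Module):
    """Placeholder wrapper: vision encoder -> adaptor -> language model."""

    def __init__(self, vision_encoder: nn.Module, adaptor: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a hybrid SAM-B + SigLIP-L encoder
        self.adaptor = adaptor                # vision-language adaptor
        self.llm = llm                        # LLM backbone (DeepSeek LLM in the description)


def configure_stage(model: VLModel, stage: int) -> None:
    """Apply an assumed trainability pattern for each of the three stages."""
    if stage == 1:
        # Stage 1: train the VL adaptor on image-text pairs;
        # keep the vision encoder and the LLM frozen (assumption).
        set_trainable(model.vision_encoder, False)
        set_trainable(model.adaptor, True)
        set_trainable(model.llm, False)
    elif stage == 2:
        # Stage 2: joint VL pre-training on interleaved VL + pure-language
        # sequences; adaptor and LLM updated together (assumption).
        set_trainable(model.vision_encoder, False)
        set_trainable(model.adaptor, True)
        set_trainable(model.llm, True)
    elif stage == 3:
        # Stage 3: supervised finetuning on VL chat data + pure-language
        # chat data; all components tuned (assumption).
        set_trainable(model.vision_encoder, True)
        set_trainable(model.adaptor, True)
        set_trainable(model.llm, True)
    else:
        raise ValueError(f"Unknown stage: {stage}")
```

The only structural idea this sketch captures is that each stage reuses the same three components and changes which of them receive gradient updates; the actual data mixtures and hyperparameters are outside its scope.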

Give us a few details about how this output is unsafe or broken.