deepseek-ai/deepseek-vl2

DeepSeek-VL2 is an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL.
Prediction
Hardware: A100 (80GB)
Output
<|ref|>Watermelon slices.<|/ref|><|det|>[[0, 0, 998, 999]]<|/det|>
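The grounding output above pairs each referenced phrase, wrapped in `<|ref|>`...`<|/ref|>`, with a bounding box wrapped in `<|det|>`...`<|/det|>`. Here is a minimal sketch of how you might parse such a response in Python. It assumes each box is `[x1, y1, x2, y2]` on a 0-999 normalized grid (suggested by full-image boxes like `[[0, 0, 998, 999]]`), so double-check the scaling against the model's documentation before relying on it.

```python
import json
import re

# One <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2], ...]<|/det|> pair.
GROUNDING_RE = re.compile(
    r"<\|ref\|>(?P<label>.+?)<\|/ref\|>\s*"
    r"<\|det\|>(?P<boxes>\[\[.+?\]\])<\|/det\|>"
)

def parse_grounding(output: str, width: int, height: int):
    """Return (label, [x1, y1, x2, y2]) pairs scaled to pixel coordinates.

    Assumes boxes are normalized to a 0-999 grid, as the example outputs on
    this page suggest (e.g. [[0, 0, 998, 999]] for a full-image box).
    """
    pairs = []
    for m in GROUNDING_RE.finditer(output):
        for x1, y1, x2, y2 in json.loads(m.group("boxes")):
            pairs.append((m.group("label").strip(), [
                round(x1 / 999 * width), round(y1 / 999 * height),
                round(x2 / 999 * width), round(y2 / 999 * height),
            ]))
    return pairs

print(parse_grounding(
    "<|ref|>Watermelon slices.<|/ref|><|det|>[[0, 0, 998, 999]]<|/det|>",
    width=1280, height=960,
))
# [('Watermelon slices.', [0, 0, 1279, 960])]
```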

Prediction
Hardware: A100 (80GB)
Output
<|ref|>The giraffe at the back.<|/ref|><|det|>[[0, 0, 998, 999]]<|/det|>

Prediction by @lucataco
Hardware: A100 (80GB)
Output
The figure illustrates a three-stage process for training and fine-tuning a Vision-Language (VL) model using the DeepSeek LLM framework. Here's a detailed description of each stage:

### Stage 1: Training VL Adaptor

- **Components Involved**:
  - **DeepSeek LLM**: This is the foundational language model used throughout the stages.
  - **Vision-Language Adaptor**: This component is responsible for adapting the language model to handle vision-language tasks.
  - **Hybrid Vision Encoder**: This encoder processes visual data and combines it with textual information.
  - **SAM-B and SigLIP-L**: These are specific models or components within the hybrid vision encoder that contribute to processing image-text pairs.
- **Process**:
  - The DeepSeek LLM is used as the base model.
  - The Vision-Language Adaptor is trained on image-text pairs, which involves aligning visual and textual information.
  - The Hybrid Vision Encoder processes these image-text pairs, integrating both visual and textual features.

### Stage 2: Joint VL Pre-training

- **Components Involved**:
  - **DeepSeek LLM**: Continues to be the core language model.
  - **Vision-Language Adaptor**: Further refined through joint pre-training.
  - **Hybrid Vision Encoder**: Enhanced to better handle interleaved vision and language sequences.
  - **SAM-B and SigLIP-L**: Continue to play roles in encoding visual and textual data.
- **Process**:
  - The model undergoes joint pre-training using interleaved vision and language sequences.
  - This step helps the model learn to effectively combine and process both types of data simultaneously.
  - The Vision-Language Adaptor and Hybrid Vision Encoder are further optimized during this phase.

### Stage 3: Supervised Finetuning

- **Components Involved**:
  - **DeepSeek LLM**: Now fully integrated into the VL system.
  - **Vision-Language Adaptor**: Fully adapted and ready for specific tasks.
  - **Hybrid Vision Encoder**: Finalized and capable of handling complex VL tasks.
  - **SAM-B and SigLIP-L**: Continue their roles in encoding and processing data.
- **Process**:
  - The model is fine-tuned using VL chat data and pure language chat data.
  - This supervised finetuning phase refines the model's performance on specific VL tasks, such as conversational understanding and generation.
  - The Vision-Language Adaptor and Hybrid Vision Encoder are fine-tuned to ensure they work seamlessly together for the desired outcomes.

Overall, this three-stage process leverages the strengths of the DeepSeek LLM and specialized adaptors to create a robust Vision-Language model capable of handling various tasks involving both visual and textual data.
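As a rough illustration of the pipeline described above, here is a schematic sketch in Python. The component and data names follow the figure description; the classes, the `train()` helper, and the choice of which components are updated at each stage are placeholders for illustration, not DeepSeek's released training code.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str


def train(stage: str, trainable: list, data: str) -> None:
    # Placeholder: a real implementation would run an optimization loop here.
    names = ", ".join(c.name for c in trainable)
    print(f"{stage}: update [{names}] using {data}")


# Components named in the figure.
hybrid_encoder = Component("Hybrid Vision Encoder (SAM-B + SigLIP-L)")
vl_adaptor = Component("Vision-Language Adaptor")
deepseek_llm = Component("DeepSeek LLM")

# Stage 1: train the VL adaptor on image-text pairs to align modalities.
train("Stage 1: Training VL Adaptor", [vl_adaptor], "image-text pairs")

# Stage 2: joint VL pre-training on interleaved vision and language sequences
# (which components are unfrozen here is an assumption).
train("Stage 2: Joint VL Pre-training", [vl_adaptor, deepseek_llm],
      "interleaved vision and language sequences")

# Stage 3: supervised finetuning on VL chat data and pure language chat data.
train("Stage 3: Supervised Finetuning",
      [hybrid_encoder, vl_adaptor, deepseek_llm],
      "VL chat data + pure language chat data")
```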

Prediction
Hardware: A100 (80GB)
Output
You can simply take <|ref|>the garbage<|/ref|><|det|>[[0, 0, 995, 993]]<|/det|> and put it in <|ref|>the trashcan<|/ref|><|det|>[[0, 0, 995, 993]]<|/det|> . It's right there for you to dispose of.
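To visualize a grounded response like this one, you can reuse the `parse_grounding` helper sketched earlier and draw the boxes with Pillow. The image path below is a placeholder for whatever image you sent to the model.

```python
from PIL import Image, ImageDraw

# Placeholder path: use the image you sent to the model.
image = Image.open("kitchen.jpg")
draw = ImageDraw.Draw(image)

output = (
    "You can simply take <|ref|>the garbage<|/ref|>"
    "<|det|>[[0, 0, 995, 993]]<|/det|> and put it in <|ref|>the trashcan<|/ref|>"
    "<|det|>[[0, 0, 995, 993]]<|/det|> ."
)

# parse_grounding is the helper sketched earlier on this page.
for label, (x1, y1, x2, y2) in parse_grounding(output, *image.size):
    draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
    draw.text((x1 + 4, y1 + 4), label, fill="red")

image.save("grounded.png")
```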
Want to make some of these yourself?
Run this model
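Below is a minimal sketch of calling this model from Python with the Replicate client. The version hash matches the predictions above, but the input field names (`image`, `prompt`) are assumptions; check the model's API schema on Replicate for the authoritative inputs.

```python
# pip install replicate  (and set REPLICATE_API_TOKEN in your environment)
import replicate

# The input field names ("image", "prompt") are assumptions -- confirm them
# against this model's API schema on Replicate before use.
output = replicate.run(
    "deepseek-ai/deepseek-vl2:e5caf557dd9e5dcee46442e1315291ef1867f027991ede8ff95e304d4f734200",
    input={
        "image": open("watermelon.jpg", "rb"),
        "prompt": "Describe this image.",
    },
)
print(output)
```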