deepseek-ai/deepseek-vl2

DeepSeek-VL2 is an advanced series of large Mixture-of-Experts (MoE) vision-language models that significantly improves on its predecessor, DeepSeek-VL.

Public · 68.8K runs
  • Prediction

    deepseek-ai/deepseek-vl2:e5caf557dd9e5dcee46442e1315291ef1867f027991ede8ff95e304d4f734200
    ID: k3ss3jc7pdrj60cmypzsr8za48
    Status: Succeeded
    Source: Web
    Hardware: A100 (80GB)

    Input

    image: (image file)
    top_p: 0.9
    prompt: <|ref|>Watermelon slices.<|/ref|>
    temperature: 0.1
    max_length_tokens: 2048
    repetition_penalty: 1.1

    Output

    <|ref|>Watermelon slices.<|/ref|><|det|>[[0, 0, 998, 999]]<|/det|>
  • Prediction

    deepseek-ai/deepseek-vl2:e5caf557dd9e5dcee46442e1315291ef1867f027991ede8ff95e304d4f734200
    ID: exksbr65d9rj60cmypztam6hnr
    Status: Succeeded
    Source: Web
    Hardware: A100 (80GB)

    Input

    image: (image file)
    top_p: 0.9
    prompt: <|ref|>The giraffe at the back.<|/ref|>
    temperature: 0.1
    max_length_tokens: 2048
    repetition_penalty: 1.1

    Output

    <|ref|>The giraffe at the back.<|/ref|><|det|>[[0, 0, 998, 999]]<|/det|>
  • Prediction

    deepseek-ai/deepseek-vl2:e5caf557dd9e5dcee46442e1315291ef1867f027991ede8ff95e304d4f734200
    ID: ad7s5gashdrj00cmyq083sgkp8
    Status: Succeeded
    Source: Web
    Hardware: A100 (80GB)
    Created by @lucataco

    Input

    image: (image file)
    top_p: 0.9
    prompt: Describe each stage of this <image> in detail
    temperature: 0.1
    max_length_tokens: 2048
    repetition_penalty: 1.1

    Output

    The figure illustrates a three-stage process for training and fine-tuning a Vision-Language (VL) model using the DeepSeek LLM framework. Here's a detailed description of each stage:

    ### Stage 1: Training VL Adaptor
    - **Components Involved**:
      - **DeepSeek LLM**: This is the foundational language model used throughout the stages.
      - **Vision-Language Adaptor**: This component is responsible for adapting the language model to handle vision-language tasks.
      - **Hybrid Vision Encoder**: This encoder processes visual data and combines it with textual information.
      - **SAM-B and SigLIP-L**: These are specific models or components within the hybrid vision encoder that contribute to processing image-text pairs.
    - **Process**:
      - The DeepSeek LLM is used as the base model.
      - The Vision-Language Adaptor is trained on image-text pairs, which involves aligning visual and textual information.
      - The Hybrid Vision Encoder processes these image-text pairs, integrating both visual and textual features.

    ### Stage 2: Joint VL Pre-training
    - **Components Involved**:
      - **DeepSeek LLM**: Continues to be the core language model.
      - **Vision-Language Adaptor**: Further refined through joint pre-training.
      - **Hybrid Vision Encoder**: Enhanced to better handle interleaved vision and language sequences.
      - **SAM-B and SigLIP-L**: Continue to play roles in encoding visual and textual data.
    - **Process**:
      - The model undergoes joint pre-training using interleaved vision and language sequences.
      - This step helps the model learn to effectively combine and process both types of data simultaneously.
      - The Vision-Language Adaptor and Hybrid Vision Encoder are further optimized during this phase.

    ### Stage 3: Supervised Finetuning
    - **Components Involved**:
      - **DeepSeek LLM**: Now fully integrated into the VL system.
      - **Vision-Language Adaptor**: Fully adapted and ready for specific tasks.
      - **Hybrid Vision Encoder**: Finalized and capable of handling complex VL tasks.
      - **SAM-B and SigLIP-L**: Continue their roles in encoding and processing data.
    - **Process**:
      - The model is fine-tuned using VL chat data and pure language chat data.
      - This supervised finetuning phase refines the model's performance on specific VL tasks, such as conversational understanding and generation.
      - The Vision-Language Adaptor and Hybrid Vision Encoder are fine-tuned to ensure they work seamlessly together for the desired outcomes.

    Overall, this three-stage process leverages the strengths of the DeepSeek LLM and specialized adaptors to create a robust Vision-Language model capable of handling various tasks involving both visual and textual data.
  • Prediction

    deepseek-ai/deepseek-vl2:e5caf557dd9e5dcee46442e1315291ef1867f027991ede8ff95e304d4f734200
    ID: 0rj52k0asxrj00cmyq095jsbr8
    Status: Succeeded
    Source: Web
    Hardware: A100 (80GB)

    Input

    image: (image file)
    top_p: 0.9
    prompt: <|grounding|>I want to throw out the trash now, what should I do?
    temperature: 0.1
    max_length_tokens: 2048
    repetition_penalty: 1.1

    Output

    You can simply take <|ref|>the garbage<|/ref|><|det|>[[0, 0, 995, 993]]<|/det|> and put it in <|ref|>the trashcan<|/ref|><|det|>[[0, 0, 995, 993]]<|/det|> . It's right there for you to dispose of.
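The grounding examples above return their answers inline as <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2], ...]<|/det|> spans. Below is a minimal sketch for pulling those spans out of a response and drawing them back onto the input image with Pillow. The regular expression, the placeholder file names, and the assumption that the coordinates sit on a 0-999 normalized grid (and are therefore rescaled by the image's pixel size) are illustrative choices, not something stated on this page.

```python
import json
import re

from PIL import Image, ImageDraw

# Matches <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2], ...]<|/det|> spans.
DET_PATTERN = re.compile(r"<\|ref\|>(.+?)<\|/ref\|><\|det\|>(\[\[.+?\]\])<\|/det\|>")


def parse_grounding(text, width, height):
    """Return (label, pixel_box) pairs.

    Assumes boxes are reported on a 0-999 normalized grid and rescales
    them to the given pixel dimensions.
    """
    pairs = []
    for label, raw_boxes in DET_PATTERN.findall(text):
        for x1, y1, x2, y2 in json.loads(raw_boxes):
            pairs.append((label.strip(),
                          (x1 / 999 * width, y1 / 999 * height,
                           x2 / 999 * width, y2 / 999 * height)))
    return pairs


# Response text from the last example above.
response = (
    "You can simply take <|ref|>the garbage<|/ref|><|det|>[[0, 0, 995, 993]]<|/det|> "
    "and put it in <|ref|>the trashcan<|/ref|><|det|>[[0, 0, 995, 993]]<|/det|> ."
)

image = Image.open("kitchen.jpg").convert("RGB")  # placeholder input image
draw = ImageDraw.Draw(image)
for label, box in parse_grounding(response, *image.size):
    draw.rectangle(box, outline="red", width=3)
    draw.text((box[0] + 4, box[1] + 4), label, fill="red")
image.save("grounded.jpg")
```

The same parsing applies to the single-box <|ref|> outputs in the first two examples.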

Want to make some of these yourself?

Run this model
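Here is a minimal sketch of reproducing the first example with the Replicate Python client, assuming the replicate package is installed and REPLICATE_API_TOKEN is set in the environment; the image path is a placeholder, and the input names and version hash are taken from the predictions above.

```python
import replicate

output = replicate.run(
    "deepseek-ai/deepseek-vl2:e5caf557dd9e5dcee46442e1315291ef1867f027991ede8ff95e304d4f734200",
    input={
        "image": open("fruit.jpg", "rb"),  # placeholder path; a public URL string also works
        "prompt": "<|ref|>Watermelon slices.<|/ref|>",
        "top_p": 0.9,
        "temperature": 0.1,
        "max_length_tokens": 2048,
        "repetition_penalty": 1.1,
    },
)
print(output)
# Expected shape, based on the first example:
# <|ref|>Watermelon slices.<|/ref|><|det|>[[0, 0, 998, 999]]<|/det|>
```

Swapping the prompt for a <|grounding|>-prefixed question or a plain description request reproduces the other examples. Depending on the client version the result may come back as a single string or as a stream of string chunks, so joining the pieces before parsing boxes is a safe default.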