deepseek-ai / deepseek-vl2-small

DeepSeek-VL2-small is part of DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL.

  • Public
  • 683 runs
  • L40S
  • GitHub
  • Weights
  • Paper
  • License
Run with an API
  • Prediction

    deepseek-ai/deepseek-vl2-small:965f877546f4f0e3f5ae59847f27079d66da81fff2d87d753d5495e862dcb3a4
    ID: ecz06tf7m5rme0cmypw9te2crm
    Status: Succeeded
    Source: Web
    Hardware: L40S
    Created by @lucataco

    Input

    image: [image]
    top_p: 0.9
    prompt: Find all the <|ref|>watermelon slices<|/ref|>
    temperature: 0.1
    max_length_tokens: 2048
    repetition_penalty: 1.1

    Output

    <|ref|>watermelon slices<|/ref|><|det|>[[159, 327, 408, 615]]<|/det|>
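    The boxes are encoded inline as <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|> spans. Below is a minimal parsing sketch; it assumes (as in DeepSeek-VL-style grounding) that coordinates sit on a normalized 0-999 grid, and the helper name and image size are illustrative:

    ```python
    import json
    import re

    # One span per detection: <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2], ...]<|/det|>
    DET = re.compile(r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(\[\[.*?\]\])<\|/det\|>")

    def parse_detections(text: str, img_w: int, img_h: int):
        """Return (label, (x1, y1, x2, y2)) pairs scaled to pixel coordinates."""
        results = []
        for label, raw in DET.findall(text):
            for x1, y1, x2, y2 in json.loads(raw):  # the box list is valid JSON
                # Assumption: coordinates are normalized to 0-999; scale to pixels.
                results.append((label, (x1 * img_w / 999, y1 * img_h / 999,
                                        x2 * img_w / 999, y2 * img_h / 999)))
        return results

    print(parse_detections(
        "<|ref|>watermelon slices<|/ref|><|det|>[[159, 327, 408, 615]]<|/det|>",
        1024, 768))
    ```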
  • Prediction

    deepseek-ai/deepseek-vl2-small:965f877546f4f0e3f5ae59847f27079d66da81fff2d87d753d5495e862dcb3a4
    ID: ebt05v9rt5rmc0cmypwvmwe5d4
    Status: Succeeded
    Source: Web
    Hardware: L40S

    Input

    image: [image]
    top_p: 0.9
    prompt: <|ref|>The giraffe at the back.<|/ref|>
    temperature: 0.1
    max_length_tokens: 2048
    repetition_penalty: 1.1

    Output

    <|ref|>The giraffe at the back.<|/ref|><|det|>[[104, 289, 357, 613]]<|/det|>
  • Prediction

    deepseek-ai/deepseek-vl2-small:965f877546f4f0e3f5ae59847f27079d66da81fff2d87d753d5495e862dcb3a4
    ID: 1h2sgpbrr9rm80cmypwvz1efd8
    Status: Succeeded
    Source: Web
    Hardware: L40S

    Input

    image: [image]
    top_p: 0.9
    prompt: <|grounding|>I want to throw out the trash now, what should I do?
    temperature: 0.1
    max_length_tokens: 2048
    repetition_penalty: 1.1

    Output

    Well, you can just take it out of your house and place it in the <|ref|>garbage<|/ref|><|det|>[[135, 207, 869, 945]]<|/det|> .
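    Prefixing the prompt with <|grounding|> yields a conversational answer that cites image regions inline. For display you may want the plain sentence plus the boxes separately; a small sketch under the same formatting assumption as above (the helper name is illustrative):

    ```python
    import re

    # Grounded span: <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
    MARKUP = re.compile(r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>\[\[(.*?)\]\]<\|/det\|>")

    def strip_grounding(text: str):
        """Replace each grounded span with its plain label; collect the raw boxes."""
        boxes = []
        def repl(match):
            boxes.append((match.group(1),
                          [int(v) for v in match.group(2).split(",")]))
            return match.group(1)
        return MARKUP.sub(repl, text), boxes

    clean, boxes = strip_grounding(
        "Well, you can just take it out of your house and place it in the "
        "<|ref|>garbage<|/ref|><|det|>[[135, 207, 869, 945]]<|/det|> .")
    print(clean)  # ...place it in the garbage .
    print(boxes)  # [('garbage', [135, 207, 869, 945])]
    ```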
  • Prediction

    deepseek-ai/deepseek-vl2-small:965f877546f4f0e3f5ae59847f27079d66da81fff2d87d753d5495e862dcb3a4
    ID: 9g5fzk28qdrme0cmypz9z43sfc
    Status: Succeeded
    Source: Web
    Hardware: L40S

    Input

    image: [image]
    top_p: 0.9
    prompt: Describe each stage of this <image> in detail
    temperature: 0.1
    max_length_tokens: 2048
    repetition_penalty: 1.1

    Output

    The diagram illustrates a three-stage process for training and fine-tuning a model using a combination of vision-language (VL) adaptation, pre-training, and supervised finetuning. Here's a detailed breakdown of each stage:

    ### Stage 1: Training VL Adaptor
    - **Components Involved:**
      - **DeepSeek LLM:** This is the large language model that will be adapted to work with visual data.
      - **Vision-Language Adaptor:** A module designed to bridge the gap between visual inputs and the language model.
      - **Hybrid Vision Encoder:** This encoder processes visual information from images and text pairs. It consists of two components:
        - **SAM-B:** Likely a component related to spatial attention mechanisms.
        - **SigLIP-L:** Possibly a component related to signal processing or lip reading.
    - **Process:**
      - The hybrid vision encoder takes image-text pairs as input.
      - The vision-language adaptor then uses these processed visual features to interact with the DeepSeek LLM.

    ### Stage 2: Joint VL Pre-training
    - **Components Involved:**
      - Same as Stage 1, but now the system is being trained on interleaved VL + pure language sequences.
    - **Process:**
      - The hybrid vision encoder continues to process visual information.
      - The vision-language adaptor integrates these visual features with the DeepSeek LLM.
      - The model is pre-trained on a dataset that includes both visual-linguistic and purely linguistic sequences.

    ### Stage 3: Supervised Finetuning
    - **Components Involved:**
      - Same as Stage 1 and Stage 2, but now the system is being fine-tuned on VL chat data + pure language chat data.
    - **Process:**
      - The hybrid vision encoder processes the new dataset.
      - The vision-language adaptor combines the visual features with the DeepSeek LLM.
      - The model undergoes supervised finetuning to improve its performance on specific tasks, such as understanding and generating responses in VL chat contexts.

    ### Summary
    The overall process involves:
    1. **Training the VL Adaptor:** Using image-text pairs to train the vision-language adaptor and hybrid vision encoder.
    2. **Joint Pre-training:** Integrating visual and linguistic information to pre-train the model on a mixed dataset.
    3. **Supervised Finetuning:** Fine-tuning the model on specialized VL chat data to enhance its capabilities in handling conversational tasks.

Want to make some of these yourself?

Run this model
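A minimal sketch with the Replicate Python client, using the version hash and input schema shown in the predictions above; the local filename is illustrative:

```python
# pip install replicate; expects REPLICATE_API_TOKEN in the environment.
import replicate

output = replicate.run(
    "deepseek-ai/deepseek-vl2-small:965f877546f4f0e3f5ae59847f27079d66da81fff2d87d753d5495e862dcb3a4",
    input={
        "image": open("fruit.jpg", "rb"),  # illustrative local image
        "prompt": "Find all the <|ref|>watermelon slices<|/ref|>",
        "top_p": 0.9,
        "temperature": 0.1,
        "max_length_tokens": 2048,
        "repetition_penalty": 1.1,
    },
)
print(output)  # e.g. <|ref|>watermelon slices<|/ref|><|det|>[[...]]<|/det|>
```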