chenxwh / deepseek-vl2

Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

  • Public
  • 1.1K runs
  • L40S
  • Prediction

    chenxwh/deepseek-vl2:8ea887897e772107ce53f3a7fa4850e78ae88b2b73ff854b4700db9f0d59c7cb
    ID
    6gx56twc39rme0cm2ccsdrrmcw
    Status
    Succeeded
    Source
    Web
    Hardware
    L40S

    Input

    text
    Describe each stage of this image in detail
    top_p
    0.9
    image1
    image1
    temperature
    0.1
    max_new_tokens
    2048
    repetition_penalty
    1.1

    Output

    The figure illustrates the three-stage training process for a Vision-Language (VL) model using the DeepSeek LLM. Here's a detailed description of each stage:

    ### Stage 1: Training VL Adaptor

    - **Components Involved:**
      - **DeepSeek LLM:** This is the large language model that serves as the backbone for the entire process.
      - **Vision-Language Adaptor:** This component is responsible for adapting the vision and language models to work together effectively.
      - **Hybrid Vision Encoder:** This encoder processes visual data and converts it into a format suitable for the model. It uses two specific models, SAM-B and SigLIP-L, which are pre-trained on image-text pairs.
      - **Image-Text Pairs:** These pairs are used to train the hybrid vision encoder. The encoder learns to understand and generate text based on visual inputs.
    - **Process:**
      - The hybrid vision encoder is trained using image-text pairs. This training helps the encoder learn how to map images to corresponding text descriptions.

    ### Stage 2: Joint VL Pre-training

    - **Components Involved:**
      - **DeepSeek LLM:** Continues to be the central model.
      - **Vision-Language Adaptor:** Continues to adapt the vision and language components.
      - **Hybrid Vision Encoder:** Continues to process visual data.
      - **Interleaved VL + Pure Language Sequences:** In this stage, both vision-language sequences and pure language sequences are interleaved. This means that the model is exposed to both types of data during training, allowing it to learn from a more diverse set of examples.
    - **Process:**
      - The model is pre-trained using a combination of vision-language sequences and pure language sequences. This joint training helps the model develop a better understanding of both visual and textual information.

    ### Stage 3: Supervised Finetuning

    - **Components Involved:**
      - **DeepSeek LLM:** Continues to be the central model.
      - **Vision-Language Adaptor:** Continues to refine the adaptation between vision and language.
      - **Hybrid Vision Encoder:** Continues to process visual data.
      - **VL Chat Data + Pure Language Chat Data:** In this final stage, the model is fine-tuned using both vision-language chat data and pure language chat data. This step focuses on improving the model's performance specifically in generating responses to chat queries.
    - **Process:**
      - The model undergoes supervised finetuning with a focus on chat tasks. The finetuning process involves adjusting the model parameters to optimize its performance in generating coherent and contextually relevant responses to user queries.

    ### Summary

    The figure outlines a comprehensive approach to training a Vision-Language model using the DeepSeek LLM. Starting with training the VL adaptor and hybrid vision encoder on image-text pairs, followed by joint pre-training with interleaved sequences, and finally, supervised finetuning with chat data, the model is progressively refined to enhance its ability to handle complex vision-language tasks.<|end▁of▁sentence|>
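The input fields listed for this prediction (`text`, `image1`, `top_p`, `temperature`, `max_new_tokens`, `repetition_penalty`) could be assembled programmatically. The following is a minimal sketch, not the page's own code: `build_input` and `run_prediction` are hypothetical helper names, the image URL in the usage note is a placeholder, and the shape of the value returned by the model is not confirmed by this page. The call to `replicate.run` assumes the third-party `replicate` Python client is installed and an API token is configured.

```python
from typing import Any, Dict

# Model reference copied from the prediction entries on this page.
MODEL_REF = (
    "chenxwh/deepseek-vl2:"
    "8ea887897e772107ce53f3a7fa4850e78ae88b2b73ff854b4700db9f0d59c7cb"
)


def build_input(text: str, *image_urls: str) -> Dict[str, Any]:
    """Assemble an input payload using the sampling values shown on this page.

    Images are numbered image1, image2, ... in the order given (hypothetical helper).
    """
    payload: Dict[str, Any] = {
        "text": text,
        "top_p": 0.9,
        "temperature": 0.1,
        "max_new_tokens": 2048,
        "repetition_penalty": 1.1,
    }
    for index, url in enumerate(image_urls, start=1):
        payload[f"image{index}"] = url
    return payload


def run_prediction(payload: Dict[str, Any]) -> Any:
    """Send the payload to Replicate (assumes the `replicate` client and a token)."""
    import replicate  # imported lazily so build_input works without the client

    return replicate.run(MODEL_REF, input=payload)
```

For example, `build_input("Describe each stage of this image in detail", "https://example.com/figure.png")` yields a payload with a single `image1` entry; passing two or three URLs adds `image2` and `image3`, matching the multi-image predictions further down.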
  • Prediction

    chenxwh/deepseek-vl2:8ea887897e772107ce53f3a7fa4850e78ae88b2b73ff854b4700db9f0d59c7cb
    ID
    8knq41b1zxrme0cm2cfb8jxneg
    Status
    Succeeded
    Source
    Web
    Hardware
    L40S

    Input

    text
    <|ref|>The giraffe at the back.<|/ref|>
    top_p
    0.9
    image1
    image1
    temperature
    0.1
    max_new_tokens
    2048
    repetition_penalty
    1.1

    Output

    img_out

    img_out

    text_out

    <|ref|>The giraffe at the back.<|/ref|><|det|>[[583, 271, 942, 888]]<|/det|><|end▁of▁sentence|>
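The `text_out` above wraps the referred phrase in `<|ref|>…<|/ref|>` and its box list in `<|det|>…<|/det|>`. A minimal sketch of pulling those pairs out with a regular expression follows; `parse_grounding` is a hypothetical helper name, and the sketch assumes the box list is always valid JSON, as it is in the outputs on this page.

```python
import json
import re
from typing import List, Tuple

# Matches one <|ref|>label<|/ref|><|det|>[[...]]<|/det|> pair.
GROUNDING_RE = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>", re.DOTALL
)


def parse_grounding(text_out: str) -> List[Tuple[str, List[List[int]]]]:
    """Return (label, boxes) pairs found in a grounded text output."""
    pairs = []
    for label, raw_boxes in GROUNDING_RE.findall(text_out):
        # Assumption: the det payload is a JSON list of 4-integer lists.
        pairs.append((label, json.loads(raw_boxes)))
    return pairs


sample = (
    "<|ref|>The giraffe at the back.<|/ref|>"
    "<|det|>[[583, 271, 942, 888]]<|/det|><|end▁of▁sentence|>"
)
print(parse_grounding(sample))
# prints [('The giraffe at the back.', [[583, 271, 942, 888]])]
```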
  • Prediction

    chenxwh/deepseek-vl2:8ea887897e772107ce53f3a7fa4850e78ae88b2b73ff854b4700db9f0d59c7cb
    ID
    g7g5qnefrhrma0cm2cft94nwvw
    Status
    Succeeded
    Source
    Web
    Hardware
    L40S

    Input

    text
    <|grounding|>In the first image, an object within the red rectangle is marked. Locate the object of the same category in the second image.
    top_p
    0.9
    image1
    image1
    image2
    image2
    temperature
    0.1
    max_new_tokens
    2048
    repetition_penalty
    1.1

    Output

    img_out

    img_out

    text_out

    <|ref|>bicolor cat<|/ref|><|det|>[[81, 273, 906, 820]]<|/det|><|end▁of▁sentence|>
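To draw a box such as `[81, 273, 906, 820]` on the actual image, the values need converting to pixel coordinates. The sketch below assumes, without confirmation from this page, that coordinates are normalized to a 0 to 999 range and ordered `[x1, y1, x2, y2]`; `to_pixels` is a hypothetical helper name. If the model uses a different convention, adjust `scale` or the unpacking order accordingly.

```python
from typing import Sequence, Tuple


def to_pixels(
    box: Sequence[int], width: int, height: int, scale: int = 999
) -> Tuple[int, int, int, int]:
    """Map a normalized [x1, y1, x2, y2] box onto an image of width x height.

    Assumption: coordinates run from 0 to `scale` on both axes.
    """
    x1, y1, x2, y2 = box
    return (
        round(x1 / scale * width),
        round(y1 / scale * height),
        round(x2 / scale * width),
        round(y2 / scale * height),
    )


# A box covering the whole normalized range maps to the full image.
print(to_pixels([0, 0, 999, 999], 640, 480))
# prints (0, 0, 640, 480)
```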
  • Prediction

    chenxwh/deepseek-vl2:8ea887897e772107ce53f3a7fa4850e78ae88b2b73ff854b4700db9f0d59c7cb
    ID
    54dbyeaknxrme0cm2cgt4eyzgg
    Status
    Succeeded
    Source
    Web
    Hardware
    L40S

    Input

    text
    <|grounding|>I want to throw out the trash now, what should I do?
    top_p
    0.9
    image1
    image1
    temperature
    0.1
    max_new_tokens
    2048
    repetition_penalty
    1.1

    Output

    img_out

    img_out

    text_out

    You can use <|ref|>the trash can<|/ref|><|det|>[[589, 643, 807, 999]]<|/det|> that's located on the sidewalk. It has a lid and is designed for public use.<|end▁of▁sentence|>
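When grounding tags are mixed into ordinary prose, as in the output above, a display layer may want the plain sentence without the markup. This is a minimal sketch under the assumption that every special token has the form `<|…|>`; `strip_grounding` is a hypothetical helper name.

```python
import re

# Removes a whole <|det|>...<|/det|> span, including the coordinates inside it.
DET_RE = re.compile(r"<\|det\|>.*?<\|/det\|>", re.DOTALL)
# Removes any remaining <|...|> token, such as <|ref|> or the end-of-sentence marker.
TAG_RE = re.compile(r"<\|[^|>]*\|>")


def strip_grounding(text_out: str) -> str:
    """Return the output text with boxes and special tokens removed."""
    without_boxes = DET_RE.sub("", text_out)
    return TAG_RE.sub("", without_boxes).strip()


sample = (
    "You can use <|ref|>the trash can<|/ref|>"
    "<|det|>[[589, 643, 807, 999]]<|/det|> that's located on the sidewalk. "
    "It has a lid and is designed for public use.<|end▁of▁sentence|>"
)
print(strip_grounding(sample))
# prints: You can use the trash can that's located on the sidewalk. It has a lid and is designed for public use.
```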
  • Prediction

    chenxwh/deepseek-vl2:8ea887897e772107ce53f3a7fa4850e78ae88b2b73ff854b4700db9f0d59c7cb
    ID
    wksm0fwe7hrmc0cm2cg81fvfbw
    Status
    Succeeded
    Source
    Web
    Hardware
    L40S

    Input

    text
    Give me some suggestions for recipes using all the materials in the three images
    top_p
    0.9
    image1
    image1
    image2
    image2
    image3
    image3
    temperature
    0.1
    max_new_tokens
    2048
    repetition_penalty
    1.1

    Output

    Certainly! Here are a few recipe ideas that incorporate carrots, corn, and raw meat (likely pork ribs):

    ### Carrot Corn Pork Ribs Soup

    **Ingredients:**
    - 4 large carrots
    - 2 ears of corn
    - 1 pound pork ribs
    - 2 tablespoons olive oil
    - 1 onion, chopped
    - 3 cloves garlic, minced
    - 8 cups chicken broth
    - 1 teaspoon dried thyme
    - Salt and pepper to taste

    **Instructions:**
    1. **Preparation:**
       - Peel and chop the carrots into bite-sized pieces.
       - Remove the husks from the corn cobs and cut off the ends.
       - Cut the pork ribs into smaller pieces.
    2. **Cooking:**
       - Heat the olive oil in a large pot over medium heat.
       - Add the chopped onion and sauté until translucent.
       - Add the minced garlic and cook for another minute.
       - Add the pork ribs and brown them on all sides.
       - Pour in the chicken broth and add the thyme.
       - Bring to a boil, then reduce the heat and let it simmer for about an hour or until the meat is tender.
       - Add the carrots and corn cobs, cover, and continue cooking for another 30 minutes.
       - Season with salt and pepper to taste.

    ### Grilled Pork Ribs with Roasted Carrots and Corn Salsa

    **Ingredients:**
    - 4 pork ribs
    - 4 large carrots
    - 2 ears of corn
    - 1/4 cup olive oil
    - 2 tablespoons honey
    - 1 tablespoon soy sauce
    - 1 teaspoon ground cumin
    - 1 teaspoon smoked paprika
    - 1/4 cup fresh cilantro, chopped
    - Salt and pepper to taste

    **Instructions:**
    1. **Marinating:**
       - In a bowl, mix together olive oil, honey, soy sauce, cumin, smoked paprika, and salt and pepper.
       - Place the pork ribs in the marinade, turning to coat evenly. Let it marinate for at least 2 hours, preferably overnight.
    2. **Grilling:**
       - Preheat your grill to medium-high heat.
       - Remove the ribs from the marinade, letting any excess liquid drip off.
       - Grill the ribs for about 5-7 minutes per side, or until they reach your desired level of doneness.
    3. **Roasting:**
       - Preheat your oven to 400°F (200°C).
       - Toss the carrots and corn with olive oil, salt, and pepper. Spread them out on a baking sheet.
       - Roast in the oven for about 20-25 minutes, or until the vegetables are tender and slightly caramelized.
    4. **Assembly:**
       - Serve the grilled pork ribs hot, accompanied by roasted carrots and corn.
       - Garnish with fresh cilantro and serve with a side of salsa made from the remaining marinade ingredients.

    These recipes should provide a delicious and hearty meal using the ingredients shown in the images. Enjoy cooking!<|end▁of▁sentence|>
  • Prediction

    chenxwh/deepseek-vl2:8ea887897e772107ce53f3a7fa4850e78ae88b2b73ff854b4700db9f0d59c7cb
    ID
    00e9h1m96xrmc0cm2cjv7rj8fc
    Status
    Succeeded
    Source
    Web
    Hardware
    L40S

    Input

    text
    are the images showing the same animal? describe each of them in detail
    top_p
    0.9
    image1
    image1
    image2
    image2
    temperature
    0.1
    max_new_tokens
    2048
    repetition_penalty
    1.1

    Output

    No, the two images are not showing the same animal. The first image features a raccoon with distinctive black "mask" markings around its eyes and a bushy tail. It has a grayish-brown fur coat and is looking directly at the camera. The second image shows a red panda with reddish-brown fur, white markings on its face, and a fluffy tail. It is perched on a branch against a clear blue sky background.<|end▁of▁sentence|>