nvidia/nemotron-nano-v2-12b-vl

A multi-modal AI model for visual Q&A, summarization, and data extraction, supporting text, images, and video.

NVIDIA Nemotron Nano VL 12B V2

A 12.6B-parameter vision-language model for document intelligence, visual question answering, and video understanding. It processes up to 4 images per request or analyzes a video, has a 128K-token context length, and supports 10 languages.

What This Model Can Do

  • Analyze multiple images - Compare up to 4 images simultaneously for document analysis, product comparisons, or before/after reviews
  • Understand videos - Automatically extract and analyze video frames for summarization and visual Q&A
  • Extract document data - Parse invoices, receipts, contracts, manuals, and forms with high accuracy
  • Answer visual questions - Ask detailed questions about images, charts, graphs, or scenes
  • Multi-lingual support - Works with English, German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, and Chinese
  • High resolution processing - Handle images up to 3072×1024 pixels with 12-tile layout optimization

Example Uses

  • Single image: Upload a photo and ask “Describe this image in detail” or “What objects are visible?”
  • Compare images: Upload 2-4 images and ask “What are the differences between these images?”
  • Document processing: Upload invoice pages and ask “Extract all line items, prices, and totals”
  • Video analysis: Upload a video and ask “Summarize what happens in this video”

Inputs

Images (upload 1-4):

  • images - List of images (1-4 supported, required if not using video)
  • Supported formats: JPEG, PNG

Video (mutually exclusive with images):

  • video - MP4 video file
  • video_fps (1-30, default: 1) - Frames per second to extract
  • video_pruning_rate (0.0-1.0, default: 0.75) - Higher = faster, lower = more detail

Text:

  • prompt - Your question or instruction about the media

Generation Settings:

  • max_new_tokens (1-2048) - Response length. Auto-set to 1024 for images, 512 for videos
  • temperature (0.0-2.0, default: 0.0) - Set to 0 for consistent results, higher for creativity
  • top_p (0.0-1.0, default: 1.0) - Diversity control
  • system_prompt (default: “/no_think”) - Use default for faster responses
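The input rules above (images or video but not both, plus the documented ranges) can be checked before a request is sent. The following is a minimal sketch, not the model's own client code: `build_input` is a hypothetical helper, and the assumption that inputs are passed as a flat dictionary keyed by the parameter names listed above is mine.

```python
from typing import Any, Dict, List, Optional


def build_input(
    prompt: str,
    images: Optional[List[str]] = None,
    video: Optional[str] = None,
    video_fps: int = 1,
    video_pruning_rate: float = 0.75,
    max_new_tokens: Optional[int] = None,
    temperature: float = 0.0,
    top_p: float = 1.0,
    system_prompt: str = "/no_think",
) -> Dict[str, Any]:
    """Hypothetical helper: validate the documented ranges and build an input dict."""
    # Images and video are mutually exclusive, and one of them is required.
    if bool(images) == bool(video):
        raise ValueError("provide either images or video, but not both")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be in [0.0, 2.0]")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError("top_p must be in [0.0, 1.0]")

    payload: Dict[str, Any] = {
        "prompt": prompt,
        "temperature": temperature,
        "top_p": top_p,
        "system_prompt": system_prompt,
    }
    if images:
        if len(images) > 4:
            raise ValueError("at most 4 images are supported")
        payload["images"] = list(images)
    else:
        if not 1 <= video_fps <= 30:
            raise ValueError("video_fps must be in [1, 30]")
        if not 0.0 <= video_pruning_rate <= 1.0:
            raise ValueError("video_pruning_rate must be in [0.0, 1.0]")
        payload["video"] = video
        payload["video_fps"] = video_fps
        payload["video_pruning_rate"] = video_pruning_rate

    # Leave max_new_tokens out unless set, so the documented auto-setting
    # (1024 for images, 512 for videos) can apply.
    if max_new_tokens is not None:
        if not 1 <= max_new_tokens <= 2048:
            raise ValueError("max_new_tokens must be in [1, 2048]")
        payload["max_new_tokens"] = max_new_tokens
    return payload
```

For example, `build_input("Extract all line items, prices, and totals", images=["page1.png", "page2.png"])` returns a dictionary with the two image paths and the default generation settings. How that dictionary is submitted depends on the hosting platform's client and is not shown here.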

Media Requirements

Images:

  • Resolution: 32×32 to 3072×1024 pixels (various aspect ratios supported)
  • Format: JPEG or PNG, RGB only
  • Quality: High-contrast, well-lit images work best for documents

Videos:

  • Format: MP4
  • Processing: Frames extracted automatically at your chosen FPS
  • Optimal: 2 minutes at 2 fps, adjust based on content length
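The image requirements above can be pre-checked locally. This is a rough sketch under stated assumptions: `check_image` is a hypothetical helper, and because the card says various aspect ratios are supported, the 3072×1024 maximum is read here as a total pixel budget rather than a strict width and height box. That reading is an assumption, not something the card confirms. The `fmt` and `mode` arguments are expected in the form an image library such as Pillow reports them (for example "JPEG" and "RGB").

```python
from typing import List

MIN_SIDE = 32                 # smallest documented side length, in pixels
MAX_PIXELS = 3072 * 1024      # ASSUMPTION: documented maximum treated as a pixel budget
ALLOWED_FORMATS = {"JPEG", "PNG"}


def check_image(width: int, height: int, fmt: str, mode: str) -> List[str]:
    """Hypothetical helper: return a list of problems; empty means the image looks acceptable."""
    problems: List[str] = []
    if fmt.upper() not in ALLOWED_FORMATS:
        problems.append(f"unsupported format {fmt!r}; use JPEG or PNG")
    if mode != "RGB":
        problems.append(f"color mode {mode!r} is not RGB")
    if min(width, height) < MIN_SIDE:
        problems.append(f"{width}x{height} is below the 32x32 minimum")
    if width * height > MAX_PIXELS:
        problems.append(f"{width}x{height} exceeds the assumed pixel budget")
    return problems
```

An empty list means none of the documented constraints were violated under this reading. An image that passes may still be resized or rejected by the model if the real limit is applied differently.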

Tips for Best Results

Writing prompts:

  • Be specific: “List all items and prices” works better than “What’s in this image?”
  • For multiple images: Reference them explicitly (e.g., “Compare the layout of these documents”)
  • For complex reasoning: Remove /no_think from system_prompt

Video settings:

  • Short clips (< 30s): Use video_fps=5
  • Medium clips (30s-2min): Use video_fps=2
  • Long videos (> 2min): Use video_fps=1 (the default) with a higher video_pruning_rate
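The duration-based guidance above can be written as a small lookup, together with the frame count it implies (duration × fps, so 2 minutes at 2 fps is 240 frames). This is a minimal sketch: `suggest_video_settings` and `estimate_frames` are hypothetical helpers, and the pruning rate of 0.9 for long videos is only an illustrative value, since the card says "higher" without giving a number.

```python
from typing import Dict, Union

DEFAULT_PRUNING_RATE = 0.75   # documented default
LONG_VIDEO_PRUNING_RATE = 0.9  # ASSUMPTION: illustrative "higher" value, not from the card


def suggest_video_settings(duration_s: float) -> Dict[str, Union[int, float]]:
    """Hypothetical helper: map clip duration in seconds to the suggested settings."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if duration_s < 30:        # short clips
        return {"video_fps": 5, "video_pruning_rate": DEFAULT_PRUNING_RATE}
    if duration_s <= 120:      # medium clips, 30 s to 2 min
        return {"video_fps": 2, "video_pruning_rate": DEFAULT_PRUNING_RATE}
    # long videos: lowest frame rate plus a higher pruning rate
    return {"video_fps": 1, "video_pruning_rate": LONG_VIDEO_PRUNING_RATE}


def estimate_frames(duration_s: float, video_fps: int) -> int:
    """Approximate number of frames extracted at the chosen rate."""
    return int(duration_s * video_fps)
```

The frame estimate is useful as a sanity check before upload: a 10-minute video at video_fps=1 yields about 600 frames, well above the 240 frames of the suggested 2-minute, 2 fps case, which is why a higher pruning rate is recommended for long videos.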

Performance:

  • Keep temperature=0 for consistent results
  • Use default settings for fastest processing
  • Increase max_new_tokens for longer, detailed responses

Common Use Cases

  • Document processing: Invoices, receipts, contracts, manuals, forms
  • Visual Q&A: Product analysis, scene description, chart interpretation
  • Comparisons: Before/after images, product variants, document cross-reference
  • Video analysis: Summarization, activity recognition, tutorial comprehension
  • Data extraction: Pull structured data from visual sources
  • Multi-page workflows: Process multiple document pages at once

Model Details

  • Parameters: 12.6B
  • Context: 128K tokens
  • Architecture: C-RADIOv2-H vision encoder + Nemotron Nano V2 language model
  • Optimized for: NVIDIA GPUs (A100, H100, H200, B200)
  • Developed by: NVIDIA

License/Terms of Use

Use of this model is governed by the NVIDIA Software and Model Evaluation License Agreement