
NVIDIA Nemotron Nano VL 12B V2

A 12.6B-parameter vision-language model for document intelligence, visual question answering, and video understanding. It accepts up to 4 images per request or a single video, offers a 128K-token context length, and supports 10 languages.

What This Model Can Do

Analyze multiple images - Compare up to 4 images simultaneously for document analysis, product comparisons, or before/after reviews
Understand videos - Automatically extract and analyze video frames for summarization and visual Q&A
Extract document data - Parse invoices, receipts, contracts, manuals, and forms with high accuracy
Answer visual questions - Ask detailed questions about images, charts, graphs, or scenes
Multilingual support - Works with English, German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, and Chinese
High resolution processing - Handle images up to 3072×1024 pixels with 12-tile layout optimization

Example Uses

Single image: Upload a photo and ask “Describe this image in detail” or “What objects are visible?”
Compare images: Upload 2-4 images and ask “What are the differences between these images?”
Document processing: Upload invoice pages and ask “Extract all line items, prices, and totals”
Video analysis: Upload a video and ask “Summarize what happens in this video”

Inputs

Images (upload 1-4):
- images - List of images (1-4 supported, required if not using video)
- Supported formats: JPEG, PNG

Video (mutually exclusive with images):
- video - MP4 video file
- video_fps (1-30, default: 1) - Frames per second to extract
- video_pruning_rate (0.0-1.0, default: 0.75) - Higher = faster, lower = more detail

Text:
- prompt - Your question or instruction about the media

Generation Settings:
- max_new_tokens (1-2048) - Response length. Auto-set to 1024 for images, 512 for videos
- temperature (0.0-2.0, default: 0.0) - Set to 0 for consistent results, higher for creativity
- top_p (0.0-1.0, default: 1.0) - Diversity control
- system_prompt (default: “/no_think”) - Use default for faster responses
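The input rules above can be checked before a request is sent. The following is a minimal sketch, not part of the model's API: `build_input` is a hypothetical helper that only assembles a plain dictionary, and the ranges and defaults are copied from the list above. How the dictionary is submitted depends on the client you use and is not shown.

```python
from typing import Optional, Sequence

MAX_IMAGES = 4  # documented limit: 1-4 images per request


def _check_range(name: str, value: float, low: float, high: float) -> None:
    """Raise ValueError if value falls outside the documented range."""
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


def build_input(
    prompt: str,
    images: Optional[Sequence[str]] = None,
    video: Optional[str] = None,
    video_fps: float = 1,
    video_pruning_rate: float = 0.75,
    max_new_tokens: Optional[int] = None,
    temperature: float = 0.0,
    top_p: float = 1.0,
    system_prompt: str = "/no_think",
) -> dict:
    """Hypothetical helper: validate settings and return an input dict."""
    # Images and video are mutually exclusive, and one of them is required.
    if bool(images) == bool(video):
        raise ValueError("provide either images or video, but not both")
    if images and len(images) > MAX_IMAGES:
        raise ValueError(f"at most {MAX_IMAGES} images are supported")

    _check_range("temperature", temperature, 0.0, 2.0)
    _check_range("top_p", top_p, 0.0, 1.0)

    payload = {
        "prompt": prompt,
        "temperature": temperature,
        "top_p": top_p,
        "system_prompt": system_prompt,
    }
    if images:
        payload["images"] = list(images)
    else:
        _check_range("video_fps", video_fps, 1, 30)
        _check_range("video_pruning_rate", video_pruning_rate, 0.0, 1.0)
        payload["video"] = video
        payload["video_fps"] = video_fps
        payload["video_pruning_rate"] = video_pruning_rate

    # Leave max_new_tokens unset to keep the documented auto-setting.
    if max_new_tokens is not None:
        _check_range("max_new_tokens", max_new_tokens, 1, 2048)
        payload["max_new_tokens"] = max_new_tokens
    return payload
```

For example, `build_input("Extract all line items, prices, and totals", images=["page1.png", "page2.png"])` returns a dictionary with the two image references and the default generation settings. The file names here are placeholders.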

Media Requirements

Images:
- Resolution: 32×32 to 3072×1024 pixels (various aspect ratios supported)
- Format: JPEG or PNG, RGB only
- Quality: High-contrast, well-lit images work best for documents

Videos:
- Format: MP4
- Processing: Frames extracted automatically at your chosen FPS
- Optimal: 2 minutes at 2 fps; adjust based on content length
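To get a feel for how clip length and FPS interact, the number of extracted frames can be approximated as duration times frames per second. This is a rough sketch with a hypothetical helper name. The exact count the model extracts may differ, and the effect of video_pruning_rate is not modeled here.

```python
def estimate_frame_count(duration_s: float, video_fps: float = 1) -> int:
    """Approximate number of frames sampled from a clip (illustrative only)."""
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    if not 1 <= video_fps <= 30:
        raise ValueError("video_fps must be between 1 and 30")
    return round(duration_s * video_fps)


# The "2 minutes at 2 fps" case from the list above:
print(estimate_frame_count(120, 2))  # 240
```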

Tips for Best Results

Writing prompts:
- Be specific: “List all items and prices” works better than “What’s in this image?”
- For multiple images: Reference them explicitly (e.g., “Compare the layout of these documents”)
- For complex reasoning: Remove /no_think from system_prompt

Video settings:
- Short clips (< 30s): Use video_fps=5
- Medium clips (30s-2min): Use video_fps=2
- Long videos (> 2min): Use video_fps=1 (the default) with a higher video_pruning_rate
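The duration thresholds above can be written as a small lookup. This is a sketch only: `suggest_video_fps` is a hypothetical helper, not something the model provides. It leaves video_pruning_rate to the caller because the guidance gives no specific value for long videos.

```python
def suggest_video_fps(duration_s: float) -> int:
    """Map clip duration in seconds to the video_fps suggested in the tips."""
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    if duration_s < 30:      # short clips
        return 5
    if duration_s <= 120:    # medium clips, 30 s to 2 min
        return 2
    return 1                 # long videos; consider a higher pruning rate too
```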

Performance:
- Keep temperature=0 for consistent results
- Use default settings for fastest processing
- Increase max_new_tokens for longer, detailed responses

Common Use Cases

Document processing: Invoices, receipts, contracts, manuals, forms
Visual Q&A: Product analysis, scene description, chart interpretation
Comparisons: Before/after images, product variants, document cross-reference
Video analysis: Summarization, activity recognition, tutorial comprehension
Data extraction: Pull structured data from visual sources
Multi-page workflows: Process multiple document pages at once

Model Details

Parameters: 12.6B
Context: 128K tokens
Architecture: C-RADIOv2-H vision encoder + Nemotron Nano V2 language model
Optimized for: NVIDIA GPUs (A100, H100, H200, B200)
Developed by: NVIDIA

License/Terms of Use

Use of this model is governed by the NVIDIA Software and Model Evaluation License Agreement.

Model created 4 months, 3 weeks ago
