NVIDIA Nemotron Nano VL 12B V2
A 12.6B-parameter vision-language model for document intelligence, visual question answering, and video understanding. It processes up to 4 images at once or analyzes video, with a 128K-token context length, and supports 10 languages.
What This Model Can Do
- Analyze multiple images - Compare up to 4 images simultaneously for document analysis, product comparisons, or before/after reviews
- Understand videos - Automatically extract and analyze video frames for summarization and visual Q&A
- Extract document data - Parse invoices, receipts, contracts, manuals, and forms with high accuracy
- Answer visual questions - Ask detailed questions about images, charts, graphs, or scenes
- Multilingual support - Works with English, German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, and Chinese
- High resolution processing - Handle images up to 3072×1024 pixels with 12-tile layout optimization
Example Uses
- Single image: Upload a photo and ask “Describe this image in detail” or “What objects are visible?”
- Compare images: Upload 2-4 images and ask “What are the differences between these images?”
- Document processing: Upload invoice pages and ask “Extract all line items, prices, and totals”
- Video analysis: Upload a video and ask “Summarize what happens in this video”
Inputs
Images (upload 1-4):
- images - List of images (1-4 supported, required if not using video)
- Supported formats: JPEG, PNG
Video (mutually exclusive with images):
- video - MP4 video file
- video_fps (1-30, default: 1) - Frames per second to extract
- video_pruning_rate (0.0-1.0, default: 0.75) - Higher values run faster; lower values keep more detail
Text:
- prompt - Your question or instruction about the media
Generation Settings:
- max_new_tokens (1-2048) - Response length. Auto-set to 1024 for images, 512 for videos
- temperature (0.0-2.0, default: 0.0) - Set to 0 for consistent results, higher for creativity
- top_p (0.0-1.0, default: 1.0) - Diversity control
- system_prompt (default: “/no_think”) - Use default for faster responses
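The inputs above can be collected and range-checked before a request is sent. The following is a minimal sketch, not part of the model's API: `build_input` is a hypothetical helper, the file names are placeholders, and it assumes that leaving `max_new_tokens` out lets the automatic value apply. How the resulting dictionary is submitted depends on the client you use and is not shown.

```python
from typing import Any, Dict, Optional, Sequence


def build_input(
    prompt: str,
    images: Optional[Sequence[str]] = None,
    video: Optional[str] = None,
    video_fps: int = 1,
    video_pruning_rate: float = 0.75,
    max_new_tokens: Optional[int] = None,
    temperature: float = 0.0,
    top_p: float = 1.0,
    system_prompt: str = "/no_think",
) -> Dict[str, Any]:
    """Hypothetical helper: check the documented ranges and return an input dict."""
    # Images and video are mutually exclusive, and one of them is required.
    if bool(images) == bool(video):
        raise ValueError("provide either images or video, not both and not neither")
    if images and not 1 <= len(images) <= 4:
        raise ValueError("images must contain 1 to 4 items")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError("top_p must be between 0.0 and 1.0")
    if max_new_tokens is not None and not 1 <= max_new_tokens <= 2048:
        raise ValueError("max_new_tokens must be between 1 and 2048")

    payload: Dict[str, Any] = {
        "prompt": prompt,
        "temperature": temperature,
        "top_p": top_p,
        "system_prompt": system_prompt,
    }
    # Assumption: omitting max_new_tokens leaves the automatic value in effect.
    if max_new_tokens is not None:
        payload["max_new_tokens"] = max_new_tokens

    if images:
        payload["images"] = list(images)
    else:
        if not 1 <= video_fps <= 30:
            raise ValueError("video_fps must be between 1 and 30")
        if not 0.0 <= video_pruning_rate <= 1.0:
            raise ValueError("video_pruning_rate must be between 0.0 and 1.0")
        payload["video"] = video
        payload["video_fps"] = video_fps
        payload["video_pruning_rate"] = video_pruning_rate
    return payload


# Placeholder file names, for illustration only.
invoice_request = build_input(
    "Extract all line items, prices, and totals",
    images=["page1.png", "page2.png"],
)
```

Validating on the client side surfaces a bad setting (for example, five images, or images together with a video) before any upload happens.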
Media Requirements
Images:
- Resolution: 32×32 to 3072×1024 pixels (various aspect ratios supported)
- Format: JPEG or PNG, RGB only
- Quality: High-contrast, well-lit images work best for documents
Videos:
- Format: MP4
- Processing: Frames extracted automatically at your chosen FPS
- Optimal: 2 minutes at 2 fps, adjust based on content length
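To get a feel for how clip length and FPS interact, the number of extracted frames can be estimated as duration times FPS. This is a rough sketch: `estimated_frames` is a hypothetical helper, and the actual extractor may round or cap frames differently.

```python
def estimated_frames(duration_s: float, video_fps: int) -> int:
    """Rough estimate of frames extracted: clip duration (seconds) times FPS."""
    return int(duration_s * video_fps)


# The "2 minutes at 2 fps" case from the requirements above.
print(estimated_frames(120, 2))  # 240
```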
Tips for Best Results
Writing prompts:
- Be specific: “List all items and prices” works better than “What’s in this image?”
- For multiple images: Reference them explicitly (e.g., “Compare the layout of these documents”)
- For complex reasoning: Remove /no_think from system_prompt
Video settings:
- Short clips (< 30s): Use video_fps=5
- Medium clips (30s-2min): Use video_fps=2
- Long videos (> 2min): Use video_fps=1 (the default) with a higher video_pruning_rate
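The video guidance above can be expressed as a small lookup. This is a sketch only: `suggest_video_fps` is a hypothetical helper, and treating exactly 30 seconds and exactly 2 minutes as "medium" is an assumption, since the tips do not say which side the boundaries fall on. No pruning-rate value is chosen here because the tips only say "higher".

```python
def suggest_video_fps(duration_s: float) -> int:
    """Map clip duration in seconds to the video_fps suggested in the tips."""
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    if duration_s < 30:
        return 5  # short clips
    if duration_s <= 120:
        return 2  # medium clips (boundaries assumed inclusive)
    return 1      # long videos; consider raising video_pruning_rate as well


for seconds in (10, 90, 300):
    print(seconds, suggest_video_fps(seconds))
```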
Performance:
- Keep temperature=0 for consistent results
- Use default settings for fastest processing
- Increase max_new_tokens for longer, detailed responses
Common Use Cases
- Document processing: Invoices, receipts, contracts, manuals, forms
- Visual Q&A: Product analysis, scene description, chart interpretation
- Comparisons: Before/after images, product variants, document cross-reference
- Video analysis: Summarization, activity recognition, tutorial comprehension
- Data extraction: Pull structured data from visual sources
- Multi-page workflows: Process multiple document pages at once
Model Details
- Parameters: 12.6B
- Context: 128K tokens
- Architecture: C-RADIOv2-H vision encoder + Nemotron Nano V2 language model
- Optimized for: NVIDIA GPUs (A100, H100, H200, B200)
- Developed by: NVIDIA
License/Terms of Use
Use of this model is governed by the NVIDIA Software and Model Evaluation License Agreement.