nvidia/nemotron-nano-v2-12b-vl

A multi-modal AI model for visual Q&A, summarization, and data extraction, supporting text, images, and video.

Run time and cost

This model costs approximately $0.14 per run on Replicate (about 7 runs per $1), though the cost varies with your inputs. It is also open source, and you can run it on your own computer with Docker.

This model runs on NVIDIA A100 (80GB) GPU hardware. Predictions typically complete within 100 seconds, but prediction time varies significantly with the inputs.
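As a rough budgeting aid, the per-run figure quoted above can be turned into a quick estimate. This is only a sketch: `runs_per_dollar` and `estimate_cost` are hypothetical helper names, and the $0.14 figure is the approximate value from this page, so real costs will differ with input size and prediction time.

```python
import math

COST_PER_RUN_USD = 0.14  # approximate figure quoted above; actual cost varies


def runs_per_dollar(cost_per_run: float = COST_PER_RUN_USD) -> int:
    """Whole runs that fit in $1 at the given per-run cost."""
    return math.floor(1 / cost_per_run)


def estimate_cost(num_runs: int, cost_per_run: float = COST_PER_RUN_USD) -> float:
    """Rough total in USD for num_runs predictions, rounded to cents."""
    return round(num_runs * cost_per_run, 2)


print(runs_per_dollar())   # 7
print(estimate_cost(100))  # 14.0
```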

Readme

NVIDIA Nemotron Nano VL 12B V2

A powerful 12.6B parameter vision-language model for document intelligence, visual question answering, and video understanding. Process up to 4 images at once or analyze videos with 128K context length. Supports 10 languages.

What This Model Can Do

  • Analyze multiple images - Compare up to 4 images simultaneously for document analysis, product comparisons, or before/after reviews
  • Understand videos - Automatically extract and analyze video frames for summarization and visual Q&A
  • Extract document data - Parse invoices, receipts, contracts, manuals, and forms with high accuracy
  • Answer visual questions - Ask detailed questions about images, charts, graphs, or scenes
  • Multi-lingual support - Works with English, German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, and Chinese
  • High resolution processing - Handle images up to 3072×1024 pixels with 12-tile layout optimization

Example Uses

  • Single image: Upload a photo and ask “Describe this image in detail” or “What objects are visible?”
  • Compare images: Upload 2-4 images and ask “What are the differences between these images?”
  • Document processing: Upload invoice pages and ask “Extract all line items, prices, and totals”
  • Video analysis: Upload a video and ask “Summarize what happens in this video”

Inputs

Images (upload 1-4):
  • images - List of images (1-4 supported, required if not using video)
  • Supported formats: JPEG, PNG

Video (mutually exclusive with images):
  • video - MP4 video file
  • video_fps (1-30, default: 1) - Frames per second to extract
  • video_pruning_rate (0.0-1.0, default: 0.75) - Higher = faster, lower = more detail

Text:
  • prompt - Your question or instruction about the media

Generation Settings:
  • max_new_tokens (1-2048) - Response length. Auto-set to 1024 for images, 512 for videos
  • temperature (0.0-2.0, default: 0.0) - Set to 0 for consistent results, higher for creativity
  • top_p (0.0-1.0, default: 1.0) - Diversity control
  • system_prompt (default: “/no_think”) - Use default for faster responses
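The input constraints above can be checked locally before submitting a prediction. The following is a minimal sketch, not official client code: `build_input`, `_check`, and `run_prediction` are hypothetical helper names, the field names and ranges are copied from the list above, and the example URLs are placeholders. It assumes the Replicate Python client (`replicate.run`) with an API token configured in the environment, and that omitting `max_new_tokens` lets the model apply its automatic value.

```python
from typing import Any, Optional, Sequence

MODEL = "nvidia/nemotron-nano-v2-12b-vl"  # model name as shown on this page


def _check(name: str, value: float, low: float, high: float) -> None:
    """Raise ValueError if value falls outside the documented [low, high] range."""
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


def build_input(
    prompt: str,
    images: Optional[Sequence[Any]] = None,
    video: Optional[Any] = None,
    video_fps: int = 1,
    video_pruning_rate: float = 0.75,
    max_new_tokens: Optional[int] = None,
    temperature: float = 0.0,
    top_p: float = 1.0,
    system_prompt: str = "/no_think",
) -> dict:
    """Validate settings against the ranges listed above and build the input dict."""
    # Images and video are mutually exclusive, and one of them is required.
    if bool(images) == (video is not None):
        raise ValueError("provide either 1-4 images or one video, but not both")
    _check("temperature", temperature, 0.0, 2.0)
    _check("top_p", top_p, 0.0, 1.0)

    payload: dict = {
        "prompt": prompt,
        "temperature": temperature,
        "top_p": top_p,
        "system_prompt": system_prompt,
    }
    if images:
        _check("number of images", len(images), 1, 4)
        payload["images"] = list(images)
    else:
        _check("video_fps", video_fps, 1, 30)
        _check("video_pruning_rate", video_pruning_rate, 0.0, 1.0)
        payload.update(video=video, video_fps=video_fps,
                       video_pruning_rate=video_pruning_rate)
    # Assumption: leaving max_new_tokens out lets the model pick 1024 or 512 itself.
    if max_new_tokens is not None:
        _check("max_new_tokens", max_new_tokens, 1, 2048)
        payload["max_new_tokens"] = max_new_tokens
    return payload


def run_prediction(payload: dict) -> Any:
    """Submit the payload via the Replicate client (needs network and an API token)."""
    import replicate  # third-party package: pip install replicate

    # Depending on the client version, a ":<version>" suffix on MODEL may be needed.
    return replicate.run(MODEL, input=payload)


# Placeholder URLs; local files would be passed as open binary file handles instead.
payload = build_input(
    "Extract all line items, prices, and totals",
    images=["https://example.com/invoice-page-1.png",
            "https://example.com/invoice-page-2.png"],
)
# output = run_prediction(payload)  # uncomment to send the request
```

Keeping validation separate from the network call means out-of-range settings fail fast, before any billed prediction starts.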

Media Requirements

Images:
  • Resolution: 32×32 to 3072×1024 pixels (various aspect ratios supported)
  • Format: JPEG or PNG, RGB only
  • Quality: High-contrast, well-lit images work best for documents

Videos:
  • Format: MP4
  • Processing: Frames extracted automatically at your chosen FPS
  • Optimal: 2 minutes at 2 fps, adjust based on content length

Tips for Best Results

Writing prompts:
  • Be specific: “List all items and prices” works better than “What’s in this image?”
  • For multiple images: Reference them explicitly (e.g., “Compare the layout of these documents”)
  • For complex reasoning: Remove /no_think from system_prompt

Video settings:
  • Short clips (< 30s): Use video_fps=5
  • Medium clips (30s-2min): Use video_fps=2
  • Long videos (> 2min): Use video_fps=1 (the default) with a higher pruning rate
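The clip-length tiers above can be expressed as a small lookup. This is a sketch under stated assumptions: `suggest_video_settings` is a hypothetical helper, and the frame estimate assumes extraction yields roughly duration × fps frames. No specific pruning rate is suggested for long videos, since the text only says "higher".

```python
import math


def suggest_video_settings(duration_s: float) -> dict:
    """Map clip length in seconds to the video_fps tiers suggested above."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if duration_s < 30:        # short clips
        fps = 5
    elif duration_s <= 120:    # medium clips, 30 s to 2 min
        fps = 2
    else:                      # long videos; also consider raising video_pruning_rate
        fps = 1
    # Assumption: about one frame is extracted every 1/fps seconds.
    return {"video_fps": fps, "estimated_frames": math.ceil(duration_s * fps)}


print(suggest_video_settings(120))  # {'video_fps': 2, 'estimated_frames': 240}
```

For the "2 minutes at 2 fps" case this gives about 240 frames, which is a useful sanity check before raising `video_fps` on longer clips.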

Performance:
  • Keep temperature=0 for consistent results
  • Use default settings for fastest processing
  • Increase max_new_tokens for longer, detailed responses

Common Use Cases

  • Document processing: Invoices, receipts, contracts, manuals, forms
  • Visual Q&A: Product analysis, scene description, chart interpretation
  • Comparisons: Before/after images, product variants, document cross-reference
  • Video analysis: Summarization, activity recognition, tutorial comprehension
  • Data extraction: Pull structured data from visual sources
  • Multi-page workflows: Process multiple document pages at once

Model Details

  • Parameters: 12.6B
  • Context: 128K tokens
  • Architecture: C-RADIOv2-H vision encoder + Nemotron Nano V2 language model
  • Optimized for: NVIDIA GPUs (A100, H100, H200, B200)
  • Developed by: NVIDIA

License/Terms of Use

Use of this model is governed by the NVIDIA Software and Model Evaluation License Agreement.