lucataco/deepseek-ocr

Convert documents to markdown, extract raw text, and locate specific content

Run time and cost

This model costs approximately $0.0036 to run on Replicate, or around 277 runs per $1, though this varies depending on your inputs. It is also open source, and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 4 seconds.

DeepSeek OCR

Extract text from images and convert documents into clean, structured markdown.

What it does

DeepSeek OCR reads text from images and turns it into markdown. Upload a screenshot, PDF, scanned document, or photo containing text, and it extracts everything while preserving structure such as tables, headings, and formatting.

This model handles:

Documents and PDFs: Academic papers, financial reports, textbooks, newspapers, and handwritten notes across roughly 100 languages

Complex layouts: Multi-column documents, tables, forms, and receipts with their structure intact

Mathematical content: Equations and formulas converted to LaTeX format

Scientific content: Chemical formulas and geometric figures

Charts and visualizations: Data extraction from graphs and charts into structured formats
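
A minimal way to run the model is through the Replicate Python client. This is a sketch: the input field names (image, task_type) are assumptions for illustration, so check the model's API schema for the exact parameters.

```python
# Minimal sketch using the official Replicate Python client.
# Input field names here are assumptions; see the model's API schema.
import replicate

output = replicate.run(
    "lucataco/deepseek-ocr",
    input={
        "image": open("scanned_page.png", "rb"),  # local file, or pass a URL string
        "task_type": "Convert to markdown",       # hypothetical option name
    },
)
print(output)  # markdown reconstructed from the page
```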

How it works differently

Most optical character recognition tools just find text and spit it back out. DeepSeek OCR understands the entire document as a visual sequence. It sees the layout, interprets the structure, and generates markdown like a person would write it.

The result is markdown you can immediately use in notebooks, documentation, or feed into other AI models without cleanup.

What makes it interesting

DeepSeek OCR compresses visual information remarkably efficiently. A document that would normally need 700-800 text tokens can be processed using just 100 vision tokens while maintaining roughly 97% accuracy. In other words, it treats documents as compressed visual data rather than as characters to be extracted one by one.

The model adapts its compression based on document complexity. Simple slides might use 64 tokens, while dense newspapers automatically switch to a higher-detail mode using around 800 tokens.
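
To make that trade-off concrete, here is a small sketch built only from the numbers above; the category labels are hypothetical, not modes exposed by the model's API.

```python
# Illustrative vision-token budgets per page, using figures from this readme.
# The category names are hypothetical labels, not part of the model's API.
VISION_TOKEN_BUDGETS = {
    "simple slide": 64,
    "typical page": 100,
    "dense newspaper": 800,
}

def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """How many text tokens each vision token effectively stands in for."""
    return text_tokens / vision_tokens

# A page that would need ~750 text tokens, encoded as 100 vision tokens:
print(f"{compression_ratio(750, 100):.1f}x compression")  # -> 7.5x compression
```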

Performance

On a single A100 GPU, DeepSeek OCR can process over 200,000 pages per day at around 2,500 tokens per second. It achieves state-of-the-art accuracy on document parsing benchmarks while using fewer tokens than other models.
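
A quick back-of-envelope check shows those two claims are consistent, assuming sustained generation:

```python
# Back-of-envelope: does 2,500 tokens/s support 200,000 pages/day?
tokens_per_second = 2_500
seconds_per_day = 24 * 60 * 60

tokens_per_day = tokens_per_second * seconds_per_day  # 216,000,000
pages_per_day = 200_000
tokens_per_page = tokens_per_day / pages_per_day      # 1,080

print(f"{tokens_per_day:,} tokens/day ~= {tokens_per_page:,.0f} tokens/page")
```

That works out to roughly 1,000 tokens per page, a plausible budget given the 64-800 vision tokens discussed above plus the generated markdown itself.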

Example outputs

The model excels at preserving document structure:

Tables: Extracts rows and columns into proper HTML or markdown table format, maintaining alignment and relationships between cells

Equations: Recognizes mathematical expressions and outputs them as properly formatted LaTeX

Multi-language documents: Handles mixed scripts within the same document, like Korean and English side-by-side

Handwritten notes: Digitizes handwritten text while attempting to preserve the original structure
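
For a sense of the output format, this is the kind of markdown a simple receipt might come back as (a constructed sample, not actual model output):

```markdown
## Receipt

| Item     | Qty | Unit price | Amount |
|----------|-----|------------|--------|
| Notebook | 2   | $4.50      | $9.00  |
| Pens     | 5   | $1.20      | $6.00  |

**Total: $15.00**
```

An equation on the page would similarly be emitted as LaTeX, for example $x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$.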

Common use cases

Document digitization: Convert scanned papers, books, or archival materials into searchable, structured text

Data extraction: Pull tables and figures from reports for analysis

Invoice processing: Extract line items, totals, and structured data from receipts and invoices

Academic research: Convert PDFs of papers into markdown for note-taking or further processing

Training data generation: Process large volumes of documents to create datasets for training other AI models
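
For batch jobs like dataset generation, the per-image call shown earlier extends naturally to a loop. Again a sketch, reusing the same assumed input names:

```python
# Sketch: batch-convert a folder of scans to markdown files.
# Reuses the hypothetical input names from the earlier example.
from pathlib import Path

import replicate

out_dir = Path("markdown")
out_dir.mkdir(exist_ok=True)

for scan in sorted(Path("scans").glob("*.png")):
    with scan.open("rb") as f:
        result = replicate.run(
            "lucataco/deepseek-ocr",
            input={"image": f, "task_type": "Convert to markdown"},
        )
    (out_dir / f"{scan.stem}.md").write_text(str(result))
```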

Technical details

The model combines a visual encoder (around 380 million parameters) with a small mixture-of-experts language model decoder (3 billion parameters with 570 million activated). The encoder uses both local window attention for fine details and global attention for broader context understanding.
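
The parameter arithmetic helps explain the speed: only a fraction of the decoder's weights fire per token. A quick tally from the numbers above:

```python
# Parameter accounting from the figures above: MoE decoders activate
# only a subset of experts per token.
encoder_params = 380e6   # visual encoder
decoder_total = 3e9      # MoE decoder, all experts
decoder_active = 570e6   # decoder weights activated per token

active = encoder_params + decoder_active  # ~0.95B used per token
total = encoder_params + decoder_total    # ~3.38B held in memory
print(f"~{active / 1e9:.2f}B active of ~{total / 1e9:.2f}B total parameters")
```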

DeepSeek OCR was trained on 30 million PDF pages covering approximately 100 languages, plus synthetic data including 10 million charts, 5 million chemical formulas, and 1 million geometric figures.

Learn more

For technical details and the research behind the model, check out the DeepSeek OCR paper and GitHub repository.

Try the model yourself in the Replicate Playground.