Extract text from documents and images with Datalab Marker and OCR

Posted October 21, 2025

Datalab’s state-of-the-art document parsing and text extraction models are now on Replicate.


Marker turns PDF, DOCX, PPTX, images (and more!) into markdown or JSON. It formats tables, math, and code, extracts images, and can pull specific fields when you pass a JSON Schema.

OCR detects text in ninety languages from images and documents, and returns reading order and table grids.

The Marker model is based on the popular open source Marker project (29k GitHub stars), and OCR is based on Surya (19k GitHub stars).

Run Marker and OCR on Replicate:

Run Marker

import replicate

output = replicate.run(
    "datalab-to/marker",
    input={
        "file": open("report.pdf", "rb"),
        "mode": "balanced",  # fast / balanced / accurate
        "include_metadata": True,  # return page-level JSON metadata
    },
)
print(output["markdown"][:400])

Run OCR

import replicate

output = replicate.run(
    "datalab-to/ocr",
    input={
        "file": open("receipt.jpg", "rb"),
        "visualize": True,  # return the input image with red polygons around detected text
        "return_pages": True,  # return layout data
    },
)
print(output["text"][:200])

Visit the models on Replicate for code snippets in other languages.

Both models are fast and accurate, and they outperform established tools like Tesseract. Marker processes a page in about 0.18 seconds and can hit 120 pages per second when batched.
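To get a feel for what those two figures mean for a large job, here is a back-of-the-envelope sketch. The helper name `estimate_seconds` is hypothetical, and it assumes processing time scales linearly with page count. Real runs on Replicate also include upload, queueing, and cold-start time, so treat the result as a floor rather than a prediction.

```python
# Figures quoted in this post; everything else here is an illustrative assumption.
SECONDS_PER_PAGE = 0.18          # time to process a single page
BATCHED_PAGES_PER_SECOND = 120   # peak throughput when batched


def estimate_seconds(pages: int, batched: bool = False) -> float:
    """Hypothetical helper: estimated processing time, assuming linear scaling."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if batched:
        return pages / BATCHED_PAGES_PER_SECOND
    return pages * SECONDS_PER_PAGE


if __name__ == "__main__":
    pages = 10_000
    print(f"one page at a time: {estimate_seconds(pages):.0f} s")
    print(f"batched:            {estimate_seconds(pages, batched=True):.0f} s")
```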

Structured extraction

Marker can also do structured extraction: pass a JSON Schema as page_schema and it pulls out the fields you describe. For example, you can extract specific fields from an invoice:

import json
import replicate

schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "total": {"type": "number"}
    }
}

output = replicate.run(
    "datalab-to/marker",
    input={
        "file": "https://multimedia-example-files.replicate.dev/replicator-invoice.1page.pdf",
        "page_schema": json.dumps(schema),
    }
)
structured_data = json.loads(output["extraction_schema_json"])
print(structured_data)
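Extracted values come from a model, so it can be worth checking them against your schema before using them downstream. The sketch below is a minimal, standard-library-only check. The function `check_fields` is hypothetical, the sample dictionary is made up rather than real model output, and it assumes the extracted data is a flat dictionary keyed by your schema's property names. For anything beyond simple types, a full JSON Schema validator is a better fit.

```python
# Map JSON Schema type names to Python types (only the simple ones).
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
}

schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "total": {"type": "number"},
    },
}


def check_fields(data: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means every field matched."""
    problems = []
    for name, spec in schema.get("properties", {}).items():
        value = data.get(name)
        if value is None:
            problems.append(f"{name}: missing")
            continue
        expected = JSON_TYPES.get(spec.get("type"))
        if expected is None:
            continue  # type not covered by this sketch
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        bool_misuse = isinstance(value, bool) and spec["type"] != "boolean"
        if bool_misuse or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {spec['type']}, got {type(value).__name__}"
            )
    return problems


# Made-up sample data for illustration, not real model output.
sample = {
    "vendor": "Example Co",
    "invoice_number": "INV-001",
    "date": "2025-01-01",
    "total": "100.00",  # a string where the schema asks for a number
}
print(check_fields(sample, schema))
```

You could run a check like this on structured_data and retry or flag the document when the list is not empty.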

Performance

Marker was evaluated on olmOCR-Bench, a benchmark of 1,403 PDF files with 7,010 unit test cases. The tests measure how accurately an OCR system converts PDF documents to markdown while preserving critical textual and structural information.

Marker in balanced mode has the highest overall score of all models tested, including GPT-4o, Deepseek OCR, Mistral OCR, and olmOCR.

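| Model | ArXiv | Old Scans Math | Tables | Old Scans | Headers and Footers | Multi column | Long tiny text | Base | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Datalab Marker (Balanced mode) | 81.4 | 80.3 | 89.4 | 50.0 | 88.3 | 81.0 | 91.6 | 99.9 | 82.7 ± 0.9 |
| Datalab Marker (Fast mode) | 83.8 | 69.7 | 74.8 | 32.3 | 86.6 | 79.4 | 85.7 | 99.6 | 76.5 ± 1.0 |
| Mistral OCR API | 77.2 | 67.5 | 60.6 | 29.3 | 93.6 | 71.3 | 77.1 | 99.4 | 72.0 ± 1.1 |
| Deepseek OCR | 75.2 | 67.9 | 79.1 | 32.9 | 96.1 | 66.3 | 78.5 | 97.7 | 74.2 ± 1.0 |
| Nanonets OCR | 67.0 | 68.6 | 77.7 | 39.5 | 40.7 | 69.9 | 53.4 | 99.3 | 64.5 ± 1.1 |
| GPT-4o (Anchored) | 53.5 | 74.5 | 70.0 | 40.7 | 93.8 | 69.3 | 60.6 | 96.8 | 69.9 ± 1.1 |
| Gemini Flash 2 (Anchored) | 54.5 | 56.1 | 72.1 | 34.2 | 64.7 | 61.5 | 71.5 | 95.6 | 63.8 ± 1.2 |
| Qwen 2.5 VL (No Anchor) | 63.1 | 65.7 | 67.3 | 38.6 | 73.6 | 68.3 | 49.1 | 98.3 | 65.5 ± 1.2 |
| olmOCR v0.3.0 | 78.6 | 79.9 | 72.9 | 43.9 | 95.1 | 77.3 | 81.2 | 98.9 | 78.5 ± 1.1 |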

Pricing

Marker costs:

  • $4 per 1000 pages without page_schema in fast and balanced modes.
  • $6 per 1000 pages when doing structured extraction with page_schema.
  • $6 per 1000 pages in accurate mode.

OCR costs $2 per 1000 pages.
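If you want to budget a job before running it, the prices above reduce to a small calculation. The helpers `marker_cost` and `ocr_cost` below are hypothetical, not part of any Replicate client. The sketch assumes billing is strictly proportional to page count, and that accurate mode combined with page_schema is still $6 per 1000 pages, which this post does not state explicitly.

```python
# Rates in USD per 1000 pages, taken from the prices listed above.
MARKER_BASE_RATE = 4.0      # fast or balanced mode, no page_schema
MARKER_PREMIUM_RATE = 6.0   # accurate mode, or structured extraction with page_schema
OCR_RATE = 2.0

MARKER_MODES = ("fast", "balanced", "accurate")


def marker_cost(pages: int, mode: str = "balanced", structured: bool = False) -> float:
    """Hypothetical helper: estimated Marker cost in USD."""
    if mode not in MARKER_MODES:
        raise ValueError(f"mode must be one of {MARKER_MODES}")
    # Assumption: accurate mode plus page_schema is billed at the same $6 rate.
    premium = structured or mode == "accurate"
    rate = MARKER_PREMIUM_RATE if premium else MARKER_BASE_RATE
    return pages * rate / 1000


def ocr_cost(pages: int) -> float:
    """Hypothetical helper: estimated OCR cost in USD."""
    return pages * OCR_RATE / 1000


if __name__ == "__main__":
    pages = 2_500
    print(f"Marker, balanced:        ${marker_cost(pages):.2f}")
    print(f"Marker, with schema:     ${marker_cost(pages, structured=True):.2f}")
    print(f"Marker, accurate:        ${marker_cost(pages, mode='accurate'):.2f}")
    print(f"OCR:                     ${ocr_cost(pages):.2f}")
```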