datalab-to/marker:029e8d68 | Run with an API on Replicate

Input schema

The fields you can use to run this model with an API. If you don’t give a value for a field its default value will be used.

Field	Type	Default value	Description
mode	string	fast	Processing mode affecting speed and quality. One of 'fast', 'balanced', or 'accurate'. 'fast': lowest latency, preserves most positional information. 'balanced': same as using use_llm. 'accurate': highest quality, slowest, preserves least positional information
file	string		Input file. Must be one of: .pdf, .doc, .docx, .ppt, .pptx, .png, .jpg, .jpeg, .webp
use_llm	boolean	False	Use an LLM to significantly improve accuracy for tables, forms, inline math, and layout detection. This merges tables across pages, handles complex layouts, and extracts form values. Will increase processing time
paginate	boolean	False	Add page separators to the output. Each page will be separated by a horizontal rule containing the page number in the format: \n\n{PAGE_NUMBER}\n{48 dashes}\n\n
force_ocr	boolean	False	Force OCR on all pages even if text is extractable. By default, Marker automatically uses OCR only when needed (e.g., scanned PDFs). Enable this if you see garbled or incorrect text in the output
max_pages	integer		Maximum number of pages to process (minimum: 1). Cannot be specified if page_range is set; these parameters are mutually exclusive
page_range	string		Page range to parse, comma separated like 0,5-10,20. Example: '0,2-4' will process pages 0, 2, 3, and 4. Cannot be specified if max_pages is set - these parameters are mutually exclusive
skip_cache	boolean	False	Bypass the server-side cache and force re-processing. By default, identical requests are cached to save time and cost. Enable this to get fresh results
page_schema	string		Structured extraction: Provide a JSON Schema to extract specific fields from your document. When provided, the model extracts only the fields you define and returns them in the 'extraction_schema_json' output field (as a JSON string containing your extracted data plus citation fields showing which parts of the document were used). The 'markdown' and 'json_data' fields will still contain the full document conversion. Example: {"type":"object","properties":{"invoice_number":{"type":"string"},"total":{"type":"number"}}}. See: https://documentation.datalab.to/docs/recipes/structured-extraction/api-overview. Increases cost by 50%
format_lines	boolean	False	Detect and format inline mathematical expressions and text styles (bold, italic, etc.) in the output. Useful for documents with mathematical notation
save_checkpoint	boolean	False	Save processing checkpoint for iterative refinement. Checkpoints can be used with the Marker Prompt API to apply custom rules without re-parsing the entire document. Only useful for advanced workflows
disable_ocr_math	boolean	False	Disable recognition of inline mathematical expressions during OCR. By default, math expressions are detected and can be formatted as LaTeX
include_metadata	boolean	False	Include detailed metadata and JSON structure in the output. When enabled, returns json_data (hierarchical document structure with bounding boxes) and metadata (page stats, table of contents). When disabled (default), only returns markdown to reduce response size
additional_config	string		Advanced configuration options as JSON string. Options include: 'disable_links' (remove hyperlinks), 'keep_pageheader_in_output' (preserve headers), 'keep_pagefooter_in_output' (preserve footers), 'filter_blank_pages' (skip empty pages), 'drop_repeated_text' (remove duplicates), and layout/table processing thresholds. Full list at: https://documentation.datalab.to/api-reference/marker
strip_existing_ocr	boolean	False	Remove embedded OCR text layer from the PDF and re-run OCR from scratch. Some PDFs have low-quality embedded OCR text; this option lets you regenerate it. Ignored if force_ocr is enabled
segmentation_schema	string		JSON Schema for document segmentation. Define segment names and descriptions to identify and extract different sections of the document (e.g., 'Executive Summary', 'Financial Data'). Useful for splitting long documents by section. See: https://documentation.datalab.to/api-reference/marker
block_correction_prompt	string		Optional text prompt to guide output improvements. Use this to specify formatting preferences or extraction requirements, e.g., 'Extract all dates in YYYY-MM-DD format' or 'Keep all tables in their original structure'
disable_image_extraction	boolean	False	Skip extracting images from the PDF. By default, images are extracted and returned as base64-encoded data in the images field
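
Several of the fields above interact: max_pages and page_range are mutually exclusive, and page_schema and additional_config are JSON passed as strings. The following is a minimal sketch of assembling an input payload under those rules. The helper names (expand_page_range, build_marker_input) and the example file URL are hypothetical and not part of the model's API; only the field names come from the table.

```python
import json


def expand_page_range(page_range: str) -> list[int]:
    """Expand a page_range string such as '0,2-4' into [0, 2, 3, 4]."""
    pages: list[int] = []
    for part in page_range.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            if end < start:
                raise ValueError(f"descending range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))  # raises ValueError on malformed input
    return pages


def build_marker_input(file, *, max_pages=None, page_range=None,
                       page_schema=None, additional_config=None, **flags):
    """Assemble an input dict (hypothetical helper, not a library function).

    Extra keyword arguments (e.g. use_llm=True, mode="balanced") are passed
    through unchanged; omitted fields fall back to the model's defaults.
    """
    if max_pages is not None and page_range is not None:
        raise ValueError("max_pages and page_range are mutually exclusive")
    if max_pages is not None and max_pages < 1:
        raise ValueError("max_pages must be at least 1")

    payload = {"file": file, **flags}
    if max_pages is not None:
        payload["max_pages"] = max_pages
    if page_range is not None:
        expand_page_range(page_range)  # local sanity check only
        payload["page_range"] = page_range
    # Both of these fields are typed as strings, so JSON-encode dicts.
    if page_schema is not None:
        payload["page_schema"] = json.dumps(page_schema)
    if additional_config is not None:
        payload["additional_config"] = json.dumps(additional_config)
    return payload


# Example: structured extraction over pages 0, 2, 3, and 4.
# The URL is a placeholder.
example = build_marker_input(
    "https://example.com/invoice.pdf",
    page_range="0,2-4",
    use_llm=True,
    page_schema={
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "total": {"type": "number"},
        },
    },
)
print(json.dumps(example, indent=2))
```

The resulting dict is intended to be passed as the input of an API call for this model version. Validating locally is a design choice that surfaces the mutual-exclusivity error before a request is sent.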

Output schema

The shape of the response you’ll get when you run this model with an API.

Schema

{'properties': {'extraction_schema_json': {'nullable': True,
                                           'title': 'Extraction Schema Json',
                                           'type': 'string'},
                'images': {'items': {'anyOf': [],
                                     'format': 'uri',
                                     'type': 'string'},
                           'nullable': True,
                           'title': 'Images',
                           'type': 'array'},
                'json_data': {'additionalProperties': True,
                              'nullable': True,
                              'title': 'Json Data',
                              'type': 'object'},
                'markdown': {'nullable': True,
                             'title': 'Markdown',
                             'type': 'string'},
                'metadata': {'additionalProperties': True,
                             'nullable': True,
                             'title': 'Metadata',
                             'type': 'object'},
                'page_count': {'title': 'Page Count', 'type': 'integer'}},
 'required': ['page_count'],
 'title': 'MarkerOutput',
 'type': 'object'}
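
Because only page_count is required and every other field is nullable, client code should tolerate missing or null values. The sketch below shows one way to normalize a response, assuming it has already been received as a Python dict. The function name normalize_marker_output and the sample response are hypothetical and for illustration only.

```python
import json


def normalize_marker_output(output: dict) -> dict:
    """Normalize a MarkerOutput-shaped dict (hypothetical helper)."""
    if "page_count" not in output:
        raise KeyError("page_count is the one required output field")

    # extraction_schema_json is a JSON *string*, populated only when
    # page_schema was supplied in the input; decode it when present.
    raw_extraction = output.get("extraction_schema_json")
    extraction = json.loads(raw_extraction) if raw_extraction else None

    return {
        "page_count": output["page_count"],
        "markdown": output.get("markdown") or "",
        "images": output.get("images") or [],
        # json_data and metadata are only returned when include_metadata
        # is enabled, so they may be absent or null.
        "json_data": output.get("json_data"),
        "metadata": output.get("metadata"),
        "extraction": extraction,
    }


# Made-up sample response for illustration; not real model output.
sample = {
    "page_count": 2,
    "markdown": "# Invoice\n\nTotal due: 120.50",
    "images": None,
    "extraction_schema_json": '{"invoice_number": "A-17", "total": 120.5}',
}
result = normalize_marker_output(sample)
print(result["page_count"], result["extraction"])
```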