
datalab-to/marker:380a8e62

Input schema

These are the fields you can set when running this model through the API. If you omit a field, its default value is used.
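The fallback to defaults can be pictured as a dictionary merge. The following is a minimal sketch, not the service's implementation: `DEFAULTS` copies the default values listed in the table below, `effective_input` is a hypothetical helper, and the `file` URL is a placeholder.

```python
# Defaults as listed in the input table; fields with no listed default are left out.
DEFAULTS = {
    "mode": "fast",
    "use_llm": False,
    "paginate": False,
    "force_ocr": False,
    "skip_cache": False,
    "format_lines": False,
    "save_checkpoint": False,
    "disable_ocr_math": False,
    "strip_existing_ocr": False,
    "disable_image_extraction": False,
}


def effective_input(user_input: dict) -> dict:
    """Return the table defaults overridden by whatever the caller supplied."""
    return {**DEFAULTS, **user_input}


# Only `file` and `paginate` are given; every other field falls back to its default.
resolved = effective_input({"file": "https://example.com/report.pdf", "paginate": True})
```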

| Field | Type | Default value | Description |
|---|---|---|---|
| mode | string (one of 'fast', 'balanced', 'accurate') | fast | Processing mode affecting speed and quality. 'fast': lowest latency, preserves the most positional information. 'balanced': same as enabling use_llm. 'accurate': highest quality, slowest, preserves the least positional information. |
| file | string | | Input file. Must be one of: .pdf, .doc, .docx, .ppt, .pptx, .png, .jpg, .jpeg, .webp. |
| use_llm | boolean | False | Use an LLM to significantly improve accuracy for tables, forms, inline math, and layout detection. This merges tables across pages, handles complex layouts, and extracts form values. Increases processing time. |
| paginate | boolean | False | Add page separators to the output. Pages are separated by a horizontal rule containing the page number, in the format: \n\n{PAGE_NUMBER}\n{48 dashes}\n\n |
| force_ocr | boolean | False | Force OCR on all pages even if text is extractable. By default, Marker uses OCR only when needed (e.g., scanned PDFs). Enable this if you see garbled or incorrect text in the output. |
| max_pages | integer (min: 1) | | Maximum number of pages to process. Cannot be combined with page_range; the two parameters are mutually exclusive. |
| page_range | string | | Pages to parse, as a comma-separated list of page numbers and ranges such as 0,5-10,20. Example: '0,2-4' processes pages 0, 2, 3, and 4. Cannot be combined with max_pages; the two parameters are mutually exclusive. |
| skip_cache | boolean | False | Bypass the server-side cache and force re-processing. By default, identical requests are cached to save time and cost. Enable this to get fresh results. |
| page_schema | string | | Structured extraction: provide a JSON Schema to extract specific fields from your document. When provided, the model extracts only the fields you define and returns them in the 'extraction_schema_json' output field (a JSON string containing your extracted data plus citation fields showing which parts of the document were used). The 'markdown' and 'json_data' fields will still contain the full document conversion. Example: {"type":"object","properties":{"invoice_number":{"type":"string"},"total":{"type":"number"}}}. See: https://documentation.datalab.to/docs/recipes/structured-extraction/api-overview. Increases cost by 50%. |
| format_lines | boolean | False | Detect and format inline mathematical expressions and text styles (bold, italic, etc.) in the output. Useful for documents with mathematical notation. |
| save_checkpoint | boolean | False | Save a processing checkpoint for iterative refinement. Checkpoints can be used with the Marker Prompt API to apply custom rules without re-parsing the entire document. Only useful for advanced workflows. |
| disable_ocr_math | boolean | False | Disable recognition of inline mathematical expressions during OCR. By default, math expressions are detected and can be formatted as LaTeX. |
| additional_config | string | | Advanced configuration options as a JSON string. Options include: 'disable_links' (remove hyperlinks), 'keep_pageheader_in_output' (preserve headers), 'keep_pagefooter_in_output' (preserve footers), 'filter_blank_pages' (skip empty pages), 'drop_repeated_text' (remove duplicates), and layout/table processing thresholds. Full list at: https://documentation.datalab.to/api-reference/marker |
| strip_existing_ocr | boolean | False | Remove the embedded OCR text layer from the PDF and re-run OCR from scratch. Some PDFs have low-quality embedded OCR text; this option lets you regenerate it. Ignored if force_ocr is enabled. |
| segmentation_schema | string | | JSON Schema for document segmentation. Define segment names and descriptions to identify and extract different sections of the document (e.g., 'Executive Summary', 'Financial Data'). Useful for splitting long documents by section. See: https://documentation.datalab.to/api-reference/marker |
| block_correction_prompt | string | | Optional text prompt to guide output improvements. Use this to specify formatting preferences or extraction requirements, e.g., 'Extract all dates in YYYY-MM-DD format' or 'Keep all tables in their original structure'. |
| disable_image_extraction | boolean | False | Skip extracting images from the PDF. By default, images are extracted and returned as base64-encoded data in the images field. |
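
Several of these fields have constraints that are easy to get wrong: max_pages and page_range are mutually exclusive, and page_schema and additional_config are JSON passed as strings. The sketch below shows one way to assemble an input payload on the client side under those rules. `parse_page_range` and `build_input` are hypothetical helpers, the `file` URL is a placeholder, and the boolean value given to 'disable_links' is an assumption; the full option list is in the linked documentation.

```python
import json


def parse_page_range(spec: str) -> list[int]:
    """Expand a page_range string such as '0,2-4' into [0, 2, 3, 4]."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            if end < start:
                raise ValueError(f"descending range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


def build_input(file, *, max_pages=None, page_range=None,
                page_schema=None, additional_config=None, **flags):
    """Assemble an input dict, enforcing the constraints stated in the table."""
    if max_pages is not None and page_range is not None:
        raise ValueError("max_pages and page_range are mutually exclusive")
    if max_pages is not None and max_pages < 1:
        raise ValueError("max_pages must be at least 1")

    payload = {"file": file, **flags}
    if max_pages is not None:
        payload["max_pages"] = max_pages
    if page_range is not None:
        parse_page_range(page_range)  # raises ValueError if malformed
        payload["page_range"] = page_range
    # Both of these fields are typed as string, so serialize dicts to JSON text.
    if page_schema is not None:
        payload["page_schema"] = json.dumps(page_schema)
    if additional_config is not None:
        payload["additional_config"] = json.dumps(additional_config)
    return payload


payload = build_input(
    "https://example.com/invoice.pdf",  # placeholder URL
    page_range="0,2-4",
    page_schema={
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "total": {"type": "number"},
        },
    },
    additional_config={"disable_links": True},  # value type is an assumption
    use_llm=True,
)
```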

Output schema

This is the shape of the response returned when you run this model through the API.

Schema
{'description': 'Output from Marker document conversion.\n'
                '\n'
                'Field population:\n'
                '- markdown: Populated with clean markdown text (tables, '
                'equations, etc.) - typically available unless page_schema is '
                'used\n'
                '- json_data: Always populated with hierarchical document '
                'structure (block types, bounding boxes, positions, etc.)\n'
                '- metadata: Always populated with document metadata (page '
                'stats, table of contents, etc.)\n'
                '- page_count: Always populated with number of pages '
                'processed\n'
                '- images: Populated with extracted images as base64-encoded '
                'files (unless disable_image_extraction=True)\n'
                '- extraction_schema_json: Only populated when page_schema '
                'input is provided. Contains your extracted structured data as '
                'a JSON string with citation fields showing which document '
                'blocks were used for each field',
 'properties': {'extraction_schema_json': {'nullable': True,
                                           'title': 'Extraction Schema Json',
                                           'type': 'string'},
                'images': {'items': {'anyOf': [],
                                     'format': 'uri',
                                     'type': 'string'},
                           'nullable': True,
                           'title': 'Images',
                           'type': 'array'},
                'json_data': {'additionalProperties': True,
                              'nullable': True,
                              'title': 'Json Data',
                              'type': 'object'},
                'markdown': {'nullable': True,
                             'title': 'Markdown',
                             'type': 'string'},
                'metadata': {'additionalProperties': True,
                             'nullable': True,
                             'title': 'Metadata',
                             'type': 'object'},
                'page_count': {'title': 'Page Count', 'type': 'integer'}},
 'required': ['page_count'],
 'title': 'MarkerOutput',
 'type': 'object'}
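
As a rough illustration of consuming this output, the sketch below decodes extraction_schema_json and splits paginated markdown on the separator format documented for the paginate input. It assumes each separator precedes the page it numbers, which the schema does not state; `split_pages` and `read_output` are hypothetical helpers, and `sample` is hand-written data, not real model output.

```python
import json
import re

# Separator documented for paginate=True: \n\n{PAGE_NUMBER}\n{48 dashes}\n\n
PAGE_SEPARATOR = re.compile(r"(?:\A|\n\n)(\d+)\n-{48}\n\n")


def split_pages(markdown: str) -> dict[int, str]:
    """Map page number to page text, assuming each separator precedes its page."""
    parts = PAGE_SEPARATOR.split(markdown)
    # parts = [text before first separator, number, text, number, text, ...]
    return {int(num): text.strip() for num, text in zip(parts[1::2], parts[2::2])}


def read_output(output: dict) -> dict:
    """Pull out the commonly used fields; only page_count is required."""
    raw = output.get("extraction_schema_json")
    return {
        "page_count": output["page_count"],
        "markdown": output.get("markdown") or "",
        # extraction_schema_json is a JSON string, present only with page_schema.
        "extracted": json.loads(raw) if raw else None,
    }


# Hand-written sample shaped like MarkerOutput, for illustration only.
sample = {
    "page_count": 2,
    "markdown": "\n\n0\n" + "-" * 48 + "\n\nFirst page\n\n1\n" + "-" * 48 + "\n\nSecond page",
    "extraction_schema_json": '{"invoice_number": "A-17", "total": 42.5}',
}

result = read_output(sample)
pages = split_pages(result["markdown"])
```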