datalab-to/marker:029e8d68
Input schema
The fields you can use to run this model with an API. If you don’t give a value for a field, its default value is used.
| Field | Type | Default value | Description |
|---|---|---|---|
| mode | None | fast | Processing mode affecting speed and quality. 'fast': lowest latency, preserves most positional information. 'balanced': same as using use_llm. 'accurate': highest quality, slowest, preserves least positional information. |
| file | string | | Input file. Must be one of: .pdf, .doc, .docx, .ppt, .pptx, .png, .jpg, .jpeg, .webp. |
| use_llm | boolean | False | Use an LLM to significantly improve accuracy for tables, forms, inline math, and layout detection. This merges tables across pages, handles complex layouts, and extracts form values. Will increase processing time. |
| paginate | boolean | False | Add page separators to the output. Each page will be separated by a horizontal rule containing the page number in the format: `\n\n{PAGE_NUMBER}\n{48 dashes}\n\n`. |
| force_ocr | boolean | False | Force OCR on all pages even if text is extractable. By default, Marker automatically uses OCR only when needed (e.g., scanned PDFs). Enable this if you see garbled or incorrect text in the output. |
| max_pages | integer (min: 1) | | Maximum number of pages to process. Cannot be specified if page_range is set; these parameters are mutually exclusive. |
| page_range | string | | Page range to parse, comma separated like 0,5-10,20. Example: '0,2-4' will process pages 0, 2, 3, and 4. Cannot be specified if max_pages is set; these parameters are mutually exclusive. |
| skip_cache | boolean | False | Bypass the server-side cache and force re-processing. By default, identical requests are cached to save time and cost. Enable this to get fresh results. |
| page_schema | string | | Structured extraction: provide a JSON Schema to extract specific fields from your document. When provided, the model extracts only the fields you define and returns them in the 'extraction_schema_json' output field (as a JSON string containing your extracted data plus citation fields showing which parts of the document were used). The 'markdown' and 'json_data' fields will still contain the full document conversion. Example: `{"type":"object","properties":{"invoice_number":{"type":"string"},"total":{"type":"number"}}}`. See: https://documentation.datalab.to/docs/recipes/structured-extraction/api-overview. Increases cost by 50%. |
| format_lines | boolean | False | Detect and format inline mathematical expressions and text styles (bold, italic, etc.) in the output. Useful for documents with mathematical notation. |
| save_checkpoint | boolean | False | Save processing checkpoint for iterative refinement. Checkpoints can be used with the Marker Prompt API to apply custom rules without re-parsing the entire document. Only useful for advanced workflows. |
| disable_ocr_math | boolean | False | Disable recognition of inline mathematical expressions during OCR. By default, math expressions are detected and can be formatted as LaTeX. |
| include_metadata | boolean | False | Include detailed metadata and JSON structure in the output. When enabled, returns json_data (hierarchical document structure with bounding boxes) and metadata (page stats, table of contents). When disabled (default), only returns markdown to reduce response size. |
| additional_config | string | | Advanced configuration options as JSON string. Options include: 'disable_links' (remove hyperlinks), 'keep_pageheader_in_output' (preserve headers), 'keep_pagefooter_in_output' (preserve footers), 'filter_blank_pages' (skip empty pages), 'drop_repeated_text' (remove duplicates), and layout/table processing thresholds. Full list at: https://documentation.datalab.to/api-reference/marker |
| strip_existing_ocr | boolean | False | Remove embedded OCR text layer from the PDF and re-run OCR from scratch. Some PDFs have low-quality embedded OCR text; this option lets you regenerate it. Ignored if force_ocr is enabled. |
| segmentation_schema | string | | JSON Schema for document segmentation. Define segment names and descriptions to identify and extract different sections of the document (e.g., 'Executive Summary', 'Financial Data'). Useful for splitting long documents by section. See: https://documentation.datalab.to/api-reference/marker |
| block_correction_prompt | string | | Optional text prompt to guide output improvements. Use this to specify formatting preferences or extraction requirements, e.g., 'Extract all dates in YYYY-MM-DD format' or 'Keep all tables in their original structure'. |
| disable_image_extraction | boolean | False | Skip extracting images from the PDF. By default, images are extracted and returned as base64-encoded data in the images field. |
Output schema
The shape of the response you’ll get when you run this model with an API.
Schema
{'properties': {'extraction_schema_json': {'nullable': True,
                                           'title': 'Extraction Schema Json',
                                           'type': 'string'},
                'images': {'items': {'anyOf': [],
                                     'format': 'uri',
                                     'type': 'string'},
                           'nullable': True,
                           'title': 'Images',
                           'type': 'array'},
                'json_data': {'additionalProperties': True,
                              'nullable': True,
                              'title': 'Json Data',
                              'type': 'object'},
                'markdown': {'nullable': True,
                             'title': 'Markdown',
                             'type': 'string'},
                'metadata': {'additionalProperties': True,
                             'nullable': True,
                             'title': 'Metadata',
                             'type': 'object'},
                'page_count': {'title': 'Page Count', 'type': 'integer'}},
 'required': ['page_count'],
 'title': 'MarkerOutput',
 'type': 'object'}
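In this schema only page_count is required; every other field is nullable, and extraction_schema_json is a JSON document carried as a string. As a rough sketch of how a caller might normalize such a response, the code below assumes the response has already been decoded into a Python dict. `MarkerResult` and `parse_marker_output` are hypothetical names introduced for illustration, not part of any client library.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class MarkerResult:
    """Normalized view of a MarkerOutput-shaped dict (hypothetical type)."""
    page_count: int
    markdown: str = ""
    images: list = field(default_factory=list)
    json_data: Optional[dict] = None
    metadata: Optional[dict] = None
    extraction: Optional[Any] = None  # decoded extraction_schema_json


def parse_marker_output(output: dict) -> MarkerResult:
    # page_count is the only required key in the schema.
    if "page_count" not in output:
        raise KeyError("page_count is required by the output schema")

    # extraction_schema_json is a JSON string; decode it when present.
    raw = output.get("extraction_schema_json")
    return MarkerResult(
        page_count=int(output["page_count"]),
        markdown=output.get("markdown") or "",      # null becomes ""
        images=list(output.get("images") or []),    # null becomes []
        json_data=output.get("json_data"),
        metadata=output.get("metadata"),
        extraction=json.loads(raw) if raw else None,
    )
```

Replacing null markdown and images with empty values keeps downstream code free of repeated None checks, while json_data and metadata stay None so a caller can tell that include_metadata was not enabled.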