datalab-to/marker:380a8e62
Input schema
The fields you can use to run this model with an API. If you don't give a value for a field, its default value will be used.
| Field | Type | Default value | Description |
|---|---|---|---|
| mode | None | fast | Processing mode affecting speed and quality. 'fast': lowest latency, preserves most positional information. 'balanced': same as using use_llm. 'accurate': highest quality, slowest, preserves least positional information. |
| file | string | | Input file. Must be one of: .pdf, .doc, .docx, .ppt, .pptx, .png, .jpg, .jpeg, .webp. |
| use_llm | boolean | False | Use an LLM to significantly improve accuracy for tables, forms, inline math, and layout detection. This merges tables across pages, handles complex layouts, and extracts form values. Will increase processing time. |
| paginate | boolean | False | Add page separators to the output. Each page will be separated by a horizontal rule containing the page number in the format: `\n\n{PAGE_NUMBER}\n{48 dashes}\n\n`. |
| force_ocr | boolean | False | Force OCR on all pages even if text is extractable. By default, Marker automatically uses OCR only when needed (e.g., scanned PDFs). Enable this if you see garbled or incorrect text in the output. |
| max_pages | integer | | Maximum number of pages to process (min: 1). Cannot be specified if page_range is set; these parameters are mutually exclusive. |
| page_range | string | | Page range to parse, comma separated like 0,5-10,20. Example: '0,2-4' will process pages 0, 2, 3, and 4. Cannot be specified if max_pages is set; these parameters are mutually exclusive. |
| skip_cache | boolean | False | Bypass the server-side cache and force re-processing. By default, identical requests are cached to save time and cost. Enable this to get fresh results. |
| page_schema | string | | Structured extraction: provide a JSON Schema to extract specific fields from your document. When provided, the model extracts only the fields you define and returns them in the 'extraction_schema_json' output field (as a JSON string containing your extracted data plus citation fields showing which parts of the document were used). The 'markdown' and 'json_data' fields will still contain the full document conversion. Example: `{"type":"object","properties":{"invoice_number":{"type":"string"},"total":{"type":"number"}}}`. See: https://documentation.datalab.to/docs/recipes/structured-extraction/api-overview. Increases cost by 50%. |
| format_lines | boolean | False | Detect and format inline mathematical expressions and text styles (bold, italic, etc.) in the output. Useful for documents with mathematical notation. |
| save_checkpoint | boolean | False | Save processing checkpoint for iterative refinement. Checkpoints can be used with the Marker Prompt API to apply custom rules without re-parsing the entire document. Only useful for advanced workflows. |
| disable_ocr_math | boolean | False | Disable recognition of inline mathematical expressions during OCR. By default, math expressions are detected and can be formatted as LaTeX. |
| additional_config | string | | Advanced configuration options as JSON string. Options include: 'disable_links' (remove hyperlinks), 'keep_pageheader_in_output' (preserve headers), 'keep_pagefooter_in_output' (preserve footers), 'filter_blank_pages' (skip empty pages), 'drop_repeated_text' (remove duplicates), and layout/table processing thresholds. Full list at: https://documentation.datalab.to/api-reference/marker |
| strip_existing_ocr | boolean | False | Remove embedded OCR text layer from the PDF and re-run OCR from scratch. Some PDFs have low-quality embedded OCR text; this option lets you regenerate it. Ignored if force_ocr is enabled. |
| segmentation_schema | string | | JSON Schema for document segmentation. Define segment names and descriptions to identify and extract different sections of the document (e.g., 'Executive Summary', 'Financial Data'). Useful for splitting long documents by section. See: https://documentation.datalab.to/api-reference/marker |
| block_correction_prompt | string | | Optional text prompt to guide output improvements. Use this to specify formatting preferences or extraction requirements, e.g., 'Extract all dates in YYYY-MM-DD format' or 'Keep all tables in their original structure'. |
| disable_image_extraction | boolean | False | Skip extracting images from the PDF. By default, images are extracted and returned as base64-encoded data in the images field. |
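The constraints in the table above (the three `mode` values, the mutual exclusion of `max_pages` and `page_range`, and `page_schema` being passed as a JSON string) can be checked on the client before a request is sent. The following is a minimal sketch, not part of the model's API: `build_marker_input` and `expand_page_range` are hypothetical helper names, the file URL is a placeholder, and the code only assembles an input dictionary without calling any service.

```python
import json
from typing import Any, Optional

# Mode values listed in the 'mode' field description.
VALID_MODES = {"fast", "balanced", "accurate"}


def expand_page_range(page_range: str) -> list[int]:
    """Expand a range string such as '0,2-4' into [0, 2, 3, 4]."""
    pages: list[int] = []
    for part in page_range.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            if end < start:
                raise ValueError(f"descending range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


def build_marker_input(
    file: str,
    mode: str = "fast",
    max_pages: Optional[int] = None,
    page_range: Optional[str] = None,
    page_schema: Optional[dict[str, Any]] = None,
    **flags: bool,
) -> dict[str, Any]:
    """Assemble an input dict, enforcing the rules stated in the field table.

    Hypothetical helper: boolean fields (paginate, use_llm, force_ocr, ...)
    are passed through as keyword flags without further checks.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    if max_pages is not None and page_range is not None:
        raise ValueError("max_pages and page_range are mutually exclusive")
    if max_pages is not None and max_pages < 1:
        raise ValueError("max_pages must be at least 1")

    payload: dict[str, Any] = {"file": file, "mode": mode, **flags}
    if max_pages is not None:
        payload["max_pages"] = max_pages
    if page_range is not None:
        expand_page_range(page_range)  # raises ValueError if malformed
        payload["page_range"] = page_range
    if page_schema is not None:
        # The field type is 'string', so the schema is serialized to JSON.
        payload["page_schema"] = json.dumps(page_schema)
    return payload


# Example: process pages 0, 2, 3 and 4 with page separators enabled.
example_input = build_marker_input(
    "https://example.com/invoice.pdf",  # placeholder URL
    page_range="0,2-4",
    paginate=True,
)
```

Fields left out of the dictionary fall back to the defaults in the table, so only the values that differ from those defaults need to be passed.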
Output schema
The shape of the response you’ll get when you run this model with an API.
Schema
{'description': 'Output from Marker document conversion.\n'
                '\n'
                'Field population:\n'
                '- markdown: Populated with clean markdown text (tables, '
                'equations, etc.) - typically available unless page_schema is '
                'used\n'
                '- json_data: Always populated with hierarchical document '
                'structure (block types, bounding boxes, positions, etc.)\n'
                '- metadata: Always populated with document metadata (page '
                'stats, table of contents, etc.)\n'
                '- page_count: Always populated with number of pages '
                'processed\n'
                '- images: Populated with extracted images as base64-encoded '
                'files (unless disable_image_extraction=True)\n'
                '- extraction_schema_json: Only populated when page_schema '
                'input is provided. Contains your extracted structured data as '
                'a JSON string with citation fields showing which document '
                'blocks were used for each field',
 'properties': {'extraction_schema_json': {'nullable': True,
                                           'title': 'Extraction Schema Json',
                                           'type': 'string'},
                'images': {'items': {'anyOf': [],
                                     'format': 'uri',
                                     'type': 'string'},
                           'nullable': True,
                           'title': 'Images',
                           'type': 'array'},
                'json_data': {'additionalProperties': True,
                              'nullable': True,
                              'title': 'Json Data',
                              'type': 'object'},
                'markdown': {'nullable': True,
                             'title': 'Markdown',
                             'type': 'string'},
                'metadata': {'additionalProperties': True,
                             'nullable': True,
                             'title': 'Metadata',
                             'type': 'object'},
                'page_count': {'title': 'Page Count', 'type': 'integer'}},
 'required': ['page_count'],
 'title': 'MarkerOutput',
 'type': 'object'}
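Since only `page_count` is required and every other property is nullable, client code should treat the remaining fields as optional. The sketch below shows one way to read such a response, assuming it has already been loaded into a Python dict; `read_output` and `split_pages` are hypothetical helper names. The separator pattern is also an assumption: the `paginate` description gives the format as `\n\n{PAGE_NUMBER}\n{48 dashes}\n\n` but does not settle whether the braces are literal or whether a newline precedes the dashes, so the pattern tolerates both, and it assumes each separator precedes the content of its page. Check it against real output before relying on it.

```python
import json
import re
from typing import Any

# Assumed separator shape; accepts '0\n----...' as well as '{0}----...'.
PAGE_SEPARATOR = re.compile(r"\n\n\{?(\d+)\}?\n?-{48}\n\n")


def split_pages(markdown: str) -> dict[int, str]:
    """Split paginated markdown into {page_number: page_text}."""
    parts = PAGE_SEPARATOR.split(markdown)
    # With one capture group, split() returns
    # [text before first separator, number, text, number, text, ...].
    pages: dict[int, str] = {}
    for number, text in zip(parts[1::2], parts[2::2]):
        pages[int(number)] = text.strip()
    return pages


def read_output(output: dict[str, Any]) -> dict[str, Any]:
    """Normalize a MarkerOutput-shaped dict, tolerating null fields."""
    extraction_raw = output.get("extraction_schema_json")
    return {
        "page_count": output["page_count"],  # the only required field
        "markdown": output.get("markdown") or "",
        # extraction_schema_json is a JSON string, so it is decoded here.
        "extraction": json.loads(extraction_raw) if extraction_raw else None,
    }
```

Decoding `extraction_schema_json` in one place keeps the rest of the client working with plain dictionaries, and returning an empty string for a null `markdown` avoids `None` checks when the result is passed on to `split_pages`.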