datalab-to/marker:efee5c51

Input schema

The fields you can use to run this model with an API. If you don’t give a value for a field, its default value is used.

| Field | Type | Default value | Description |
| --- | --- | --- | --- |
| mode | None | fast | Output mode: fast (lowest latency), balanced, or accurate (slowest) |
| file | string | | Input file. Must be one of: .pdf, .doc, .docx, .ppt, .pptx, .png, .jpg, .jpeg, .webp |
| use_llm | boolean | False | Significantly improves accuracy by using an LLM. Will increase latency |
| paginate | boolean | False | Whether to paginate the output. Each page will be separated by horizontal rules |
| force_ocr | boolean | False | Force OCR on all pages of the PDF |
| max_pages | integer | | Min: 1. Maximum number of pages to process. Cannot be specified if page_range is set - these parameters are mutually exclusive |
| page_range | string | | Page range to parse, comma separated like 0,5-10,20. Example: '0,2-4' will process pages 0, 2, 3, and 4. Cannot be specified if max_pages is set - these parameters are mutually exclusive |
| skip_cache | boolean | False | Skip the cache and re-run inference |
| page_schema | string | | Schema for structured extraction (JSON string of Pydantic schema) |
| format_lines | boolean | False | Format the lines in the output to detect inline math and styles |
| output_format | string | markdown | Output format for the text. Can be 'json', 'html', 'markdown', or 'chunks'. You can comma separate multiple formats |
| save_checkpoint | boolean | False | Save checkpoint after processing |
| disable_ocr_math | boolean | False | Disable inline math recognition in OCR |
| additional_config | string | | Additional configuration options as JSON string. Supports keys like 'disable_links', 'keep_pageheader_in_output', 'keep_pagefooter_in_output', 'filter_blank_pages', 'drop_repeated_text', 'layout_coverage_threshold', 'merge_threshold', 'height_tolerance', 'gap_threshold', 'image_threshold', 'min_line_length', 'level_count', 'default_level', 'no_merge_tables_across_pages', 'force_layout_block'. See full documentation at https://documentation.datalab.to/api-reference/marker |
| strip_existing_ocr | boolean | False | Strip existing OCR text from the PDF and re-run OCR |
| segmentation_schema | string | | Schema for document segmentation (JSON string with segment names and descriptions) |
| block_correction_prompt | string | | Optional prompt to improve output alignment to specific requirements |
| disable_image_extraction | boolean | False | Disable image extraction from the PDF |
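
The constraints listed above (allowed modes and output formats, the mutual exclusion of max_pages and page_range, and JSON-string encoding for additional_config) can be checked on the client before a request is sent. The following is a minimal sketch under assumptions: `build_marker_input` and `expand_page_range` are hypothetical helper names that are not part of the model's API, the file value is assumed to be a URL string, and the service may validate inputs differently.

```python
import json
from typing import Any, Optional

# Values taken from the field descriptions in the input schema.
ALLOWED_MODES = {"fast", "balanced", "accurate"}
ALLOWED_FORMATS = {"json", "html", "markdown", "chunks"}


def expand_page_range(page_range: str) -> list[int]:
    """Expand a string such as '0,2-4' into [0, 2, 3, 4] (hypothetical helper)."""
    pages: list[int] = []
    for part in page_range.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if end < start:
                raise ValueError(f"descending range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))  # int('') raises ValueError for empty parts
    return pages


def build_marker_input(
    file: str,
    *,
    mode: str = "fast",
    output_format: str = "markdown",
    max_pages: Optional[int] = None,
    page_range: Optional[str] = None,
    additional_config: Optional[dict[str, Any]] = None,
    **flags: bool,
) -> dict[str, Any]:
    """Assemble an input dict that respects the documented constraints."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    # output_format may list several formats separated by commas.
    formats = [f.strip() for f in output_format.split(",")]
    unknown = [f for f in formats if f not in ALLOWED_FORMATS]
    if unknown:
        raise ValueError(f"unknown output format(s): {unknown}")

    if max_pages is not None and page_range is not None:
        raise ValueError("max_pages and page_range are mutually exclusive")
    if max_pages is not None and max_pages < 1:
        raise ValueError("max_pages must be at least 1")

    payload: dict[str, Any] = {
        "file": file,
        "mode": mode,
        "output_format": ",".join(formats),
    }
    if max_pages is not None:
        payload["max_pages"] = max_pages
    if page_range is not None:
        expand_page_range(page_range)  # syntax check only; raises on bad input
        payload["page_range"] = page_range
    if additional_config is not None:
        # The schema expects a JSON string, not a nested object.
        payload["additional_config"] = json.dumps(additional_config)
    payload.update(flags)  # boolean fields such as use_llm, paginate, force_ocr
    return payload


example = build_marker_input(
    "https://example.com/sample.pdf",  # placeholder URL, not a real document
    page_range="0,2-4",
    output_format="markdown,json",
    additional_config={"disable_links": True},
    use_llm=True,
)
print(example)
```

A dict built this way could then be passed as the `input` argument of `replicate.run` in the Replicate Python client, together with the model's full version identifier.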

Output schema

The shape of the response you’ll get when you run this model with an API.

Schema
{'properties': {'chunks': {'items': {'additionalProperties': True,
                                     'type': 'object'},
                           'nullable': True,
                           'title': 'Chunks',
                           'type': 'array'},
                'html': {'nullable': True, 'title': 'Html', 'type': 'string'},
                'images': {'items': {'anyOf': [],
                                     'format': 'uri',
                                     'type': 'string'},
                           'nullable': True,
                           'title': 'Images',
                           'type': 'array'},
                'json_data': {'additionalProperties': True,
                              'nullable': True,
                              'title': 'Json Data',
                              'type': 'object'},
                'markdown': {'nullable': True,
                             'title': 'Markdown',
                             'type': 'string'},
                'metadata': {'additionalProperties': True,
                             'nullable': True,
                             'title': 'Metadata',
                             'type': 'object'},
                'page_count': {'title': 'Page Count', 'type': 'integer'}},
 'required': ['page_count'],
 'title': 'MarkerOutput',
 'type': 'object'}
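
Because only page_count is required and every other property is nullable, client code should not assume that any particular format field is present. The sketch below shows one way to inspect a response defensively; `summarize_marker_output` is a hypothetical helper and the sample dict is invented for illustration, not real model output.

```python
from typing import Any

# Properties of MarkerOutput that carry converted content, per the schema above.
CONTENT_FIELDS = ("markdown", "html", "json_data", "chunks")


def summarize_marker_output(output: dict[str, Any]) -> dict[str, Any]:
    """Report which nullable fields are populated in a MarkerOutput-shaped dict."""
    if "page_count" not in output:
        raise KeyError("page_count is the only required property and is missing")
    present = [name for name in CONTENT_FIELDS if output.get(name) is not None]
    images = output.get("images") or []  # images may be absent or null
    return {
        "page_count": output["page_count"],
        "formats": present,
        "image_count": len(images),
    }


# Invented sample shaped like the schema; values are placeholders.
sample = {"page_count": 2, "markdown": "# Title", "html": None, "images": None}
summary = summarize_marker_output(sample)
print(summary)  # {'page_count': 2, 'formats': ['markdown'], 'image_count': 0}
```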