datalab-to/marker:c00ae56f
Input schema
The fields you can use to run this model with an API. If you don’t provide a value for a field, its default value is used.
| Field | Type | Default value | Description |
|---|---|---|---|
| file | string | | Input file. Must be one of: .pdf, .doc, .docx, .ppt, .pptx, .png, .jpg, .jpeg, .webp |
| max_pages | integer (Min: 1) | | Maximum number of pages to process. Cannot be specified if page_range is set - these parameters are mutually exclusive |
| page_range | string | | Page range to parse, comma separated like 0,5-10,20. Example: '0,2-4' will process pages 0, 2, 3, and 4. Cannot be specified if max_pages is set - these parameters are mutually exclusive |
| force_ocr | boolean | False | Force OCR on all pages of the PDF |
| format_lines | boolean | False | Format the lines in the output to detect inline math and styles |
| paginate | boolean | False | Whether to paginate the output. Each page will be separated by horizontal rules |
| strip_existing_ocr | boolean | False | Strip existing OCR text from the PDF and re-run OCR |
| disable_image_extraction | boolean | False | Disable image extraction from the PDF |
| disable_ocr_math | boolean | False | Disable inline math recognition in OCR |
| use_llm | boolean | False | Significantly improves accuracy by using an LLM. Will increase latency |
| mode | None | fast | Output mode: fast (lowest latency), balanced, or accurate (slowest) |
| output_format | string | markdown | Output format for the text. Can be 'json', 'html', 'markdown', or 'chunks'. You can comma separate multiple formats |
| skip_cache | boolean | False | Skip the cache and re-run inference |
| save_checkpoint | boolean | False | Save checkpoint after processing |
| block_correction_prompt | string | | Optional prompt to improve output alignment to specific requirements |
| page_schema | string | | Schema for structured extraction (JSON string of Pydantic schema) |
| segmentation_schema | string | | Schema for document segmentation (JSON string with segment names and descriptions) |
| additional_config | string | | Additional configuration options as JSON string. Supports keys like 'disable_links', 'keep_pageheader_in_output', 'keep_pagefooter_in_output', 'filter_blank_pages', 'drop_repeated_text', 'layout_coverage_threshold', 'merge_threshold', 'height_tolerance', 'gap_threshold', 'image_threshold', 'min_line_length', 'level_count', 'default_level', 'no_merge_tables_across_pages', 'force_layout_block'. See full documentation at https://documentation.datalab.to/api-reference/marker |
| visualize_output | boolean | False | Generate visualization images showing detected text regions overlaid on original document for OCR debugging |
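
Several of the input fields have constraints that are easy to get wrong on the client side: `max_pages` and `page_range` are mutually exclusive, `page_range` uses a comma-separated range syntax, and `additional_config` must be passed as a JSON string rather than an object. The following is a minimal sketch of a client-side helper that checks these rules before a request is sent. The function names `parse_page_range` and `build_input` are hypothetical, passing the file as a URL is an assumption, and the boolean value used for `disable_links` is assumed rather than documented here.

```python
import json


def parse_page_range(spec: str) -> list[int]:
    """Expand a page_range spec such as '0,2-4' into explicit page numbers."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            if end < start:
                raise ValueError(f"descending range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))  # raises ValueError on empty or non-numeric parts
    return pages


def build_input(file, *, max_pages=None, page_range=None,
                additional_config=None, **options):
    """Assemble an input dict, enforcing the constraints from the table above.

    Hypothetical helper: it only validates and shapes the payload; sending it
    to the API is left to whichever client you use.
    """
    if max_pages is not None and page_range is not None:
        raise ValueError("max_pages and page_range are mutually exclusive")
    if max_pages is not None and max_pages < 1:
        raise ValueError("max_pages must be >= 1")

    payload = {"file": file, **options}
    if max_pages is not None:
        payload["max_pages"] = max_pages
    if page_range is not None:
        parse_page_range(page_range)  # validate the syntax; the API takes the string
        payload["page_range"] = page_range
    if additional_config is not None:
        # The schema expects a JSON *string*, not a nested object.
        payload["additional_config"] = json.dumps(additional_config)
    return payload


example = build_input(
    "https://example.com/report.pdf",      # assumed: file supplied as a URL
    page_range="0,2-4",
    output_format="markdown,json",
    additional_config={"disable_links": True},  # value type is an assumption
)
```

Validating locally avoids a round trip for requests that the mutual-exclusion rule would make invalid anyway. Fields left out of the payload fall back to the defaults listed in the table.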
Output schema
The shape of the response you’ll get when you run this model with an API.
Schema
{'properties': {'chunks': {'items': {'additionalProperties': True,
                                     'type': 'object'},
                           'nullable': True,
                           'title': 'Chunks',
                           'type': 'array'},
                'html': {'nullable': True, 'title': 'Html', 'type': 'string'},
                'images': {'items': {'anyOf': [],
                                     'format': 'uri',
                                     'type': 'string'},
                           'nullable': True,
                           'title': 'Images',
                           'type': 'array'},
                'json_data': {'additionalProperties': True,
                              'nullable': True,
                              'title': 'Json Data',
                              'type': 'object'},
                'markdown': {'nullable': True,
                             'title': 'Markdown',
                             'type': 'string'},
                'metadata': {'additionalProperties': True,
                             'nullable': True,
                             'title': 'Metadata',
                             'type': 'object'},
                'page_count': {'title': 'Page Count', 'type': 'integer'},
                'visualization_images': {'items': {'anyOf': [],
                                                   'format': 'uri',
                                                   'type': 'string'},
                                         'nullable': True,
                                         'title': 'Visualization Images',
                                         'type': 'array'}},
 'required': ['page_count'],
 'title': 'MarkerOutput',
 'type': 'object'}
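
In this schema only `page_count` is required; every other property is nullable, so which of `markdown`, `html`, `json_data`, and `chunks` is populated presumably depends on the `output_format` you requested. Below is a minimal sketch of how a client might load a response into a typed object while tolerating missing or null fields. The dataclass mirrors the schema's property names, but the class layout and the `from_response` helper are illustrative assumptions, not part of any official client.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class MarkerOutput:
    """Typed view of the output schema; only page_count is required."""
    page_count: int
    markdown: Optional[str] = None
    html: Optional[str] = None
    json_data: Optional[dict[str, Any]] = None
    chunks: Optional[list[dict[str, Any]]] = None
    metadata: Optional[dict[str, Any]] = None
    images: Optional[list[str]] = None                # URIs per the schema
    visualization_images: Optional[list[str]] = None  # URIs per the schema

    @classmethod
    def from_response(cls, data: dict[str, Any]) -> "MarkerOutput":
        """Hypothetical loader: checks the required field, ignores unknown keys."""
        if "page_count" not in data:
            raise ValueError("response is missing required field 'page_count'")
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


# Example with a hand-written response dict (not real model output).
result = MarkerOutput.from_response(
    {"page_count": 3, "markdown": "# Title", "html": None}
)
text = result.markdown if result.markdown is not None else ""
```

Callers should check each optional field for `None` before use rather than assuming the requested format is always present.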