datalab-to/marker:efee5c51 | Run with an API on Replicate

Input schema

The fields you can use to run this model with an API. If you don’t give a value for a field, its default value is used.

Field	Type	Default value	Description
mode	string	fast	Output mode. One of: fast (lowest latency), balanced, or accurate (slowest)
file	string		Input file. Must be one of: .pdf, .doc, .docx, .ppt, .pptx, .png, .jpg, .jpeg, .webp
use_llm	boolean	False	Significantly improves accuracy by using an LLM. Will increase latency
paginate	boolean	False	Whether to paginate the output. Each page will be separated by horizontal rules
force_ocr	boolean	False	Force OCR on all pages of the PDF
max_pages	integer		Maximum number of pages to process (minimum: 1). Mutually exclusive with page_range: cannot be specified if page_range is set
page_range	string		Page range to parse, comma-separated, like 0,5-10,20. Example: '0,2-4' will process pages 0, 2, 3, and 4. Mutually exclusive with max_pages: cannot be specified if max_pages is set
skip_cache	boolean	False	Skip the cache and re-run inference
page_schema	string		Schema for structured extraction (JSON string of Pydantic schema)
format_lines	boolean	False	Format the lines in the output to detect inline math and styles
output_format	string	markdown	Output format for the text. Can be 'json', 'html', 'markdown', or 'chunks'. Separate multiple formats with commas
save_checkpoint	boolean	False	Save checkpoint after processing
disable_ocr_math	boolean	False	Disable inline math recognition in OCR
additional_config	string		Additional configuration options as JSON string. Supports keys like 'disable_links', 'keep_pageheader_in_output', 'keep_pagefooter_in_output', 'filter_blank_pages', 'drop_repeated_text', 'layout_coverage_threshold', 'merge_threshold', 'height_tolerance', 'gap_threshold', 'image_threshold', 'min_line_length', 'level_count', 'default_level', 'no_merge_tables_across_pages', 'force_layout_block'. See full documentation at https://documentation.datalab.to/api-reference/marker
strip_existing_ocr	boolean	False	Strip existing OCR text from the PDF and re-run OCR
segmentation_schema	string		Schema for document segmentation (JSON string with segment names and descriptions)
block_correction_prompt	string		Optional prompt to improve output alignment to specific requirements
disable_image_extraction	boolean	False	Disable image extraction from the PDF
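
The constraints in the table above (allowed file extensions, the three modes, comma-separated output formats, and the mutual exclusion of max_pages and page_range) can be checked locally before a request is sent. The following is a minimal sketch, not part of the model's API: `build_input` and `parse_page_range` are hypothetical helper names, the validation is a client-side convenience, and the boolean value used for `disable_links` is an assumption about that key's type.

```python
import json
from urllib.parse import urlparse

# Values copied from the field descriptions in the input schema.
ALLOWED_EXTENSIONS = (".pdf", ".doc", ".docx", ".ppt", ".pptx",
                      ".png", ".jpg", ".jpeg", ".webp")
ALLOWED_MODES = ("fast", "balanced", "accurate")
ALLOWED_FORMATS = ("json", "html", "markdown", "chunks")


def parse_page_range(spec):
    """Expand a page_range string such as '0,2-4' into [0, 2, 3, 4]."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            if end < start:
                raise ValueError(f"descending range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


def build_input(file, mode="fast", output_format="markdown",
                max_pages=None, page_range=None, additional_config=None):
    """Hypothetical helper: validate fields locally and return the input dict."""
    # Look only at the path so a query string on a URL is ignored.
    if not urlparse(file).path.lower().endswith(ALLOWED_EXTENSIONS):
        raise ValueError(f"unsupported file type: {file!r}")
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {ALLOWED_MODES}")
    formats = [f.strip() for f in output_format.split(",")]
    unknown = [f for f in formats if f not in ALLOWED_FORMATS]
    if unknown:
        raise ValueError(f"unknown output format(s): {unknown}")
    if max_pages is not None and page_range is not None:
        raise ValueError("max_pages and page_range are mutually exclusive")

    payload = {"file": file, "mode": mode, "output_format": ",".join(formats)}
    if max_pages is not None:
        if max_pages < 1:
            raise ValueError("max_pages must be at least 1")
        payload["max_pages"] = max_pages
    if page_range is not None:
        parse_page_range(page_range)  # raises ValueError if malformed
        payload["page_range"] = page_range
    if additional_config is not None:
        # The schema expects additional_config as a JSON string, not an object.
        payload["additional_config"] = json.dumps(additional_config)
    return payload


# Example with a placeholder URL; the disable_links value type is assumed.
payload = build_input(
    "https://example.com/report.pdf",
    mode="balanced",
    output_format="markdown,json",
    page_range="0,2-4",
    additional_config={"disable_links": True},
)
```

The resulting dictionary is intended to be passed as the `input` argument of whichever client is used to call the model. Fields that are left out fall back to the defaults listed in the table.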

Output schema

The shape of the response you’ll get when you run this model with an API.

Schema

{'properties': {'chunks': {'items': {'additionalProperties': True,
                                     'type': 'object'},
                           'nullable': True,
                           'title': 'Chunks',
                           'type': 'array'},
                'html': {'nullable': True, 'title': 'Html', 'type': 'string'},
                'images': {'items': {'anyOf': [],
                                     'format': 'uri',
                                     'type': 'string'},
                           'nullable': True,
                           'title': 'Images',
                           'type': 'array'},
                'json_data': {'additionalProperties': True,
                              'nullable': True,
                              'title': 'Json Data',
                              'type': 'object'},
                'markdown': {'nullable': True,
                             'title': 'Markdown',
                             'type': 'string'},
                'metadata': {'additionalProperties': True,
                             'nullable': True,
                             'title': 'Metadata',
                             'type': 'object'},
                'page_count': {'title': 'Page Count', 'type': 'integer'}},
 'required': ['page_count'],
 'title': 'MarkerOutput',
 'type': 'object'}
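
Because every field except page_count is nullable, code that consumes the response should not assume any particular format is present. The sketch below shows one way to inspect a response that has already been decoded into a Python dict; `summarize_output` is a hypothetical helper, and the sample response is made-up illustration data, not real model output.

```python
# Fields that carry converted document content, per the output schema.
CONTENT_FIELDS = ("markdown", "html", "json_data", "chunks")


def summarize_output(output):
    """Report the page count, which content fields are populated, and the image count."""
    if "page_count" not in output:
        raise KeyError("page_count is the only required field and is missing")
    populated = [name for name in CONTENT_FIELDS if output.get(name) is not None]
    images = output.get("images") or []  # images may be absent or null
    return {
        "page_count": output["page_count"],
        "populated": populated,
        "image_count": len(images),
    }


# Made-up sample shaped like MarkerOutput when only markdown was requested.
sample = {"page_count": 2, "markdown": "# Title", "html": None, "images": None}
summary = summarize_output(sample)
```

Checking `is not None` rather than truthiness keeps an empty string or empty list distinguishable from a field that was not produced at all.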