bytedance / dolphin

Document Image Parsing via Heterogeneous Anchor Prompting

  • Public
  • 140 runs
  • T4
  • GitHub
  • Weights
  • License

Input

  • file (required): PDF or image file to parse
  • Output format (string, default: "markdown_content")

Output

# LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron; Thibaut Lavril; Gautier Izacard; Xavier Martinet Marie-Anne Lachaux, Timothee Lacroix, Baptiste Rozière, Naman Goyal Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin Edouard Grave; Guillaume Lample $^⋆$ Meta AI

## Abstract

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (17SB) on most benchmarks, and LLaMA65B is competitive with the best models. Chinchilla-70B and PaLM-540B. We release all our models to the research community $^1$ .

## 1 Introduction

Large Languages Models (LLMs) trained on massive corpora of texts have shown their ability to perform new tasks from textual instructions or from a few examples ( Brown et al. , 2020 ) . These few-shot properties first appeared when scaling models to a sufficient size ( Kaplan et al. , 2020 ) , resulting in a line of work that focuses on further scaling these models ( Chowdhery et al. , 2022 ; Rae et al. , 2021 ) . These efforts are based on the assumption that more parameters will lead to better performance. However, recent work from Hoffmann et al. ( 2022 ) shows that, for a given compute budget, the best performances are not achieved by the largest models, but by smaller models trained on more data.

The objective of the scaling laws from Hoffmann et al. ( 2022 ) is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of performance, a smaller one trained longer will ultimately be cheaper at inference. For instance, although Hoffmann et al. ( 2022 ) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.

The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by training on more tokens than what is typically used. The resulting models, called LLaMA , ranges from 7B to 65B parameters with competitive performance compared to the best existing LLMs. For instance, LLaMA-13B outperforms GPT-3 on most benchmarks, despite being 10 $\times$ smaller. We believe that this model will help democratize the access and study of LLMs, since it can be run on a single GPU. At the higher-end of the scale, our 65B-parameter model is also competitive with the best large language models such as Chinchilla or PaLM-540B.

Unlike Chinchilla, PaLM, or GPT-3, we only use publicly available data, making our work compatible with open-sourcing, while most existing models rely on data which is either not publicly available or undocumented (e.g. “ Books – 2TB" or “ Social media conversations"). There exist some exceptions, notably OPT ( Zhang et al. , 2022 ) , GPT-NeoX ( Black et al. , 2022 ) , BLOOM ( Scao et al. , 2022 ) and GLM ( Zeng et al. , 2022 ) , but none that are competitive with PaLM-62B or Chinchilla.

In the rest of this paper, we present an overview of the modifications we made to the transformer architecture ( Vaswani et al. , 2017 ) , as well as our training method. We then report the performance of our models and compare with others LLMs on a set of standard benchmarks. Finally, we expose some of the biases and toxicity encoded in our models, using some of the most recent benchmarks from the responsible AI community.

* Equal contribution. Correspondence: (htouvron, thibautlav,gizacard,egrave,glample)@meta.com https://github.com/facebookresearch/llam arXiv:2302.1397lv1 [cs.CL] 27 Feb 2023

Run time and cost

This model runs on Nvidia T4 GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

Model Description

Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting) is a novel multimodal document image parsing model that follows an analyze-then-parse paradigm. It addresses the challenges of complex document understanding through a two-stage approach designed to handle intertwined elements such as text paragraphs, figures, formulas, and tables.

📑 Overview

Document image parsing is challenging because documents intertwine heterogeneous elements such as text paragraphs, figures, formulas, and tables. Dolphin addresses these challenges through a two-stage approach:

  1. 🔍 Stage 1: Comprehensive page-level layout analysis that generates an element sequence in natural reading order
  2. 🧩 Stage 2: Efficient parallel parsing of document elements using heterogeneous anchors and task-specific prompts

Dolphin achieves promising performance across diverse page-level and element-level parsing tasks while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism.
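
To make the analyze-then-parse flow concrete, below is a small illustrative sketch (not the authors' code) of how Stage 2 can dispatch heterogeneous anchors: the Stage 1 result is assumed to be a reading-ordered list of element types and bounding boxes, and each element is paired with a cropped region and a task-specific prompt before a single batched decoding pass. The layout format and prompt wording are illustrative assumptions.

```python
# Illustrative sketch of Stage 2's heterogeneous anchor prompting.
# The layout structure and prompt strings are assumptions, not the model's
# actual interface.
from PIL import Image

# Hypothetical Stage 1 output: elements in natural reading order, each with
# a type and a pixel bounding box (left, top, right, bottom).
layout = [
    ("paragraph", (40, 60, 560, 180)),
    ("table",     (40, 200, 560, 420)),
    ("formula",   (40, 440, 560, 500)),
]

# Task-specific prompts keyed by element type (illustrative wording).
TASK_PROMPTS = {
    "paragraph": "Read text in the image.",
    "table":     "Parse the table in the image.",
    "formula":   "Read formula in the image.",
}

page = Image.open("page.png").convert("RGB")

# One (anchor crop, prompt) pair per element; all pairs can then be decoded
# in a single batched generate() call, which is what makes Stage 2 parallel
# rather than sequential.
batch = [(page.crop(box), TASK_PROMPTS[etype]) for etype, box in layout]
print(f"{len(batch)} element crops ready for parallel parsing")
```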

Model Architecture

Dolphin is built on a vision-encoder-decoder architecture using transformers:

  • Vision Encoder: Based on Swin Transformer for extracting visual features from document images
  • Text Decoder: Based on MBart for decoding text from visual features
  • Prompt-based interface: Uses natural language prompts to control parsing tasks

The model is implemented as a Hugging Face VisionEncoderDecoderModel for easy integration with the Transformers ecosystem.
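
As a quick illustration of that composition, the same two halves can be wired together with stock Transformers classes. The configs below use library defaults, not the released checkpoint's actual hyperparameters.

```python
# Sketch of the encoder-decoder composition with stock Transformers classes.
# Config hyperparameters are library defaults, not Dolphin's actual sizes.
from transformers import (
    MBartConfig,
    SwinConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    SwinConfig(),   # vision encoder: Swin Transformer
    MBartConfig(),  # text decoder: MBart, cross-attention is added automatically
)
model = VisionEncoderDecoderModel(config)

print(type(model.encoder).__name__)  # SwinModel
print(type(model.decoder).__name__)  # MBartForCausalLM
```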

Usage

Our demo will be released in the coming days. Please stay tuned! 🔥

Please refer to our GitHub repository for detailed usage.
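
In the meantime, the sketch below shows one plausible way to drive the model with the standard Transformers generate API. The Hub id "ByteDance/Dolphin", the assumption that AutoProcessor bundles both the image processor and the tokenizer, and the prompt template and wording are all unverified assumptions; treat the GitHub repository as the source of truth.

```python
# Hedged inference sketch; Hub id, prompt template, and prompt wording are
# assumptions. See the GitHub repository for the official usage.
import torch
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

MODEL_ID = "ByteDance/Dolphin"  # assumed Hub id
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = VisionEncoderDecoderModel.from_pretrained(MODEL_ID).eval()

page = Image.open("page.png").convert("RGB")
pixel_values = processor(images=page, return_tensors="pt").pixel_values

# Stage 1 style page-level prompt (illustrative wording).
prompt = "Parse the reading order of this document."
decoder_input_ids = processor.tokenizer(
    f"<s>{prompt} <Answer/>", add_special_tokens=False, return_tensors="pt"
).input_ids

with torch.no_grad():
    output_ids = model.generate(
        pixel_values=pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=1024,
    )

print(processor.tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```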

License

This model is released under the MIT License.

Citation

@inproceedings{dolphin2025,
  title={Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting},
  author={Feng, Hao and Wei, Shu and Fei, Xiang and Shi, Wei and Han, Yingdong and Liao, Lei and Lu, Jinghui and Wu, Binghong and Liu, Qi and Lin, Chunhui and Tang, Jingqun and Liu, Hao and Huang, Can},
  year={2025},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL)}
}

Acknowledgements

This model builds on several open-source projects, including:

  • Hugging Face Transformers
  • Donut
  • Nougat
  • Swin Transformer