adirik / udop-large

Performs document image classification, document parsing and document visual question answering


Run time and cost

This model runs on Nvidia A40 GPU hardware.

Readme

UDOP

UDOP is a unified model from Microsoft for document classification, layout parsing, and visual question answering. See the paper, original repository, and HF model page for details.

How to Use the API

To use UDOP, you need to provide an image file and a text prompt describing the task you want to perform on the document. The API input arguments are as follows:

  • image: Path to the input image file of the document.
  • prompt: Text prompt describing the task to perform on the document.
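
A call through the Replicate Python client might look like the sketch below. It assumes the `replicate` package is installed and the `REPLICATE_API_TOKEN` environment variable is set. `MODEL_REF` contains a placeholder that must be replaced with the version identifier shown on the model page, and `build_input` and `run_udop` are hypothetical helper names, not part of the model's API.

```python
from pathlib import Path

# Placeholder (assumption): replace <version> with the version hash
# listed on the model page before running.
MODEL_REF = "adirik/udop-large:<version>"


def build_input(image, prompt):
    """Assemble the two documented input arguments: image and prompt."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return {"image": image, "prompt": prompt}


def run_udop(image_path, prompt, model_ref=MODEL_REF):
    """Send a document image and a task prompt to the hosted model."""
    import replicate  # imported lazily so build_input works without it

    # The client uploads an open binary file handle as the image input.
    with Path(image_path).open("rb") as image_file:
        return replicate.run(model_ref, input=build_input(image_file, prompt))
```

With a real version identifier in place, `run_udop("report.png", "Document classification.")` would submit a local image for classification; the shape of the returned value is not described here, so inspect it before relying on it.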

Usage Tips

The text prompt should contain the task definition followed, if the task needs one, by the task-specific input (such as a question). Separate the two with a period. For example:

  • For the question answering task, the prompt should be “Question answering. In which year is the report made?”
  • For the document classification task, which needs no extra input, the prompt should be “Document classification.”
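
The prompt convention above can be captured in a small helper. This is a minimal sketch: `build_prompt` is a hypothetical name, and the only rule it encodes is the one stated here, a task definition ending in a period, optionally followed by the task text.

```python
def build_prompt(task, detail=None):
    """Join a task definition and optional task text with a period."""
    task = task.strip().rstrip(".")
    if not task:
        raise ValueError("task definition must not be empty")
    prompt = task + "."
    # Append the task-specific text (e.g. a question) only when given.
    if detail and detail.strip():
        prompt += " " + detail.strip()
    return prompt


print(build_prompt("Question answering", "In which year is the report made?"))
# Question answering. In which year is the report made?
print(build_prompt("Document classification"))
# Document classification.
```

Stripping a trailing period before re-adding it means the helper accepts either "Document classification" or "Document classification." without producing a doubled period.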

References

@misc{tang2023unifying,
      title={Unifying Vision, Text, and Layout for Universal Document Processing}, 
      author={Zineng Tang and Ziyi Yang and Guoxin Wang and Yuwei Fang and Yang Liu and Chenguang Zhu and Michael Zeng and Cha Zhang and Mohit Bansal},
      year={2023},
      eprint={2212.02623},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}