lucataco/glm-ocr

Compact 0.9B multimodal OCR model from Z.ai. State-of-the-art on OmniDocBench V1.5 (94.62, #1 overall). Four modes: text recognition, formula (LaTeX), table parsing, and JSON-schema information extraction. Fits on a single T4.

GLM-OCR

GLM-OCR is a compact, multimodal OCR model built for real-world document understanding. With just 0.9 billion parameters, it ranks first on OmniDocBench V1.5 with a score of 94.62 and delivers state-of-the-art results on text recognition, formula recognition, table parsing, and structured information extraction. It is designed to be small enough to run on inexpensive GPUs while staying competitive with much larger document models.

Under the hood, GLM-OCR uses the GLM-V encoder–decoder architecture: a CogViT vision tower pre-trained on large-scale image–text data, a lightweight cross-modal connector that downsamples visual tokens efficiently, and a GLM-0.5B language decoder. Training uses a Multi-Token Prediction loss together with stable, full-task reinforcement learning, which the authors report improves both training efficiency and generalization across document types.
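
As a rough sketch of how these pieces fit together (not the authors' code; module names, dimensions, and the 4-token merge factor below are all assumptions for illustration), the connector's role looks something like this in PyTorch:

```python
# Illustrative sketch of the GLM-OCR data flow, not the authors' implementation.
# Dimensions and the 4-token merge factor are assumptions for clarity.
import torch
import torch.nn as nn

class ConnectorSketch(nn.Module):
    """Downsamples visual tokens and projects them into the decoder's space."""
    def __init__(self, d_vision=1024, d_model=1536, merge=4):
        super().__init__()
        self.merge = merge  # merge groups of 4 adjacent tokens (assumed factor)
        self.proj = nn.Linear(d_vision * merge, d_model)

    def forward(self, visual_tokens):             # (batch, n_tokens, d_vision)
        b, n, d = visual_tokens.shape
        grouped = visual_tokens.reshape(b, n // self.merge, d * self.merge)
        return self.proj(grouped)                 # (batch, n_tokens/4, d_model)

# A 336 px image with 14 px patches yields 24 x 24 = 576 patch tokens.
visual_tokens = torch.randn(1, 576, 1024)         # stand-in for CogViT output
visual_embeds = ConnectorSketch()(visual_tokens)  # fed to the GLM-0.5B decoder
print(visual_embeds.shape)                        # torch.Size([1, 144, 1536])
```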

What it can do

GLM-OCR exposes four task modes through this Replicate wrapper.

The first mode, text, performs general-purpose OCR on any document image. It handles dense paragraphs, mixed scripts, decorative typography, and noisy real-world scans, returning clean reading-order text.

The second mode, formula, recognizes mathematical formulas and returns LaTeX-style markup suitable for rendering or downstream parsing. This includes inline math, multi-line equations, and complex notation typical of scientific papers.

The third mode, table, parses tabular regions and returns structured table markup that preserves rows, columns, and cell content. It is robust to merged cells, ruled and ruleless layouts, and code-heavy or form-style tables.

The fourth mode, custom, lets you supply your own prompt for structured information extraction. The intended use is to provide a strict JSON schema describing the fields you want extracted, for example identity-document fields, invoice line items, or contract metadata. The model will return a JSON object that follows your schema, which makes it easy to plug directly into downstream pipelines.

How to use it

Upload a document image and pick one of the four task modes. For text, formula, and table you do not need to write any prompt — the wrapper supplies the right one for you. For custom mode, paste a JSON schema as the custom prompt and the model will fill it in.
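
A minimal call through the Replicate Python client might look like the sketch below. The input field names (`image`, `task`, `custom_prompt`) are assumptions based on the description above; check this model's API schema for the exact names.

```python
# Sketch of calling this wrapper via the Replicate Python client.
# Input field names are assumptions; confirm them against the API schema.
import replicate

# General-purpose OCR: no prompt needed, the wrapper supplies one.
text = replicate.run(
    "lucataco/glm-ocr",
    input={"image": open("page.png", "rb"), "task": "text"},
)
print(text)

# Structured extraction: pass a strict JSON schema as the custom prompt.
# The field names here are illustrative, not required by the model.
schema_prompt = """Extract the following fields and return valid JSON:
{
  "invoice_number": "string",
  "issue_date": "YYYY-MM-DD",
  "total_amount": "number",
  "currency": "string"
}"""
extracted = replicate.run(
    "lucataco/glm-ocr",
    input={
        "image": open("invoice.jpg", "rb"),
        "task": "custom",
        "custom_prompt": schema_prompt,
    },
)
print(extracted)
```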

For best results, send images at their native resolution where possible. The vision encoder uses a 14-pixel patch size with a 336-pixel base image size and supports a wide range of aspect ratios, so feeding crops at 1000 pixels or higher on the long side typically gives the cleanest output. If you have multi-page documents, run one page at a time.
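
If your sources are small thumbnails, a simple pre-scaling step can help. The sketch below uses Pillow; the 1000-pixel target is the rule of thumb from above, not a hard requirement of the model:

```python
# Upscale small document images so the long side is at least ~1000 px,
# per the resolution guidance above. Pillow-based sketch.
from PIL import Image

def prepare_page(path, min_long_side=1000):
    img = Image.open(path).convert("RGB")
    long_side = max(img.size)
    if long_side < min_long_side:
        scale = min_long_side / long_side
        new_size = (round(img.width * scale), round(img.height * scale))
        img = img.resize(new_size, Image.LANCZOS)
    return img

prepare_page("scan.jpg").save("scan_prepped.png")
```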

Strengths and limitations

GLM-OCR is optimized for documents, including scans, photos of documents, screenshots, and natively rendered PDFs that you have rasterized to images. It performs especially well on complex tables, formula-heavy academic papers, code blocks, identity documents, and stamped or sealed forms. Throughput on a single GPU is roughly 1.86 pages per second on PDF inputs and 0.67 images per second on photographs, according to the authors’ speed test.

It is not a general-purpose vision-language model. It will not caption photos, answer open-ended questions about scenes, or perform creative writing. For information extraction, the quality of the output depends heavily on the strictness of the JSON schema you provide — vague prompts produce vague outputs.

The model supports Chinese, English, French, Spanish, Russian, German, Japanese, and Korean. Performance on other scripts may be limited.

Hardware and licensing

GLM-OCR is small enough to run comfortably on a single inexpensive GPU. It uses about 2.2 gigabytes of GPU memory in BF16, which means it fits on entry-level cards including the NVIDIA T4. Cold starts on Replicate typically take under a minute, and warm predictions complete in a handful of seconds.

The model weights are released by Z.ai under the MIT License. The wrapper code in this repository is released under the Apache 2.0 License. The official end-to-end document parsing SDK from Z.ai also incorporates PP-DocLayoutV3, which is Apache 2.0 licensed; this Cog wrapper does not bundle that layout component, and the four task modes above use the GLM-OCR weights directly without a separate layout-analysis step.

Credits

GLM-OCR is developed by Z.ai. If you use it in your work, please cite the GLM-OCR Technical Report by Duan, Xue, Wang, and colleagues, available on arXiv. The full author list, evaluation tables, and additional inference recipes including vLLM, SGLang, and Ollama integrations are available on the upstream model card on Hugging Face at zai-org/GLM-OCR, and on the project page at github.com/zai-org/GLM-OCR.
