turian / arxiv-llm-text

Prepare arXiv papers for processing by Large Language Models (LLMs) by converting them into a single, expanded LaTeX file.

Run time and cost

This model runs on CPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Overview

How It Works:

  • Input: An arXiv URL (abstract, PDF, or HTML page).
  • Process:
    • Downloads and extracts the paper’s source files from arXiv.
    • Identifies the main LaTeX file using heuristics.
    • Expands all \input{} and \include{} commands into a single file using latexpand.
    • Optionally includes or excludes comments and figures.
  • Output: A single, self-contained LaTeX file ready for LLM consumption.
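The input step accepts abstract, PDF, and HTML URLs, so all three forms need to map to the same paper identifier. The following is a minimal sketch of how that normalization could look. The function names `parse_arxiv_id` and `source_url` are hypothetical, and the use of arXiv's `/e-print/` endpoint for the source download is an assumption rather than something this README confirms.

```python
import re
from urllib.parse import urlparse

# New-style IDs look like 2301.01234 (optionally followed by a version, e.g. v2);
# old-style IDs look like hep-th/9901001 or math.AG/0601001.
_ID_PATTERN = re.compile(
    r"^/(?:abs|pdf|html)/"
    r"(?P<id>\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})"
    r"(?P<version>v\d+)?"
    r"(?:\.pdf)?/?$"
)


def parse_arxiv_id(url: str) -> str:
    """Return the arXiv ID (with version, if present) from an abs/pdf/html URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host != "arxiv.org" and not host.endswith(".arxiv.org"):
        raise ValueError(f"Not an arXiv URL: {url}")
    match = _ID_PATTERN.match(parsed.path)
    if match is None:
        raise ValueError(f"Could not find an arXiv ID in: {url}")
    return match.group("id") + (match.group("version") or "")


def source_url(arxiv_id: str) -> str:
    """Build a source-download URL (assumed endpoint, not confirmed by this README)."""
    return f"https://arxiv.org/e-print/{arxiv_id}"
```

For example, `parse_arxiv_id("https://arxiv.org/pdf/2301.01234v2.pdf")` yields `2301.01234v2`, which `source_url` then turns into a download link.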

Input Parameters

Known Behaviors and Limitations

  • Multiple Main Files Found:
    • If multiple candidate main .tex files are found (e.g., several files containing \documentclass), the model fails deliberately rather than guessing, to prevent unintended behavior.
    • What Happens:
      • The model raises an error indicating that multiple main files were detected and that it is ambiguous which one to use.
    • Recommended Action:
      • Users should ensure their arXiv submission contains a uniquely identifiable main TeX file, typically named main.tex or similar.
    • Note:
      • The model no longer accepts a main_file parameter to specify the main TeX file.
  • Behavior of latexpand:
    • No TeX Dependencies Required:
      • latexpand runs with Perl alone and does not need TeX tooling such as kpsewhich, since style files are not expanded.
    • Comment Handling:
      • By default, comments are included in the output (the include_comments parameter defaults to True). Users can exclude comments by setting this parameter to False.
    • Limitations:
      • May not handle \begin{verbatim}...\end{verbatim} blocks correctly, especially if they contain comments or inclusion commands.
      • Does not expand .sty files or handle complex macros that depend on external style files.
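The fail-on-ambiguity behavior for main files can be illustrated with a short sketch. This is not the model's actual heuristic, which the README does not spell out. It assumes the simplest rule consistent with the description: a file is a candidate if it contains an uncommented \documentclass. The names `find_main_tex` and `AmbiguousMainFileError` are hypothetical.

```python
import re
from pathlib import Path

# Match \documentclass at the start of a line (leading whitespace allowed),
# so commented-out occurrences such as "% \documentclass{...}" are skipped.
_DOCCLASS = re.compile(r"^\s*\\documentclass\b", re.MULTILINE)


class AmbiguousMainFileError(RuntimeError):
    """Raised when more than one file looks like the main TeX file."""


def find_main_tex(source_dir: Path) -> Path:
    """Return the single .tex file containing \\documentclass, or raise."""
    candidates = sorted(
        path
        for path in source_dir.rglob("*.tex")
        if _DOCCLASS.search(path.read_text(encoding="utf-8", errors="replace"))
    )
    if not candidates:
        raise FileNotFoundError(f"No main TeX file found in {source_dir}")
    if len(candidates) > 1:
        names = ", ".join(str(p.relative_to(source_dir)) for p in candidates)
        # Refuse to guess: picking the wrong file would silently expand the wrong paper.
        raise AmbiguousMainFileError(f"Multiple main files detected: {names}")
    return candidates[0]
```

Raising instead of picking the first candidate mirrors the documented choice to fail loudly when the main file is ambiguous.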

Notes

  • Glitches in Our Code:
    • The heuristic for finding the main TeX file may fail if the paper’s structure is unconventional.
    • In that case, users may need to restructure their arXiv submission so that it contains a uniquely identifiable main TeX file.
  • Glitches in latexpand:
    • Special environments or macros may not be expanded as expected.
    • Inclusion commands inside verbatim environments may be processed incorrectly.
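Because verbatim blocks are a known weak spot, it can help to check a paper for risky blocks before trusting the expanded output. The sketch below is a hypothetical pre-check, not part of the model or of latexpand. It flags verbatim environments whose body contains an inclusion command or a comment character. The name `risky_verbatim_blocks` is an assumption introduced for illustration.

```python
import re

# Body of each \begin{verbatim}...\end{verbatim} block (non-greedy, across lines).
_VERBATIM = re.compile(r"\\begin\{verbatim\}(.*?)\\end\{verbatim\}", re.DOTALL)

# \input or \include (but not \includegraphics), or a comment character.
_RISKY = re.compile(r"\\(?:input|include)\b|%")


def risky_verbatim_blocks(tex: str) -> list[str]:
    """Return the bodies of verbatim blocks that contain inclusion commands or '%'."""
    return [
        match.group(1)
        for match in _VERBATIM.finditer(tex)
        if _RISKY.search(match.group(1))
    ]
```

If this returns a non-empty list, the corresponding passages in the expanded file are worth comparing against the original source by hand.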

Output

  • The model returns a single expanded LaTeX file named [arxiv_id]_expanded.tex containing the complete paper content with all includes resolved.
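The expansion step and the output naming can be sketched together as a small wrapper around the latexpand command line. This is an illustration under stated assumptions: the mapping of include_comments to latexpand's `--keep-comments` flag is a guess at how the model wires the parameter, the helper names are hypothetical, and replacing the slash in old-style IDs is an assumption made only so the result is a valid filename. Figure handling is left out because the README does not say how it is implemented.

```python
import subprocess
from pathlib import Path


def build_latexpand_command(
    main_tex: Path, arxiv_id: str, include_comments: bool = True
) -> list[str]:
    """Build the latexpand argument list for one paper."""
    # Old-style IDs such as hep-th/9901001 contain a slash (assumed handling).
    output_name = f"{arxiv_id.replace('/', '_')}_expanded.tex"
    command = ["latexpand"]
    if include_comments:
        # latexpand strips comments unless --keep-comments is given.
        command.append("--keep-comments")
    command += ["--output", output_name, main_tex.name]
    return command


def expand(main_tex: Path, arxiv_id: str, include_comments: bool = True) -> Path:
    """Run latexpand from the main file's directory so relative \\input paths resolve."""
    command = build_latexpand_command(main_tex, arxiv_id, include_comments)
    subprocess.run(command, cwd=main_tex.parent, check=True)
    return main_tex.parent / command[command.index("--output") + 1]
```

Keeping command construction separate from execution makes the flag logic easy to inspect without latexpand installed.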