Readme
Purpose: Prepare arXiv papers for processing by Large Language Models (LLMs) by converting them into a single, expanded LaTeX file.
Overview
How It Works:
- Input: An arXiv URL (abstract, PDF, or HTML page).
- Process:
  - Downloads and extracts the paper’s source files from arXiv.
  - Identifies the main LaTeX file using heuristics.
  - Expands all \input{} and \include{} commands into a single file using latexpand.
  - Optionally includes or excludes comments and figures.
- Output: A single, self-contained LaTeX file ready for LLM consumption.
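
The expansion step above can be sketched as follows. This is a minimal illustration, not the model's actual code: the function names build_latexpand_command and expand_paper are hypothetical, and it assumes latexpand is installed and on the PATH. It relies on latexpand stripping comments by default and keeping them when --keep-comments is passed. How the model handles include_figures is not described in this README, so that option is left out.

```python
import subprocess
from pathlib import Path


def build_latexpand_command(
    main_file: Path, output_file: Path, include_comments: bool = True
) -> list[str]:
    """Build the latexpand invocation (hypothetical helper, for illustration)."""
    cmd = ["latexpand"]
    if include_comments:
        # latexpand removes comments unless this flag is given.
        cmd.append("--keep-comments")
    # Use an absolute output path because latexpand is run from another directory.
    cmd += ["--output", str(output_file.resolve()), main_file.name]
    return cmd


def expand_paper(
    main_file: Path, output_file: Path, include_comments: bool = True
) -> Path:
    """Run latexpand from the main file's directory so relative \\input paths resolve."""
    cmd = build_latexpand_command(main_file, output_file, include_comments)
    subprocess.run(cmd, cwd=main_file.parent, check=True)
    return output_file
```

Calling expand_paper(Path("src/main.tex"), Path("2004.10151_expanded.tex")) would write the expanded file, provided latexpand is available. Running from the main file's directory is a design choice of this sketch, made so that relative inclusion paths resolve the same way they do when the paper is compiled.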
Input Parameters
- arxiv_url: Any arXiv URL format is accepted:
  - Abstract page: https://arxiv.org/abs/2004.10151
  - PDF: https://arxiv.org/pdf/2004.10151.pdf
  - HTML: https://arxiv.org/html/2004.10151
- include_figures: Boolean, default False. Whether to include figure definitions in the output.
- include_comments: Boolean, default True. Whether to include comments in the expanded LaTeX output.
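
To illustrate how the three URL forms can map to the same paper, here is a small sketch that extracts the arXiv identifier from a URL. The helper names extract_arxiv_id and output_filename are hypothetical, and the parsing rules are an assumption, since this README does not describe how the model parses URLs. Only new-style identifiers such as 2004.10151 (optionally with a version suffix) are handled.

```python
import re
from urllib.parse import urlparse

# New-style arXiv identifiers, e.g. 2004.10151 or 2004.10151v2.
_NEW_STYLE_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def extract_arxiv_id(arxiv_url: str) -> str:
    """Return the arXiv identifier from an abs, pdf, or html URL (illustrative)."""
    parsed = urlparse(arxiv_url)
    if parsed.netloc not in ("arxiv.org", "www.arxiv.org"):
        raise ValueError(f"Not an arXiv URL: {arxiv_url}")
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) != 2 or parts[0] not in ("abs", "pdf", "html"):
        raise ValueError(f"Unrecognized arXiv URL path: {parsed.path}")
    candidate = parts[1]
    if candidate.endswith(".pdf"):
        # PDF links may carry a .pdf suffix after the identifier.
        candidate = candidate[: -len(".pdf")]
    if not _NEW_STYLE_ID.match(candidate):
        raise ValueError(f"Unrecognized arXiv identifier: {candidate}")
    return candidate


def output_filename(arxiv_id: str) -> str:
    """Name of the expanded file, following the [arxiv_id]_expanded.tex pattern."""
    return f"{arxiv_id}_expanded.tex"
```

With this sketch, all three example URLs above resolve to the identifier 2004.10151.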
Known Behaviors and Limitations
- Multiple Main Files Found:
  - If multiple candidate main .tex files are found (e.g., several files containing \documentclass), the model fails deliberately rather than guessing, to prevent unintended behavior.
  - What Happens:
    - The model raises an error indicating that multiple main files were detected and that it is ambiguous which one to use.
  - Recommended Action:
    - Users should ensure their arXiv submission contains a uniquely identifiable main TeX file, typically named main.tex or similar.
  - Note:
    - The model no longer accepts a main_file parameter to specify the main TeX file.
- Behavior of latexpand:
  - No TeX Dependencies Required:
    - latexpand runs solely with Perl, without requiring TeX-related tools like kpsewhich, since we are not expanding style files.
  - Comment Handling:
    - By default, comments are included in the output (the include_comments parameter is True). Users can exclude comments by setting this parameter to False.
  - Limitations:
    - May not handle \begin{verbatim}...\end{verbatim} blocks correctly, especially if they contain comments or inclusion commands.
    - Does not expand .sty files or handle complex macros that depend on external style files.
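
The multiple-main-files behavior described above can be sketched as follows. This is an assumption about how such a check might look, not the model's actual heuristic: it treats any .tex file with an uncommented \documentclass line as a candidate and refuses to choose when there is more than one. The names find_main_tex and AmbiguousMainFileError are hypothetical.

```python
import re
from pathlib import Path

# \documentclass at the start of a line; lines beginning with % do not match.
_DOCUMENTCLASS = re.compile(r"^[ \t]*\\documentclass\b", re.MULTILINE)


class AmbiguousMainFileError(RuntimeError):
    """Raised when more than one candidate main TeX file is found."""


def find_main_tex(source_dir: Path) -> Path:
    """Return the single .tex file that declares a document class."""
    candidates = sorted(
        path
        for path in source_dir.rglob("*.tex")
        if _DOCUMENTCLASS.search(path.read_text(encoding="utf-8", errors="ignore"))
    )
    if not candidates:
        raise FileNotFoundError(f"No main TeX file found in {source_dir}")
    if len(candidates) > 1:
        names = ", ".join(str(p.relative_to(source_dir)) for p in candidates)
        # Fail instead of guessing, mirroring the behavior described above.
        raise AmbiguousMainFileError(f"Multiple main files detected: {names}")
    return candidates[0]
```

Failing loudly here means the caller never receives an expansion of the wrong file, at the cost of rejecting submissions that ship more than one standalone document.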
Notes
- Glitches in Our Code:
  - The heuristic for finding the main TeX file may fail if the paper’s structure is unconventional.
  - Users may need to restructure their arXiv submission so that it contains a single, uniquely identifiable main TeX file.
- Glitches in latexpand:
  - Special environments or macros may not be expanded as expected.
  - Inclusion commands inside verbatim environments may be processed incorrectly.
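
Because verbatim blocks are a known weak spot, it can help to check a source file before expansion. The sketch below is a hypothetical pre-flight check and not part of the model. It reports the line numbers of verbatim blocks that contain comment characters or \input / \include commands, which are the cases flagged above as potentially mishandled. The function name find_risky_verbatim_blocks is an assumption.

```python
import re

# A verbatim environment and its body (non-greedy, spanning multiple lines).
_VERBATIM_BLOCK = re.compile(
    r"\\begin\{verbatim\}(.*?)\\end\{verbatim\}", re.DOTALL
)
# Content that may confuse expansion: inclusion commands or comment characters.
_RISKY = re.compile(r"\\(?:input|include)\b|%")


def find_risky_verbatim_blocks(tex_source: str) -> list[int]:
    """Return 1-based line numbers where risky verbatim blocks begin."""
    risky_lines = []
    for match in _VERBATIM_BLOCK.finditer(tex_source):
        if _RISKY.search(match.group(1)):
            # Count newlines before the block to get its starting line.
            line = tex_source.count("\n", 0, match.start()) + 1
            risky_lines.append(line)
    return risky_lines
```

Any reported line is worth inspecting by hand in the expanded output.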
Output
- The model returns a single expanded LaTeX file named [arxiv_id]_expanded.tex containing the complete paper content with all includes resolved.