LLM Sherpa can be used to load files of many types. It supports file formats including DOCX, PPTX, HTML, TXT, and XML.
LLMSherpaFileLoader uses LayoutPDFReader, which is part of the LLMSherpa library. This tool is designed to parse PDFs while preserving their layout information, which most PDF-to-text parsers lose.
Here are some key features of LayoutPDFReader:
- It can identify and extract sections and subsections along with their levels.
- It combines lines to form paragraphs.
- It can identify links between sections and paragraphs.
- It can extract tables along with the section the tables are found in.
- It can identify and extract lists and nested lists.
- It can join content spread across pages.
- It can remove repeating headers and footers.
- It can remove watermarks.
INFO: This library fails on some PDF files, so use it with caution.
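As a rough illustration of the section and level extraction listed above, the sketch below reads a file with LayoutPDFReader and collects its section outline. It assumes the `llmsherpa` package is installed and a parser endpoint is reachable. The helper names `read_layout` and `outline` are hypothetical, and the `level` and `title` attributes on sections are used here on the assumption that they match the installed library version.

```python
def read_layout(path_or_url, api_url):
    """Parse a file with LayoutPDFReader (needs llmsherpa and a reachable parser API)."""
    # Imported lazily so the rest of this sketch works without llmsherpa installed.
    from llmsherpa.readers import LayoutPDFReader

    reader = LayoutPDFReader(api_url)
    return reader.read_pdf(path_or_url)


def outline(doc):
    """Return (level, title) pairs for every section of a parsed document.

    Assumes `doc.sections()` yields objects exposing `level` and `title`.
    """
    return [(section.level, section.title) for section in doc.sections()]
```

A call such as `outline(read_layout("report.pdf", api_url))` would then give a quick view of how the parser nested the headings. Both `"report.pdf"` and `api_url` are placeholders for your own file and endpoint.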
LLMSherpaFileLoader
Under the hood, LLMSherpaFileLoader defines several strategies for loading file content: "sections", "chunks", "html", and "text". Set up nlm-ingestor to get an llmsherpa_api_url, or use the default.
- sections strategy: returns the file parsed into sections
- chunks strategy: returns the file parsed into chunks
- html strategy: returns the file as one HTML document
- text strategy: returns the file as one text document
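A minimal sketch of choosing one of the strategies above when building the loader. It assumes the `langchain-community` and `llmsherpa` packages are installed. The helpers `build_loader_kwargs` and `load_documents` are hypothetical, and their default of `"chunks"` is a choice made for this example. Leaving `llmsherpa_api_url` unset is meant to fall back to the loader's default endpoint.

```python
VALID_STRATEGIES = ("sections", "chunks", "html", "text")


def build_loader_kwargs(file_path, strategy="chunks", llmsherpa_api_url=None):
    """Validate the strategy name and assemble keyword arguments for the loader."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(
            f"Unknown strategy {strategy!r}; expected one of {VALID_STRATEGIES}"
        )
    kwargs = {"file_path": file_path, "strategy": strategy}
    # Only pass a URL when one is given, so the loader's default applies otherwise.
    if llmsherpa_api_url is not None:
        kwargs["llmsherpa_api_url"] = llmsherpa_api_url
    return kwargs


def load_documents(file_path, strategy="chunks", llmsherpa_api_url=None):
    """Load a file with LLMSherpaFileLoader (needs network access to the parser API)."""
    # Imported lazily so the validation helper works without LangChain installed.
    from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader

    loader = LLMSherpaFileLoader(
        **build_loader_kwargs(file_path, strategy, llmsherpa_api_url)
    )
    return loader.load()
```

For example, `load_documents("report.pdf", strategy="sections")` would return one document per parsed section, while `strategy="text"` would return the whole file as a single document. `"report.pdf"` is a placeholder path.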