All document loaders implement the BaseLoader interface.
Interface
Each document loader may define its own parameters, but they share a common API:

- `.load()`: Loads all documents at once.
- `.lazy_load()`: Streams documents lazily, useful for large datasets.
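The difference between the two methods can be sketched with a standard-library stand-in. This is an illustration under stated assumptions, not LangChain's own code: the `Document` dataclass and the `LineLoader` class below are hypothetical names that only mirror the `.load()` / `.lazy_load()` shape described above.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Document:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that produces one Document per non-empty line."""

    def __init__(self, text: str, source: str = "inline") -> None:
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[Document]:
        # Generator: each Document is built only when the caller asks for it.
        for number, line in enumerate(self.text.splitlines(), start=1):
            if line.strip():
                yield Document(line.strip(), {"source": self.source, "line": number})

    def load(self) -> list[Document]:
        # Eager variant: materialize everything lazy_load() would yield.
        return list(self.lazy_load())


loader = LineLoader("first record\n\nsecond record\nthird record")

docs = loader.load()         # all documents held in memory at once
stream = loader.lazy_load()  # nothing is processed until iteration starts
first = next(stream)         # pulls exactly one document
```

Defining `load()` in terms of `lazy_load()` keeps a single code path for parsing; callers with large inputs iterate over the stream, while callers with small inputs take the list.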
By category
Webpages
The below document loaders allow you to load webpages.

| Document Loader | Description | Package/API |
|---|---|---|
| Web | Uses urllib and BeautifulSoup to load and parse HTML web pages | Package |
| Unstructured | Uses Unstructured to load and parse web pages | Package |
| RecursiveURL | Recursively scrapes all child links from a root URL | Package |
| Sitemap | Scrapes all pages on a given sitemap | Package |
| Spider | Crawler and scraper that returns LLM-ready data | API |
| Firecrawl | API service that can be deployed locally | API |
| Docling | Uses Docling to load and parse web pages | Package |
| Hyperbrowser | Platform for running and scaling headless browsers, can be used to scrape/crawl any site | API |
| AgentQL | Web interaction and structured data extraction from any web page using an AgentQL query or a Natural Language prompt | API |
PDFs
The below document loaders allow you to load PDF documents.

| Document Loader | Description | Package/API |
|---|---|---|
| PyPDF | Uses pypdf to load and parse PDFs | Package |
| Unstructured | Uses Unstructured’s open source library to load PDFs | Package |
| Amazon Textract | Uses AWS API to load PDFs | API |
| MathPix | Uses MathPix to load PDFs | Package |
| PDFPlumber | Load PDF files using PDFPlumber | Package |
| PyPDFDirectory | Load a directory with PDF files | Package |
| PyPDFium2 | Load PDF files using PyPDFium2 | Package |
| PyMuPDF | Load PDF files using PyMuPDF | Package |
| PyMuPDF4LLM | Load PDF content to Markdown using PyMuPDF4LLM | Package |
| PDFMiner | Load PDF files using PDFMiner | Package |
| Upstage Document Parse Loader | Load PDF files using UpstageDocumentParseLoader | Package |
| Docling | Load PDF files using Docling | Package |
| UnDatasIO | Load PDF files using UnDatasIO | Package |
| OpenDataLoader PDF | Load PDF files using OpenDataLoader PDF | Package |
Cloud Providers
The below document loaders allow you to load documents from your favorite cloud providers.

| Document Loader | Description | Partner Package | API reference |
|---|---|---|---|
| AWS S3 Directory | Load documents from an AWS S3 directory | ❌ | S3DirectoryLoader |
| AWS S3 File | Load documents from an AWS S3 file | ❌ | S3FileLoader |
| Azure AI Data | Load documents from Azure AI services | ❌ | AzureAIDataLoader |
| Azure Blob Storage | Load documents from Azure Blob Storage | ✅ | AzureBlobStorageLoader |
| Dropbox | Load documents from Dropbox | ❌ | DropboxLoader |
| Google Cloud Storage Directory | Load documents from GCS bucket | ✅ | GCSDirectoryLoader |
| Google Cloud Storage File | Load documents from GCS file object | ✅ | GCSFileLoader |
| Google Drive | Load documents from Google Drive (Google Docs only) | ✅ | GoogleDriveLoader |
| Huawei OBS Directory | Load documents from Huawei Object Storage Service Directory | ❌ | OBSDirectoryLoader |
| Huawei OBS File | Load documents from Huawei Object Storage Service File | ❌ | OBSFileLoader |
| Microsoft OneDrive | Load documents from Microsoft OneDrive | ❌ | OneDriveLoader |
| Microsoft SharePoint | Load documents from Microsoft SharePoint | ❌ | SharePointLoader |
| Tencent COS Directory | Load documents from Tencent Cloud Object Storage Directory | ❌ | TencentCOSDirectoryLoader |
| Tencent COS File | Load documents from Tencent Cloud Object Storage File | ❌ | TencentCOSFileLoader |
Social Platforms
The below document loaders allow you to load documents from different social media platforms.

| Document Loader | API reference |
|---|---|
| Twitter | TwitterTweetLoader |
| Reddit | RedditPostsLoader |
Messaging Services
The below document loaders allow you to load data from different messaging platforms.

| Document Loader | API reference |
|---|---|
| Telegram | TelegramChatFileLoader |
| WhatsApp | WhatsAppChatLoader |
| Discord | DiscordChatLoader |
| Facebook Chat | FacebookChatLoader |
| Mastodon | MastodonTootsLoader |
Productivity tools
The below document loaders allow you to load data from commonly used productivity tools.

| Document Loader | API reference |
|---|---|
| Figma | FigmaFileLoader |
| Notion | NotionDirectoryLoader |
| Slack | SlackDirectoryLoader |
| Quip | QuipLoader |
| Trello | TrelloLoader |
| Roam | RoamLoader |
| GitHub | GithubFileLoader |
Common File Types
The below document loaders allow you to load data from common data formats.

| Document Loader | Data Type |
|---|---|
| CSVLoader | CSV files |
| Unstructured | Many file types (see https://docs.unstructured.io/platform/supported-file-types) |
| JSONLoader | JSON files |
| BSHTMLLoader | HTML files |
| DoclingLoader | Various file types (see https://ds4sd.github.io/docling/) |
| PolarisAIDataInsightLoader | Various file types (see https://datainsight.polarisoffice.com/documentation?docType=doc_extract) |
All document loaders
acreom
AgentQLLoader
AirbyteLoader
Airtable
Alibaba Cloud MaxCompute
Amazon Textract
Apify Dataset
ArxivLoader
AssemblyAI Audio Transcripts
AstraDB
Async Chromium
AsyncHtml
Athena
AWS S3 Directory
AWS S3 File
AZLyrics
Azure AI Data
Azure Blob Storage
Azure AI Document Intelligence
BibTeX
BiliBili
Blackboard
Blockchain
Box
Brave Search
Browserbase
Browserless
BSHTMLLoader
Cassandra
ChatGPT Data
College Confidential
Concurrent Loader
Confluence
CoNLL-U
Copy Paste
Couchbase
CSV
Cube Semantic Layer
Datadog Logs
Dedoc
Diffbot
Discord
Docling
Docugami
Docusaurus
Dropbox
EPub
Etherscan
EverNote
Facebook Chat
Fauna
Figma
FireCrawl
Geopandas
Git
GitBook
GitHub
Glue Catalog
Google AlloyDB for PostgreSQL
Google BigQuery
Google Bigtable
Google Cloud SQL for SQL Server
Google Cloud SQL for MySQL
Google Cloud SQL for PostgreSQL
Google Cloud Storage Directory
Google Cloud Storage File
Google Firestore in Datastore Mode
Google Drive
Google El Carro for Oracle Workloads
Google Firestore (Native Mode)
Google Memorystore for Redis
Google Spanner
Google Speech-to-Text
Grobid
Gutenberg
Hacker News
Huawei OBS Directory
Huawei OBS File
HuggingFace Dataset
HyperbrowserLoader
iFixit
Images
Image Captions
IMSDb
Iugu
Joplin
JSONLoader
Jupyter Notebook
Kinetica
lakeFS
LangSmith
LarkSuite (FeiShu)
LLM Sherpa
Mastodon
MathPixPDFLoader
MediaWiki Dump
Merge Documents Loader
MHTML
Microsoft Excel
Microsoft OneDrive
Microsoft OneNote
Microsoft PowerPoint
Microsoft SharePoint
Microsoft Word
Near Blockchain
Modern Treasury
MongoDB
Needle Document Loader
News URL
Notion DB
Nuclia
Obsidian
OpenDataLoader PDF
Open Document Format (ODT)
Open City Data
Oracle Autonomous Database
Oracle AI Vector Search
Org-mode
Outline Document Loader
Pandas DataFrame
PDFMinerLoader
PDFPlumber
Pebblo Safe DocumentLoader
Polaris AI DataInsight
Polars DataFrame
Dell PowerScale
Psychic
PubMed
PullMdLoader
PyMuPDFLoader
PyMuPDF4LLM
PyPDFDirectoryLoader
PyPDFium2Loader
PyPDFLoader
PySpark
Quip
ReadTheDocs Documentation
Recursive URL
Roam
Rockset
rspace
RSS Feeds
RST
scrapfly
ScrapingAnt
SingleStore
Sitemap
Slack
Snowflake
Source Code
Spider
Spreedly
Stripe
Subtitle
SurrealDB
Telegram
Tencent COS Directory
Tencent COS File
TensorFlow Datasets
TiDB
2Markdown
TOML
Trello
TSV
UnDatasIO
Unstructured
UnstructuredMarkdownLoader
UnstructuredPDFLoader
Upstage
URL
Vsdx
Weather
WebBaseLoader
WhatsApp Chat
Wikipedia
UnstructuredXMLLoader
Xorbits Pandas DataFrame
YouTube Audio
YouTube Transcripts
YoutubeLoaderDL
Yuque
ZeroxPDFLoader