Microsoft SharePoint is a website-based collaboration system, developed by Microsoft, that uses workflow applications, “list” databases, and other web parts and security features to empower business teams to work together.
This notebook covers how to load documents from the SharePoint Document Library. By default, the document loader loads pdf, doc, docx and txt files. You can load other file types by providing appropriate parsers (see more below).
Prerequisites
- Register an application by following the Microsoft identity platform instructions.
- When registration finishes, the Azure portal displays the app registration’s Overview pane. You see the Application (client) ID. Also called the client ID, this value uniquely identifies your application in the Microsoft identity platform.
- While following the steps in item 1, you can set the redirect URI to https://login.microsoftonline.com/common/oauth2/nativeclient
- While following the steps in item 1, generate a new password (client_secret) under the Application Secrets section.
- Follow the instructions at this document to add the following SCOPES (offline_access and Sites.Read.All) to your application.
- To retrieve files from your Document Library, you will need its ID. To obtain it, you will need the values of Tenant Name, Collection ID, and Subsite ID.
- To find your Tenant Name, follow the instructions at this document. Once you have it, remove .onmicrosoft.com from the value and keep the rest as your Tenant Name.
- To obtain your Collection ID and Subsite ID, you will need your SharePoint site-name. Your SharePoint site URL has the following format: https://<tenant-name>.sharepoint.com/sites/<site-name>. The last part of this URL is the site-name.
- To get the Site Collection ID, open this URL in the browser: https://<tenant-name>.sharepoint.com/sites/<site-name>/_api/site/id and copy the value of the Edm.Guid property.
- To get the Subsite ID (or web ID), use: https://<tenant-name>.sharepoint.com/sites/<site-name>/_api/web/id and copy the value of the Edm.Guid property.
- The SharePoint site ID has the following format: <tenant-name>.sharepoint.com,<Collection ID>,<subsite ID>. Keep that value for use in the next step.
- Visit the Graph Explorer Playground to obtain your Document Library ID. First, make sure you are logged in with the account associated with your SharePoint site. Then make a request to https://graph.microsoft.com/v1.0/sites/<SharePoint site ID>/drive. The response returns a payload with a field id that holds your Document Library ID.
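The string handling in the steps above can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical, and "contoso" plus the bracketed IDs are placeholders for your own values.

```python
def tenant_name_from_domain(domain: str) -> str:
    """Drop the .onmicrosoft.com suffix to get the tenant name."""
    suffix = ".onmicrosoft.com"
    return domain[: -len(suffix)] if domain.endswith(suffix) else domain


def sharepoint_site_id(tenant_name: str, collection_id: str, subsite_id: str) -> str:
    """Compose <tenant-name>.sharepoint.com,<Collection ID>,<subsite ID>."""
    return f"{tenant_name}.sharepoint.com,{collection_id},{subsite_id}"


def drive_request_url(site_id: str) -> str:
    """URL to request in Graph Explorer; the response's id field is the library ID."""
    return f"https://graph.microsoft.com/v1.0/sites/{site_id}/drive"


# Placeholder values for illustration only.
tenant = tenant_name_from_domain("contoso.onmicrosoft.com")
site_id = sharepoint_site_id(tenant, "<Collection ID>", "<subsite ID>")
print(drive_request_url(site_id))
```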
🧑 Instructions for ingesting your documents from SharePoint Document Library
🔑 Authentication
By default, the SharePointLoader expects the values of CLIENT_ID and CLIENT_SECRET to be stored as environment variables named O365_CLIENT_ID and O365_CLIENT_SECRET, respectively. You can pass those environment variables through a .env file at the root of your application or set them directly in your script.
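Setting the two variables from a script might look like the following minimal sketch. The string values are placeholders for the client ID and client secret from your own app registration.

```python
import os

# Placeholder values: replace with the credentials of your registered application.
os.environ["O365_CLIENT_ID"] = "YOUR CLIENT ID"
os.environ["O365_CLIENT_SECRET"] = "YOUR CLIENT SECRET"
```

Avoid committing real secrets to source control. A .env file kept out of the repository is usually the safer of the two options.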
After the first interactive authentication, the loader stores a token (o365_token.txt) in the ~/.credentials/ folder. This token can be used later to authenticate without repeating the copy/paste consent steps. To use this token for authentication, set the auth_with_token parameter to True when instantiating the loader.
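As a rough sketch, token reuse could be wired up as below. It assumes the langchain_community package is installed and that the token sits at the default location mentioned above. The has_saved_token and build_loader helpers are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Default token location described above.
TOKEN_PATH = Path.home() / ".credentials" / "o365_token.txt"


def has_saved_token(token_path: Path = TOKEN_PATH) -> bool:
    """Return True if a token file from a previous login exists."""
    return token_path.is_file()


def build_loader(document_library_id: str):
    """Create a loader that reuses the saved token when one is available."""
    # Imported lazily so the helpers above work without langchain_community installed.
    from langchain_community.document_loaders.sharepoint import SharePointLoader

    return SharePointLoader(
        document_library_id=document_library_id,
        auth_with_token=has_saved_token(),
    )


# Usage (placeholder ID):
# loader = build_loader("YOUR DOCUMENT LIBRARY ID")
```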
🗂️ Documents loader
📑 Loading documents from a Document Library Directory
SharePointLoader can load documents from a specific folder within your Document Library. For instance, suppose you want to load all documents stored in the Documents/marketing folder of your Document Library.
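A minimal sketch of that case follows, assuming langchain_community is installed and a token from an earlier login is already saved (hence auth_with_token=True). The load_folder wrapper is a hypothetical helper, and the library ID is a placeholder.

```python
def load_folder(document_library_id: str, folder_path: str = "Documents/marketing"):
    """Load every supported document stored directly in folder_path."""
    # Imported lazily so this sketch can be defined without the package installed.
    from langchain_community.document_loaders.sharepoint import SharePointLoader

    loader = SharePointLoader(
        document_library_id=document_library_id,
        folder_path=folder_path,
        auth_with_token=True,  # assumes a saved token; use False for interactive login
    )
    return loader.load()


# Usage (placeholder ID):
# documents = load_folder("YOUR DOCUMENT LIBRARY ID")
```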
If you receive the error Resource not found for the segment, try using the folder_id instead of the folder path. The folder ID can be obtained from the Microsoft Graph API.
To load documents from the root directory, omit folder_id, folder_path and documents_ids, and the loader will load the root directory.
Combined with recursive=True, this loads all documents from the whole SharePoint.
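The variants above (a folder ID, the root directory, and a recursive load) differ only in which arguments are passed. The sketch below builds the keyword arguments separately so the differences are easy to see. It assumes langchain_community is installed and a saved token exists. loader_kwargs and load_documents are hypothetical helpers.

```python
from typing import Any, Dict, Optional


def loader_kwargs(
    document_library_id: str,
    folder_id: Optional[str] = None,
    recursive: bool = False,
) -> Dict[str, Any]:
    """Build loader arguments; leaving folder_id unset targets the root directory."""
    kwargs: Dict[str, Any] = {
        "document_library_id": document_library_id,
        "auth_with_token": True,  # assumes a token saved by an earlier login
        "recursive": recursive,
    }
    if folder_id is not None:
        kwargs["folder_id"] = folder_id
    return kwargs


def load_documents(**kwargs: Any):
    """Instantiate the loader with the given arguments and load the documents."""
    from langchain_community.document_loaders.sharepoint import SharePointLoader

    return SharePointLoader(**kwargs).load()


# Usage (placeholder IDs):
# load_documents(**loader_kwargs("YOUR DOCUMENT LIBRARY ID", folder_id="<folder-id>"))
# load_documents(**loader_kwargs("YOUR DOCUMENT LIBRARY ID", recursive=True))
```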
📑 Loading documents from a list of Documents IDs
Another possibility is to provide a list of object_id values, one for each document you want to load. For that, you will need to query the Microsoft Graph API to find the IDs of all the documents you are interested in. The Microsoft Graph API documentation provides a list of endpoints that help retrieve document IDs.
For instance, to retrieve information about all objects stored in the data/finance/ folder, you need to make a request to: https://graph.microsoft.com/v1.0/drives/<document-library-id>/root:/data/finance:/children. Once you have the list of IDs you are interested in, you can instantiate the loader with that list.
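The request URL and the ID extraction can be sketched as follows. It assumes the Graph response lists the folder's items under a value key, each with an id field, and that langchain_community is installed for the final step. All helper names are hypothetical, and the HTTP call itself is left out because it needs an access token.

```python
from typing import Any, Dict, List


def children_url(document_library_id: str, folder_path: str) -> str:
    """Graph endpoint listing the objects stored in folder_path."""
    path = folder_path.strip("/")
    return (
        f"https://graph.microsoft.com/v1.0/drives/{document_library_id}"
        f"/root:/{path}:/children"
    )


def extract_ids(payload: Dict[str, Any]) -> List[str]:
    """Collect the id of every item in a children response payload."""
    return [item["id"] for item in payload.get("value", [])]


def load_by_ids(document_library_id: str, object_ids: List[str]):
    """Load only the documents whose IDs are listed."""
    from langchain_community.document_loaders.sharepoint import SharePointLoader

    loader = SharePointLoader(
        document_library_id=document_library_id,
        object_ids=object_ids,
        auth_with_token=True,  # assumes a token saved by an earlier login
    )
    return loader.load()


# Usage (placeholder values):
# ids = extract_ids(response_json)  # response_json: parsed body of the Graph call
# documents = load_by_ids("YOUR DOCUMENT LIBRARY ID", ids)
```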
📑 Choosing supported file types and preferred parsers
By default, SharePointLoader loads the file types defined in document_loaders/parsers/registry using the default parsers (see below).
To load other file types or use different parsers, pass a handlers argument to SharePointLoader.
Pass a dictionary mapping either file extensions (like "doc", "pdf", etc.)
or MIME types (like "application/pdf", "text/plain", etc.) to parsers.
Note that you must use either file extensions or MIME types exclusively and
cannot mix them.
Do not include the leading dot for file extensions.
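The key rules above can be checked before the dictionary is handed to the loader. The sketch below is one way to do it: check_handler_keys is a hypothetical helper that treats any key containing "/" as a MIME type, which is an assumption made for illustration and not the loader's own validation. The parser imports assume langchain_community and the parsers' dependencies are installed.

```python
from typing import Any, Dict


def check_handler_keys(handlers: Dict[str, Any]) -> str:
    """Return "mime" or "extension"; raise ValueError if the keys break the rules."""
    if not handlers:
        raise ValueError("handlers must not be empty")
    if any(key.startswith(".") for key in handlers):
        raise ValueError("file extensions must not include the leading dot")
    # Assumption: a key containing "/" is a MIME type, anything else an extension.
    kinds = {"/" in key for key in handlers}
    if len(kinds) > 1:
        raise ValueError("use either file extensions or MIME types, not both")
    return "mime" if kinds.pop() else "extension"


def build_loader_with_handlers(document_library_id: str):
    """Create a loader that parses only doc and pdf files with chosen parsers."""
    from langchain_community.document_loaders.parsers.msword import MsWordParser
    from langchain_community.document_loaders.parsers.pdf import PDFMinerParser
    from langchain_community.document_loaders.sharepoint import SharePointLoader

    handlers = {"doc": MsWordParser(), "pdf": PDFMinerParser()}
    check_handler_keys(handlers)
    return SharePointLoader(
        document_library_id=document_library_id,
        handlers=handlers,
        auth_with_token=True,  # assumes a token saved by an earlier login
    )
```

A MIME-type variant would use keys such as "application/pdf" and "text/plain" for every entry instead.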