📚 Document Processing in LangChain RAG
The Library Analogy 📖
Imagine you have a HUGE library with thousands of books. You want to find answers to questions—but reading every book would take forever!
What if you could:
- Bring books into your library (Document Loading)
- Label each book with helpful info (Metadata)
- Cut books into easy-to-read cards (Text Splitting)
- Group cards by meaning (Semantic Chunking)
- Make cards the perfect size (Chunk Tuning)
- Clean and organize cards (Document Transformers)
That’s exactly what Document Processing does for AI! Let’s explore each step.
1. Document Loading Strategies 📥
What Is It?
Document loading is how we bring information INTO our AI system. Just like opening a book before you can read it!
The Story
Think of a librarian who can read ANY type of book:
- 📄 Regular paper books (PDF files)
- 💻 Computer screens (Web pages)
- 📝 Handwritten notes (Text files)
- 📊 Number charts (CSV/Excel)
LangChain has special “readers” called Document Loaders for each type!
Common Loaders
# Load a PDF file
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("story.pdf")
docs = loader.load()
# Load a web page
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://example.com")
docs = loader.load()
# Load a text file
from langchain.document_loaders import TextLoader
loader = TextLoader("notes.txt")
docs = loader.load()
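CSV data gets a loader too. A minimal sketch in the same style (the file name data.csv is just a placeholder; in newer LangChain releases these loaders live in langchain_community.document_loaders):
# Load a CSV file (one Document per row)
from langchain.document_loaders import CSVLoader
loader = CSVLoader("data.csv")
docs = loader.load()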
🎯 Key Point
Different files need different loaders—like needing different keys for different doors!
graph TD A["Your Files"] --> B{File Type?} B -->|PDF| C["PyPDFLoader"] B -->|Web| D["WebBaseLoader"] B -->|Text| E["TextLoader"] B -->|CSV| F["CSVLoader"] C --> G["Documents Ready!"] D --> G E --> G F --> G
2. Document Object and Metadata 🏷️
What Is It?
A Document in LangChain is like a card with TWO parts:
- page_content - The actual text (what the book says)
- metadata - Extra info (who wrote it, when, which page)
The Story
Imagine a library card for each book:
- Front: The story itself
- Back: Title, author, page number, date added
This “back of the card” info helps you find things FAST!
Example
from langchain.schema import Document
# Create a document with metadata
doc = Document(
page_content="The sun is a star.",
metadata={
"source": "science_book.pdf",
"page": 42,
"author": "Dr. Smith",
"topic": "astronomy"
}
)
# Access the parts
print(doc.page_content)
# "The sun is a star."
print(doc.metadata["page"])
# 42
Why Metadata Matters 🌟
| Without Metadata | With Metadata |
|---|---|
| “The answer is 42” | “The answer is 42” from page 5 of math_guide.pdf |
| No context | Full context! |
| Can’t verify | Can check source |
🎯 Key Point
Metadata is your treasure map—it shows exactly WHERE information came from!
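For example, once every chunk carries its source, you can narrow a search with plain Python. A tiny sketch, assuming docs is a list of Document objects like the ones loaders return:
# Keep only chunks that came from the science book
science_docs = [
    d for d in docs
    if d.metadata.get("source") == "science_book.pdf"
]
print(len(science_docs), "chunks from science_book.pdf")
Loaders such as PyPDFLoader fill in source and page automatically, and most vector stores can filter on these fields at query time.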
3. Text Splitting Strategies ✂️
What Is It?
Text splitting means cutting BIG documents into SMALL pieces (called “chunks”).
The Story
Imagine you have a 500-page book. Your AI brain can only look at one small piece at a time. So we need to:
- Cut the book into small cards
- Make sure each card makes sense on its own
- Keep related ideas together
Why Split?
- AI models have limited memory (context window; see the token-counting sketch below)
- Smaller chunks = faster searching
- Better chunks = better answers
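To make the "limited memory" point concrete, you can count tokens before picking a chunk size. A rough sketch, assuming the tiktoken package is installed (the encoding name matches recent OpenAI models):
import tiktoken
# Count tokens roughly the way OpenAI models do
encoder = tiktoken.get_encoding("cl100k_base")
long_text = "The sun is a star. " * 2000
print(len(encoder.encode(long_text)), "tokens")
# If this number is bigger than your model's context window,
# the text must be split before the model can read it.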
Common Splitting Methods
1. Character Splitter (Simple cuts)
from langchain.text_splitter import (
CharacterTextSplitter
)
splitter = CharacterTextSplitter(
chunk_size=100,
chunk_overlap=20
)
chunks = splitter.split_text(long_text)
2. Recursive Splitter (Smart cuts)
from langchain.text_splitter import (
RecursiveCharacterTextSplitter
)
splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_documents(docs)
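It helps to peek at what came out. A quick follow-up, assuming docs and chunks come from the snippet above:
# How many chunks, and how big is the biggest one?
print(len(docs), "documents in,", len(chunks), "chunks out")
print(max(len(c.page_content) for c in chunks), "characters in the longest chunk")
print(chunks[0].metadata)  # each chunk keeps its source document's metadata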
How Recursive Splitting Works
graph TD A["Big Document"] --> B{Try split by paragraph} B -->|Too big?| C{Try split by line} C -->|Too big?| D{Try split by word} D -->|Still big?| E["Split by character"] B -->|Good size!| F["Done âś“"] C -->|Good size!| F D -->|Good size!| F E --> F
🎯 Key Point
RecursiveCharacterTextSplitter is the BEST for most cases—it keeps ideas together!
4. Semantic Chunking 🧠
What Is It?
Semantic chunking splits text by MEANING, not just by size. It keeps related ideas together!
The Story
Regular splitting is like cutting a cake with a ruler—you might cut through the best part!
Semantic splitting is like cutting between layers—each piece is complete and delicious!
How It Works
- Look at each sentence
- Check if it’s SIMILAR to nearby sentences
- Keep similar sentences together
- Split when meaning CHANGES
Example
from langchain_experimental.text_splitter import (
SemanticChunker
)
from langchain.embeddings import OpenAIEmbeddings
# Create semantic chunker
chunker = SemanticChunker(
OpenAIEmbeddings(),
breakpoint_threshold_type="percentile"
)
# Split by meaning!
chunks = chunker.split_text(long_text)
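Under the hood, the idea from the four steps above looks roughly like this. This is a hand-rolled sketch to show the concept, not how SemanticChunker is actually implemented; the sentences and the 0.8 threshold are made up, and the embeddings call needs an OpenAI API key:
import numpy as np
from langchain.embeddings import OpenAIEmbeddings
sentences = [
    "The sun is a star.",
    "Stars make their own light and heat.",
    "My cat sleeps most of the day.",
    "Cats are famous for long naps.",
]
# Embed every sentence
vectors = OpenAIEmbeddings().embed_documents(sentences)
def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
# Start a new chunk whenever similarity between neighbors drops
chunks, current = [], [sentences[0]]
for i in range(1, len(sentences)):
    if cosine(vectors[i - 1], vectors[i]) < 0.8:  # made-up threshold
        chunks.append(" ".join(current))
        current = []
    current.append(sentences[i])
chunks.append(" ".join(current))
print(chunks)
# Expect the star sentences in one chunk and the cat sentences
# in another (the exact split depends on the model)
SemanticChunker does the same kind of comparison, but picks its breakpoints statistically (for example, by percentile of the distances) instead of using a fixed threshold.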
Regular vs Semantic Splitting
| Regular Splitting | Semantic Splitting |
|---|---|
| Cuts at character count | Cuts at meaning boundaries |
| May split mid-sentence | Keeps complete thoughts |
| Fast but rough | Smarter but slower |
| Good for simple text | Great for complex topics |
🎯 Key Point
Use semantic chunking when your content has different topics mixed together!
5. Chunk Size and Overlap Tuning ⚙️
What Is It?
- Chunk size = how big each piece should be
- Overlap = how much neighboring pieces share at the edges
The Story
Imagine cutting a photo into puzzle pieces:
- Too small: You can’t see the picture in each piece
- Too big: Pieces don’t fit together well
- Overlap: Like puzzle pieces with matching edges—they connect better!
The Magic Numbers
splitter = RecursiveCharacterTextSplitter(
chunk_size=500, # Size of each chunk
chunk_overlap=50 # Shared text between chunks
)
Visualizing Overlap
Chunk 1: [AAAAAAAAAABBBB]
Chunk 2: [BBBBCCCCCCCCCC]
Chunk 3: [CCCCDDDDDDDDDD]
The "BBBB" and "CCCC" parts OVERLAP!
This keeps context connected.
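You can watch the overlap happen with a tiny experiment. A quick sketch; the sentence and the sizes are only for illustration:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text = "Chunk overlap keeps neighboring pieces connected so no idea is cut off at the edge."
splitter = RecursiveCharacterTextSplitter(
    chunk_size=40,
    chunk_overlap=10,
    separators=[" ", ""]
)
for chunk in splitter.split_text(text):
    print(repr(chunk))
# Every chunk after the first starts with up to 10 characters
# repeated from the end of the previous chunk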
Tuning Guide
| Chunk Size (characters) | Best For |
|---|---|
| 100-300 | Short Q&A, definitions |
| 300-500 | General documents |
| 500-1000 | Technical/detailed content |
| 1000+ | Long-form analysis |
| Overlap (% of chunk size) | When To Use |
|---|---|
| 0% | Independent facts |
| 10-15% | General documents |
| 20-30% | Connected narratives |
Example Tuning
# For a FAQ document (short answers)
faq_splitter = RecursiveCharacterTextSplitter(
chunk_size=200,
chunk_overlap=20
)
# For a research paper (dense content)
paper_splitter = RecursiveCharacterTextSplitter(
chunk_size=800,
chunk_overlap=100
)
🎯 Key Point
Start with 500/50 (chunk_size/overlap), then adjust based on your results!
6. Document Transformers 🔄
What Is It?
Document Transformers clean up and improve your chunks AFTER splitting.
The Story
After cutting your book into cards, you might want to:
- Remove duplicate cards
- Clean up messy text
- Add more helpful labels
- Translate languages
Common Transformers
1. Remove Duplicates
from langchain.document_transformers import (
    EmbeddingsRedundantFilter
)
from langchain.embeddings import OpenAIEmbeddings
# Compares chunks by embedding and drops near-duplicates
embeddings = OpenAIEmbeddings()
redundant_filter = EmbeddingsRedundantFilter(
    embeddings=embeddings
)
unique_docs = redundant_filter.transform_documents(docs)
2. Clean HTML
from langchain.document_transformers import (
Html2TextTransformer
)
transformer = Html2TextTransformer()
clean_docs = transformer.transform_documents(
html_docs
)
3. Reorder for Long Context
LongContextReorder doesn't add anything new; it re-sorts a list of documents so the most relevant ones sit at the beginning and end, where models pay the most attention (the "lost in the middle" problem). It's most useful on the results a retriever returns.
from langchain.document_transformers import (
LongContextReorder
)
reorderer = LongContextReorder()
ordered_docs = reorderer.transform_documents(
docs
)
Transformer Pipeline
graph TD A["Raw Documents"] --> B["Split into Chunks"] B --> C["Remove Duplicates"] C --> D["Clean HTML/Formatting"] D --> E["Reorder for Context"] E --> F["Ready for AI! ✨"]
🎯 Key Point
Transformers are your clean-up crew—they make your chunks perfect for AI!
🎉 The Complete Pipeline
Now you understand the WHOLE process:
graph TD A["📄 Your Files"] --> B["📥 Load Documents"] B --> C["🏷️ Add Metadata"] C --> D["✂️ Split Text"] D --> E["🧠Semantic Chunking"] E --> F["⚙️ Tune Size/Overlap"] F --> G["🔄 Transform & Clean"] G --> H["🚀 Ready for RAG!"]
Full Example
# 1. Load
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("my_book.pdf")
docs = loader.load()
# 2. Split with good settings
from langchain.text_splitter import (
RecursiveCharacterTextSplitter
)
splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50
)
chunks = splitter.split_documents(docs)
# 3. Transform (remove duplicates)
from langchain.document_transformers import (
EmbeddingsRedundantFilter
)
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
redundant_filter = EmbeddingsRedundantFilter(
    embeddings=embeddings
)
final_docs = redundant_filter.transform_documents(chunks)
# Now ready for vector storage & RAG! 🎯
🌟 Summary
| Step | Purpose | Think Of It As… |
|---|---|---|
| Loading | Bring files in | Opening a book |
| Metadata | Add helpful labels | Library card |
| Splitting | Cut into pieces | Making flashcards |
| Semantic | Split by meaning | Grouping related cards |
| Tuning | Perfect the size | Finding the sweet spot |
| Transform | Clean and polish | Final inspection |
💪 You Did It!
You now understand how to prepare documents for RAG:
- ✅ Load any file type
- ✅ Attach useful metadata
- ✅ Split text smartly
- ✅ Keep meanings together
- ✅ Tune for best results
- ✅ Clean and transform
Your AI can now read your library like a pro! 📚🤖
