Document Processing


📚 Document Processing in LangChain RAG

The Library Analogy 📖

Imagine you have a HUGE library with thousands of books. You want to find answers to questions—but reading every book would take forever!

What if you could:

  1. Bring books into your library (Document Loading)
  2. Label each book with helpful info (Metadata)
  3. Cut books into easy-to-read cards (Text Splitting)
  4. Group cards by meaning (Semantic Chunking)
  5. Make cards the perfect size (Chunk Tuning)
  6. Clean and organize cards (Document Transformers)

That’s exactly what Document Processing does for AI! Let’s explore each step.


1. Document Loading Strategies 📥

What Is It?

Document loading is how we bring information INTO our AI system. Just like opening a book before you can read it!

The Story

Think of a librarian who can read ANY type of book:

  • 📄 Regular paper books (PDF files)
  • 💻 Computer screens (Web pages)
  • 📝 Handwritten notes (Text files)
  • 📊 Number charts (CSV/Excel)

LangChain has special “readers” called Document Loaders for each type!

Common Loaders

# Load a PDF file
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("story.pdf")
docs = loader.load()

# Load a web page
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://example.com")
docs = loader.load()

# Load a text file
from langchain_community.document_loaders import TextLoader
loader = TextLoader("notes.txt")
docs = loader.load()

🎯 Key Point

Different files need different loaders—like needing different keys for different doors!

graph TD
    A["Your Files"] --> B{File Type?}
    B -->|PDF| C["PyPDFLoader"]
    B -->|Web| D["WebBaseLoader"]
    B -->|Text| E["TextLoader"]
    B -->|CSV| F["CSVLoader"]
    C --> G["Documents Ready!"]
    D --> G
    E --> G
    F --> G
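The diagram also mentions CSVLoader, which has no snippet above. As a rough, stdlib-only sketch of what a CSV loader produces (plain dicts stand in for LangChain Document objects, and the helper name is made up for illustration):

```python
import csv

def rows_to_documents(lines, source):
    """Turn each CSV row into a document-like dict: the row rendered as
    "key: value" lines becomes the content; the file and row number
    become the metadata."""
    docs = []
    for i, row in enumerate(csv.DictReader(lines)):
        content = "\n".join(f"{k}: {v}" for k, v in row.items())
        docs.append({"page_content": content,
                     "metadata": {"source": source, "row": i}})
    return docs

docs = rows_to_documents(["name,age", "Ada,36", "Alan,41"], "people.csv")
```

Each row becomes its own small document, which is exactly the granularity you want for retrieval over tabular data.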

2. Document Object and Metadata 🏷️

What Is It?

A Document in LangChain is like a card with TWO parts:

  1. page_content - The actual text (what the book says)
  2. metadata - Extra info (who wrote it, when, which page)

The Story

Imagine a library card for each book:

  • Front: The story itself
  • Back: Title, author, page number, date added

This “back of the card” info helps you find things FAST!

Example

from langchain_core.documents import Document

# Create a document with metadata
doc = Document(
    page_content="The sun is a star.",
    metadata={
        "source": "science_book.pdf",
        "page": 42,
        "author": "Dr. Smith",
        "topic": "astronomy"
    }
)

# Access the parts
print(doc.page_content)
# "The sun is a star."

print(doc.metadata["page"])
# 42

Why Metadata Matters 🌟

| Without Metadata | With Metadata |
| --- | --- |
| “The answer is 42” | “The answer is 42” from page 5 of math_guide.pdf |
| No context | Full context! |
| Can’t verify | Can check source |

🎯 Key Point

Metadata is your treasure map—it shows exactly WHERE information came from!
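Metadata also makes filtering cheap. A minimal sketch using plain dicts in place of Document objects (the helper is hypothetical, not a LangChain API; with real Document objects you would read `d.metadata` instead of `d["metadata"]`):

```python
def filter_by_metadata(docs, **wanted):
    """Keep only documents whose metadata matches every given key/value pair."""
    return [d for d in docs
            if all(d["metadata"].get(k) == v for k, v in wanted.items())]

library = [
    {"page_content": "The sun is a star.", "metadata": {"topic": "astronomy", "page": 42}},
    {"page_content": "Dogs are loyal.",    "metadata": {"topic": "biology",   "page": 7}},
]

astronomy_docs = filter_by_metadata(library, topic="astronomy")
```

Vector stores expose the same idea as "metadata filters": narrow the search to matching documents before comparing embeddings.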


3. Text Splitting Strategies ✂️

What Is It?

Text splitting means cutting BIG documents into SMALL pieces (called “chunks”).

The Story

Imagine you have a 500-page book. Your AI brain can only look at one small piece at a time. So we need to:

  • Cut the book into small cards
  • Make sure each card makes sense on its own
  • Keep related ideas together

Why Split?

  • AI models have limited memory (context window)
  • Smaller chunks = faster searching
  • Better chunks = better answers

Common Splitting Methods

1. Character Splitter (Simple cuts)

from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20
)
chunks = splitter.split_text(long_text)

2. Recursive Splitter (Smart cuts)

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_documents(docs)

How Recursive Splitting Works

graph TD
    A["Big Document"] --> B{Try split by paragraph}
    B -->|Too big?| C{Try split by line}
    C -->|Too big?| D{Try split by word}
    D -->|Still big?| E["Split by character"]
    B -->|Good size!| F["Done ✓"]
    C -->|Good size!| F
    D -->|Good size!| F
    E --> F
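The fallback logic above can be sketched in a few lines of plain Python. This is a simplification with a made-up helper name; the real RecursiveCharacterTextSplitter also merges small pieces back together toward chunk_size and applies overlap:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split by the first separator; recurse with finer separators on any
    piece that is still too big. The empty-string separator is the last
    resort: a hard cut every chunk_size characters."""
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            if piece:
                chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks
```

Because paragraphs are tried before lines, and lines before words, a chunk only gets cut mid-sentence when nothing larger fits.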

🎯 Key Point

RecursiveCharacterTextSplitter is the BEST for most cases—it keeps ideas together!


4. Semantic Chunking 🧠

What Is It?

Semantic chunking splits text by MEANING, not just by size. It keeps related ideas together!

The Story

Regular splitting is like cutting a cake with a ruler—you might cut through the best part!

Semantic splitting is like cutting between layers—each piece is complete and delicious!

How It Works

  1. Look at each sentence
  2. Check if it’s SIMILAR to nearby sentences
  3. Keep similar sentences together
  4. Split when meaning CHANGES
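The four steps above can be sketched with a toy embedding function standing in for a real model like OpenAIEmbeddings (all names here are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def semantic_chunks(sentences, embed, threshold=0.5):
    """Walk the sentences in order; start a new chunk whenever similarity
    to the previous sentence drops below the threshold (meaning changed)."""
    chunks = [[sentences[0]]]
    prev = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev, vec) < threshold:
            chunks.append([])  # topic shift: open a new chunk
        chunks[-1].append(sent)
        prev = vec
    return [" ".join(c) for c in chunks]

def toy_embed(text):
    """Stand-in for a real embedding model: a 2-d 'topic' vector that counts
    space-words vs. animal-words (tiny offset avoids zero vectors)."""
    words = text.lower().split()
    return [0.01 + sum(w.startswith(("sun", "star")) for w in words),
            0.01 + sum(w.startswith(("dog", "cat")) for w in words)]

chunks = semantic_chunks(
    ["The sun is a star.", "Stars shine bright.",
     "Dogs chase cats.", "Cats nap all day."],
    toy_embed)
```

With a real embedding model, similarity is measured in a high-dimensional space, but the split-where-similarity-drops idea is the same.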

Example

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Create semantic chunker
chunker = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"
)

# Split by meaning!
chunks = chunker.split_text(long_text)

Regular vs Semantic Splitting

| Regular Splitting | Semantic Splitting |
| --- | --- |
| Cuts at character count | Cuts at meaning boundaries |
| May split mid-sentence | Keeps complete thoughts |
| Fast but rough | Smarter but slower |
| Good for simple text | Great for complex topics |

🎯 Key Point

Use semantic chunking when your content has different topics mixed together!


5. Chunk Size and Overlap Tuning ⚙️

What Is It?

Chunk size = how big each piece should be.
Overlap = how much text neighboring pieces share at their edges.

The Story

Imagine cutting a photo into puzzle pieces:

  • Too small: You can’t see the picture in each piece
  • Too big: Pieces don’t fit together well
  • Overlap: Like puzzle pieces with matching edges—they connect better!

The Magic Numbers

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # Size of each chunk
    chunk_overlap=50   # Shared text between chunks
)

Visualizing Overlap

Chunk 1: [AAAAAAAAAABBBB]
Chunk 2:           [BBBBCCCCCCCCCC]
Chunk 3:                     [CCCCDDDDDDDDDD]

The "BBBB" and "CCCC" parts OVERLAP!
This keeps context connected.
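The picture above is just a sliding window. A stdlib-only sketch (real splitters prefer to cut at separators rather than fixed character offsets):

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Slide a window of chunk_size characters, moving forward by
    (chunk_size - chunk_overlap) so neighboring chunks share their edges."""
    step = chunk_size - chunk_overlap
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + chunk_size])
        if i + chunk_size >= len(text):  # last window reached the end
            break
    return chunks

chunks = chunk_with_overlap("ABCDEFGHIJ", chunk_size=6, chunk_overlap=2)
# The last 2 characters of chunks[0] are the first 2 of chunks[1]
```

A sentence that straddles a boundary appears whole in at least one of the two chunks, which is exactly why overlap helps retrieval.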

Tuning Guide

| Chunk Size | Best For |
| --- | --- |
| 100-300 | Short Q&A, definitions |
| 300-500 | General documents |
| 500-1000 | Technical/detailed content |
| 1000+ | Long-form analysis |

| Overlap | When To Use |
| --- | --- |
| 0% | Independent facts |
| 10-15% | General documents |
| 20-30% | Connected narratives |

Example Tuning

# For a FAQ document (short answers)
faq_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20
)

# For a research paper (dense content)
paper_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100
)

🎯 Key Point

Start with 500/50 (chunk_size/overlap), then adjust based on your results!


6. Document Transformers 🔄

What Is It?

Document Transformers clean up and improve your chunks AFTER splitting.

The Story

After cutting your book into cards, you might want to:

  • Remove duplicate cards
  • Clean up messy text
  • Add more helpful labels
  • Translate languages

Common Transformers

1. Remove Duplicates

from langchain_community.document_transformers import EmbeddingsRedundantFilter

redundant_filter = EmbeddingsRedundantFilter(
    embeddings=embeddings  # any embedding model works here
)
unique_docs = redundant_filter.transform_documents(docs)

2. Clean HTML

from langchain_community.document_transformers import Html2TextTransformer

transformer = Html2TextTransformer()
clean_docs = transformer.transform_documents(html_docs)

3. Reorder for Long Context

from langchain_community.document_transformers import LongContextReorder

# Moves the most relevant documents to the start and end of the list,
# where models pay the most attention ("lost in the middle" mitigation)
reorderer = LongContextReorder()
ordered_docs = reorderer.transform_documents(docs)

Transformer Pipeline

graph TD
    A["Raw Documents"] --> B["Split into Chunks"]
    B --> C["Remove Duplicates"]
    C --> D["Clean HTML/Formatting"]
    D --> E["Reorder for Context"]
    E --> F["Ready for AI! ✨"]
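A transformer is conceptually just a function from a list of documents to a list of documents. Here is an exact-match stand-in for the duplicate-removal step, using plain dicts (the real EmbeddingsRedundantFilter goes further and catches near-duplicates by comparing embeddings):

```python
def drop_exact_duplicates(docs):
    """Keep only the first copy of each distinct page_content,
    ignoring case and surrounding whitespace."""
    seen, unique = set(), []
    for d in docs:
        key = d["page_content"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return unique

docs = [{"page_content": "The sun is a star."},
        {"page_content": "the sun is a star.  "},   # duplicate, dropped
        {"page_content": "Dogs are loyal."}]
unique_docs = drop_exact_duplicates(docs)
```

Because each transformer has this same list-in, list-out shape, they compose naturally into the pipeline shown above.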

🎯 Key Point

Transformers are your clean-up crew—they make your chunks perfect for AI!


🎉 The Complete Pipeline

Now you understand the WHOLE process:

graph TD
    A["📄 Your Files"] --> B["📥 Load Documents"]
    B --> C["🏷️ Add Metadata"]
    C --> D["✂️ Split Text"]
    D --> E["🧠 Semantic Chunking"]
    E --> F["⚙️ Tune Size/Overlap"]
    F --> G["🔄 Transform & Clean"]
    G --> H["🚀 Ready for RAG!"]

Full Example

# 1. Load
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("my_book.pdf")
docs = loader.load()

# 2. Split with good settings
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = splitter.split_documents(docs)

# 3. Transform (remove duplicates)
from langchain_community.document_transformers import EmbeddingsRedundantFilter
redundant_filter = EmbeddingsRedundantFilter(
    embeddings=embeddings  # any embedding model works here
)
final_docs = redundant_filter.transform_documents(chunks)

# Now ready for vector storage & RAG! 🎯

🌟 Summary

| Step | Purpose | Think Of It As… |
| --- | --- | --- |
| Loading | Bring files in | Opening a book |
| Metadata | Add helpful labels | Library card |
| Splitting | Cut into pieces | Making flashcards |
| Semantic | Split by meaning | Grouping related cards |
| Tuning | Perfect the size | Finding the sweet spot |
| Transform | Clean and polish | Final inspection |

💪 You Did It!

You now understand how to prepare documents for RAG:

  • ✅ Load any file type
  • ✅ Attach useful metadata
  • ✅ Split text smartly
  • ✅ Keep meanings together
  • ✅ Tune for best results
  • ✅ Clean and transform

Your AI can now read your library like a pro! 📚🤖
