Document Processing


📚 Document Processing in LangChain RAG

The Library Analogy 📖

Imagine you have a HUGE library with thousands of books. You want to find answers to questions—but reading every book would take forever!

What if you could:

  1. Bring books into your library (Document Loading)
  2. Label each book with helpful info (Metadata)
  3. Cut books into easy-to-read cards (Text Splitting)
  4. Group cards by meaning (Semantic Chunking)
  5. Make cards the perfect size (Chunk Tuning)
  6. Clean and organize cards (Document Transformers)

That’s exactly what Document Processing does for AI! Let’s explore each step.


1. Document Loading Strategies 📥

What Is It?

Document loading is how we bring information INTO our AI system. Just like opening a book before you can read it!

The Story

Think of a librarian who can read ANY type of book:

  • 📄 Regular paper books (PDF files)
  • 💻 Computer screens (Web pages)
  • 📝 Handwritten notes (Text files)
  • 📊 Number charts (CSV/Excel)

LangChain has special “readers” called Document Loaders for each type!

Common Loaders

# Load a PDF file
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("story.pdf")
docs = loader.load()

# Load a web page
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://example.com")
docs = loader.load()

# Load a text file
from langchain_community.document_loaders import TextLoader
loader = TextLoader("notes.txt")
docs = loader.load()

🎯 Key Point

Different files need different loaders—like needing different keys for different doors!

graph TD
    A["Your Files"] --> B{File Type?}
    B -->|PDF| C["PyPDFLoader"]
    B -->|Web| D["WebBaseLoader"]
    B -->|Text| E["TextLoader"]
    B -->|CSV| F["CSVLoader"]
    C --> G["Documents Ready!"]
    D --> G
    E --> G
    F --> G
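The diagram also mentions CSVLoader, which has no snippet above. As a rough, stdlib-only sketch of what a CSV loader produces (plain dicts stand in for LangChain Document objects, and the helper name is made up for illustration):

```python
import csv

def rows_to_documents(lines, source):
    """Turn each CSV row into a document-like dict: the row rendered as
    "key: value" lines becomes the content; the file and row number
    become the metadata."""
    docs = []
    for i, row in enumerate(csv.DictReader(lines)):
        content = "\n".join(f"{k}: {v}" for k, v in row.items())
        docs.append({"page_content": content,
                     "metadata": {"source": source, "row": i}})
    return docs

docs = rows_to_documents(["name,age", "Ada,36", "Alan,41"], "people.csv")
```

Each row becomes its own small document, which is exactly the granularity you want for retrieval over tabular data.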

2. Document Object and Metadata 🏷️

What Is It?

A Document in LangChain is like a card with TWO parts:

  1. page_content - The actual text (what the book says)
  2. metadata - Extra info (who wrote it, when, which page)

The Story

Imagine a library card for each book:

  • Front: The story itself
  • Back: Title, author, page number, date added

This “back of the card” info helps you find things FAST!

Example

from langchain_core.documents import Document

# Create a document with metadata
doc = Document(
    page_content="The sun is a star.",
    metadata={
        "source": "science_book.pdf",
        "page": 42,
        "author": "Dr. Smith",
        "topic": "astronomy"
    }
)

# Access the parts
print(doc.page_content)
# "The sun is a star."

print(doc.metadata["page"])
# 42

Why Metadata Matters 🌟

| Without Metadata | With Metadata |
| --- | --- |
| “The answer is 42” | “The answer is 42” from page 5 of math_guide.pdf |
| No context | Full context! |
| Can’t verify | Can check source |

🎯 Key Point

Metadata is your treasure map—it shows exactly WHERE information came from!
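Metadata also makes filtering cheap. A minimal sketch using plain dicts in place of Document objects (the helper is hypothetical, not a LangChain API; with real Document objects you would read `d.metadata` instead of `d["metadata"]`):

```python
def filter_by_metadata(docs, **wanted):
    """Keep only documents whose metadata matches every given key/value pair."""
    return [d for d in docs
            if all(d["metadata"].get(k) == v for k, v in wanted.items())]

library = [
    {"page_content": "The sun is a star.", "metadata": {"topic": "astronomy", "page": 42}},
    {"page_content": "Dogs are loyal.",    "metadata": {"topic": "biology",   "page": 7}},
]

astronomy_docs = filter_by_metadata(library, topic="astronomy")
```

Vector stores expose the same idea as "metadata filters": narrow the search to matching documents before comparing embeddings.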


3. Text Splitting Strategies ✂️

What Is It?

Text splitting means cutting BIG documents into SMALL pieces (called “chunks”).

The Story

Imagine you have a 500-page book. Your AI brain can only look at one small piece at a time. So we need to:

  • Cut the book into small cards
  • Make sure each card makes sense on its own
  • Keep related ideas together

Why Split?

  • AI models have limited memory (context window)
  • Smaller chunks = faster searching
  • Better chunks = better answers

Common Splitting Methods

1. Character Splitter (Simple cuts)

from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20
)
chunks = splitter.split_text(long_text)

2. Recursive Splitter (Smart cuts)

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_documents(docs)

How Recursive Splitting Works

graph TD
    A["Big Document"] --> B{Try split by paragraph}
    B -->|Too big?| C{Try split by line}
    C -->|Too big?| D{Try split by word}
    D -->|Still big?| E["Split by character"]
    B -->|Good size!| F["Done ✓"]
    C -->|Good size!| F
    D -->|Good size!| F
    E --> F
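The fallback logic above can be sketched in a few lines of plain Python. This is a simplification with a made-up helper name; the real RecursiveCharacterTextSplitter also merges small pieces back together toward chunk_size and applies overlap:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split by the first separator; recurse with finer separators on any
    piece that is still too big. The empty-string separator is the last
    resort: a hard cut every chunk_size characters."""
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            if piece:
                chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks
```

Because paragraphs are tried before lines, and lines before words, a chunk only gets cut mid-sentence when nothing larger fits.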

🎯 Key Point

RecursiveCharacterTextSplitter is the BEST for most cases—it keeps ideas together!


4. Semantic Chunking 🧠

What Is It?

Semantic chunking splits text by MEANING, not just by size. It keeps related ideas together!

The Story

Regular splitting is like cutting a cake with a ruler—you might cut through the best part!

Semantic splitting is like cutting between layers—each piece is complete and delicious!

How It Works

  1. Look at each sentence
  2. Check if it’s SIMILAR to nearby sentences
  3. Keep similar sentences together
  4. Split when meaning CHANGES
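The four steps above can be sketched with a toy embedding function standing in for a real model like OpenAIEmbeddings (all names here are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def semantic_chunks(sentences, embed, threshold=0.5):
    """Walk the sentences in order; start a new chunk whenever similarity
    to the previous sentence drops below the threshold (meaning changed)."""
    chunks = [[sentences[0]]]
    prev = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev, vec) < threshold:
            chunks.append([])  # topic shift: open a new chunk
        chunks[-1].append(sent)
        prev = vec
    return [" ".join(c) for c in chunks]

def toy_embed(text):
    """Stand-in for a real embedding model: a 2-d 'topic' vector that counts
    space-words vs. animal-words (tiny offset avoids zero vectors)."""
    words = text.lower().split()
    return [0.01 + sum(w.startswith(("sun", "star")) for w in words),
            0.01 + sum(w.startswith(("dog", "cat")) for w in words)]

chunks = semantic_chunks(
    ["The sun is a star.", "Stars shine bright.",
     "Dogs chase cats.", "Cats nap all day."],
    toy_embed)
```

With a real embedding model, similarity is measured in a high-dimensional space, but the split-where-similarity-drops idea is the same.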

Example

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Create semantic chunker
chunker = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"
)

# Split by meaning!
chunks = chunker.split_text(long_text)

Regular vs Semantic Splitting

| Regular Splitting | Semantic Splitting |
| --- | --- |
| Cuts at character count | Cuts at meaning boundaries |
| May split mid-sentence | Keeps complete thoughts |
| Fast but rough | Smarter but slower |
| Good for simple text | Great for complex topics |

🎯 Key Point

Use semantic chunking when your content has different topics mixed together!


5. Chunk Size and Overlap Tuning ⚙️

What Is It?

Chunk size = how big each piece should be.
Overlap = how much text neighboring pieces share at their edges.

The Story

Imagine cutting a photo into puzzle pieces:

  • Too small: You can’t see the picture in each piece
  • Too big: Pieces don’t fit together well
  • Overlap: Like puzzle pieces with matching edges—they connect better!

The Magic Numbers

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # Size of each chunk
    chunk_overlap=50   # Shared text between chunks
)

Visualizing Overlap

Chunk 1: [AAAAAAAAAABBBB]
Chunk 2:           [BBBBCCCCCCCCCC]
Chunk 3:                     [CCCCDDDDDDDDDD]

The "BBBB" and "CCCC" parts OVERLAP!
This keeps context connected.
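The picture above is just a sliding window. A stdlib-only sketch (real splitters prefer to cut at separators rather than fixed character offsets):

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Slide a window of chunk_size characters, moving forward by
    (chunk_size - chunk_overlap) so neighboring chunks share their edges."""
    step = chunk_size - chunk_overlap
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + chunk_size])
        if i + chunk_size >= len(text):  # last window reached the end
            break
    return chunks

chunks = chunk_with_overlap("ABCDEFGHIJ", chunk_size=6, chunk_overlap=2)
# The last 2 characters of chunks[0] are the first 2 of chunks[1]
```

A sentence that straddles a boundary appears whole in at least one of the two chunks, which is exactly why overlap helps retrieval.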

Tuning Guide

| Chunk Size | Best For |
| --- | --- |
| 100-300 | Short Q&A, definitions |
| 300-500 | General documents |
| 500-1000 | Technical/detailed content |
| 1000+ | Long-form analysis |

| Overlap | When To Use |
| --- | --- |
| 0% | Independent facts |
| 10-15% | General documents |
| 20-30% | Connected narratives |

Example Tuning

# For a FAQ document (short answers)
faq_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20
)

# For a research paper (dense content)
paper_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100
)

🎯 Key Point

Start with 500/50 (chunk_size/overlap), then adjust based on your results!


6. Document Transformers 🔄

What Is It?

Document Transformers clean up and improve your chunks AFTER splitting.

The Story

After cutting your book into cards, you might want to:

  • Remove duplicate cards
  • Clean up messy text
  • Add more helpful labels
  • Translate languages

Common Transformers

1. Remove Duplicates

from langchain_community.document_transformers import EmbeddingsRedundantFilter

redundant_filter = EmbeddingsRedundantFilter(
    embeddings=embeddings  # any embedding model works here
)
unique_docs = redundant_filter.transform_documents(docs)

2. Clean HTML

from langchain_community.document_transformers import Html2TextTransformer

transformer = Html2TextTransformer()
clean_docs = transformer.transform_documents(html_docs)

3. Reorder for Long Context

from langchain_community.document_transformers import LongContextReorder

# Moves the most relevant documents to the start and end of the list,
# where models pay the most attention ("lost in the middle" mitigation)
reorderer = LongContextReorder()
ordered_docs = reorderer.transform_documents(docs)

Transformer Pipeline

graph TD
    A["Raw Documents"] --> B["Split into Chunks"]
    B --> C["Remove Duplicates"]
    C --> D["Clean HTML/Formatting"]
    D --> E["Reorder for Context"]
    E --> F["Ready for AI! ✨"]
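A transformer is conceptually just a function from a list of documents to a list of documents. Here is an exact-match stand-in for the duplicate-removal step, using plain dicts (the real EmbeddingsRedundantFilter goes further and catches near-duplicates by comparing embeddings):

```python
def drop_exact_duplicates(docs):
    """Keep only the first copy of each distinct page_content,
    ignoring case and surrounding whitespace."""
    seen, unique = set(), []
    for d in docs:
        key = d["page_content"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return unique

docs = [{"page_content": "The sun is a star."},
        {"page_content": "the sun is a star.  "},   # duplicate, dropped
        {"page_content": "Dogs are loyal."}]
unique_docs = drop_exact_duplicates(docs)
```

Because each transformer has this same list-in, list-out shape, they compose naturally into the pipeline shown above.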

🎯 Key Point

Transformers are your clean-up crew—they make your chunks perfect for AI!


🎉 The Complete Pipeline

Now you understand the WHOLE process:

graph TD
    A["📄 Your Files"] --> B["📥 Load Documents"]
    B --> C["🏷️ Add Metadata"]
    C --> D["✂️ Split Text"]
    D --> E["🧠 Semantic Chunking"]
    E --> F["⚙️ Tune Size/Overlap"]
    F --> G["🔄 Transform & Clean"]
    G --> H["🚀 Ready for RAG!"]

Full Example

# 1. Load
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("my_book.pdf")
docs = loader.load()

# 2. Split with good settings
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = splitter.split_documents(docs)

# 3. Transform (remove duplicates)
from langchain_community.document_transformers import EmbeddingsRedundantFilter
redundant_filter = EmbeddingsRedundantFilter(
    embeddings=embeddings  # any embedding model works here
)
final_docs = redundant_filter.transform_documents(chunks)

# Now ready for vector storage & RAG! 🎯

🌟 Summary

| Step | Purpose | Think Of It As… |
| --- | --- | --- |
| Loading | Bring files in | Opening a book |
| Metadata | Add helpful labels | Library card |
| Splitting | Cut into pieces | Making flashcards |
| Semantic | Split by meaning | Grouping related cards |
| Tuning | Perfect the size | Finding the sweet spot |
| Transform | Clean and polish | Final inspection |

💪 You Did It!

You now understand how to prepare documents for RAG:

  • ✅ Load any file type
  • ✅ Attach useful metadata
  • ✅ Split text smartly
  • ✅ Keep meanings together
  • ✅ Tune for best results
  • ✅ Clean and transform

Your AI can now read your library like a pro! 📚🤖
