🏛️ Vector Store Basics: Your AI’s Super Library

Imagine you’re building the world’s smartest library. Not one where books sit on dusty shelves—but one where the librarian instantly knows which books are most similar to your thoughts, even if you don’t know the exact title!


🎯 The Big Picture

You want your AI to remember things and find related information fast. That’s what Vector Stores do. They’re like magical filing cabinets that understand meaning, not just keywords.


📚 What is a Vector Store?

The Simple Story

Think of a vector as a secret code that captures what something means.

Example:

  • The word “puppy” might become [0.8, 0.2, 0.9]
  • The word “dog” might become [0.7, 0.3, 0.85]
  • The word “car” might become [0.1, 0.9, 0.1]

Notice how “puppy” and “dog” have similar numbers? That’s because they mean similar things!

A Vector Store is a special database that:

  1. Stores these secret codes (vectors)
  2. Finds similar codes super fast
  3. Returns the original text you stored
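"Similar codes" are usually compared with cosine similarity: vectors pointing in nearly the same direction score close to 1.0. Here's a minimal sketch using the toy vectors above (the numbers are made up for illustration, not real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 = same direction, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

puppy = [0.8, 0.2, 0.9]
dog   = [0.7, 0.3, 0.85]
car   = [0.1, 0.9, 0.1]

print(cosine_similarity(puppy, dog))  # close to 1.0
print(cosine_similarity(puppy, car))  # much lower
```

Real embeddings have hundreds or thousands of dimensions, but the math is exactly this.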

Real Life Example

You ask: “What’s a good pet for kids?”

The vector store thinks:

“Hmm, this question is similar to documents about puppies, cats, and hamsters… NOT similar to documents about cars or computers!”

Then it returns the most relevant documents!


🧠 Vector Store Fundamentals

The Three Magic Steps

graph TD
    A["📄 Your Text"] --> B["🔢 Convert to Vector"]
    B --> C["💾 Store in Database"]
    C --> D["🔍 Search by Similarity"]

Step 1: Embedding (Making the Secret Code)

Your text becomes a list of numbers called an embedding.

# LangChain makes this easy!
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vector = embeddings.embed_query("I love pizza")
# Returns: [0.01, -0.02, 0.05, ...]

Step 2: Storing (Putting It in the Library)

The vector + original text go into the store together.

Step 3: Searching (Finding Similar Things)

When you search, your question becomes a vector too. The store finds vectors that are “close” to yours!

Think of it like this:

  • Your question is a point on a map
  • Stored documents are other points
  • The store finds the nearest neighbors!
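The "nearest neighbors on a map" idea can be sketched in a few lines of plain Python: a toy in-memory store that ranks documents by distance to the query point (the vectors here are hand-made stand-ins for real embeddings):

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy store: each document is a (vector, text) pair
store = [
    ([0.8, 0.2], "Puppies make playful pets."),
    ([0.7, 0.3], "Dogs are loyal companions."),
    ([0.1, 0.9], "Cars need regular oil changes."),
]

def similarity_search(query_vector, k=2):
    """Return the k texts whose vectors are nearest the query."""
    ranked = sorted(store, key=lambda item: euclidean(item[0], query_vector))
    return [text for _, text in ranked[:k]]

print(similarity_search([0.75, 0.25], k=2))
# Both dog-related texts come back; the car text does not
```

Real vector stores do the same thing, just with approximate-nearest-neighbor indexes so it stays fast over millions of vectors.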

🛠️ Vector Store Options in LangChain

LangChain works with MANY vector stores. Here are the popular ones:

Quick Comparison

| Store | Best For | Setup |
| --- | --- | --- |
| Chroma | Getting started | Easy |
| FAISS | Fast local search | Easy |
| Pinecone | Production apps | Medium |
| Weaviate | Complex queries | Medium |
| Qdrant | Large scale | Medium |

🥇 Chroma (Perfect for Learning!)

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Create a vector store
vectorstore = Chroma(
    embedding_function=OpenAIEmbeddings()
)

Why Chroma?

  • Runs locally (no internet needed!)
  • Zero setup headaches
  • Great for prototyping

🚀 FAISS (Super Fast!)

from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(
    documents,
    OpenAIEmbeddings()
)

Why FAISS?

  • Made by Facebook AI
  • Blazingly fast searches
  • Works on your laptop

☁️ Pinecone (For Real Apps)

from langchain_pinecone import PineconeVectorStore

vectorstore = PineconeVectorStore(
    index_name="my-index",
    embedding=OpenAIEmbeddings()
)

Why Pinecone?

  • Cloud-hosted (always available)
  • Scales to millions of vectors
  • Production-ready

📥 Adding and Indexing Documents

The Journey of a Document

graph TD
    A["📄 Raw Document"] --> B["✂️ Split into Chunks"]
    B --> C["🔢 Create Embeddings"]
    C --> D["💾 Store with Metadata"]
    D --> E["✅ Ready to Search!"]

Method 1: Add Texts Directly

texts = [
    "Dogs are loyal pets.",
    "Cats are independent.",
    "Fish need aquariums."
]

# Add to vector store
vectorstore.add_texts(texts)

Method 2: Add Documents with Metadata

from langchain_core.documents import Document

docs = [
    Document(
        page_content="Dogs are loyal pets.",
        metadata={"animal": "dog", "type": "pet"}
    ),
    Document(
        page_content="Cats are independent.",
        metadata={"animal": "cat", "type": "pet"}
    )
]

vectorstore.add_documents(docs)

Why Metadata Matters:

  • Filter searches: “Only show me dog articles!”
  • Track sources: “Where did this info come from?”
  • Add context: “When was this written?”

Method 3: Create from Documents (All at Once!)

# Load, split, and index in one go!
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings()
)

🎯 Chunking: Why Size Matters

Big documents need to be split into smaller pieces:

from langchain_text_splitters import (
    RecursiveCharacterTextSplitter
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # Characters per chunk
    chunk_overlap=50     # Overlap between chunks
)

chunks = splitter.split_documents(docs)

The Goldilocks Rule:

  • Too big → Loses focus, wastes tokens
  • Too small → Loses context, misses meaning
  • Just right → 500-1000 characters usually works!
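The chunk_size/chunk_overlap idea can be sketched without any library, assuming the crudest possible strategy (fixed-size character windows):

```python
def split_text(text, chunk_size=20, chunk_overlap=5):
    """Fixed-size character chunks; each chunk re-includes the last
    `chunk_overlap` characters of the previous one, so a sentence
    cut at a boundary still appears whole somewhere."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "Vector stores save embeddings and search them by meaning."
for chunk in split_text(text):
    print(repr(chunk))
```

RecursiveCharacterTextSplitter is smarter than this sketch: it prefers to break on paragraphs, then sentences, then words, and only falls back to raw characters as a last resort.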

🔌 The Indexing API

LangChain’s Indexing API is your smart assistant that:

  • ✅ Avoids duplicates (no wasted storage!)
  • ✅ Tracks what’s been indexed
  • ✅ Updates only what changed
  • ✅ Deletes outdated content

The Problem It Solves

Without the Indexing API:

  • Re-run your script → Duplicates everywhere!
  • Update a document → Old version still there!
  • Delete a source → Ghost data haunts you!
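The core trick behind this bookkeeping is simple: hash each document's content, remember which hashes have already been indexed, and skip repeats. A minimal sketch of the idea (not LangChain's actual internals):

```python
import hashlib

seen_hashes = set()   # what the record manager remembers
indexed_docs = []     # stand-in for the vector store

def index_once(text):
    """Add `text` only if identical content wasn't indexed before."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False   # duplicate: skipped
    seen_hashes.add(digest)
    indexed_docs.append(text)
    return True

index_once("Dogs are loyal pets.")
index_once("Dogs are loyal pets.")   # re-running the script: skipped
print(len(indexed_docs))             # 1, not 2
```

The Indexing API layers updates and deletions on top of this hash tracking, persisting the records in a database (the SQLRecordManager below) so they survive between runs.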

Setting Up the Indexing API

from langchain.indexes import SQLRecordManager
from langchain.indexes import index

# Create a record manager
record_manager = SQLRecordManager(
    namespace="my_docs",
    db_url="sqlite:///records.db"
)

# Initialize it
record_manager.create_schema()

Indexing Modes Explained

# Mode 1: None - adds new content, skips exact
# duplicates, never deletes old versions
index(docs, record_manager, vectorstore,
      cleanup=None)

# Mode 2: "incremental" - smart updates; requires
# source_id_key so changed versions can be cleaned up
index(docs, record_manager, vectorstore,
      cleanup="incremental",
      source_id_key="source")

# Mode 3: "full" - complete sync with the batch
index(docs, record_manager, vectorstore,
      cleanup="full")

| Mode | What It Does |
| --- | --- |
| `None` | Adds new content, skips exact duplicates, never deletes |
| `"incremental"` | Adds new, skips unchanged, deletes changed versions of each source |
| `"full"` | Adds new, deletes anything missing from the current batch |
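"Full" mode boils down to a set comparison between what's already indexed and what's in the current batch. A sketch of that bookkeeping (a simplification of what the real API does):

```python
def full_sync(existing_ids, current_ids):
    """What 'full' cleanup computes: which docs to add,
    and which previously indexed docs to delete."""
    to_add = current_ids - existing_ids
    to_delete = existing_ids - current_ids
    return to_add, to_delete

existing = {"doc1", "doc2", "doc3"}
current = {"doc1", "doc2", "doc4"}

to_add, to_delete = full_sync(existing, current)
print(to_add)     # {'doc4'}
print(to_delete)  # {'doc3'}
```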

Real-World Example

# First run: indexes 3 documents
docs = [doc1, doc2, doc3]
index(docs, record_manager, vectorstore,
      cleanup="full")
# Result: 3 docs in store

# Second run: doc3 removed, doc4 added
docs = [doc1, doc2, doc4]
index(docs, record_manager, vectorstore,
      cleanup="full")
# Result: doc3 deleted, doc4 added!
# Only doc1, doc2, doc4 remain

Source IDs: Track Your Documents

# Add source tracking
index(
    docs,
    record_manager,
    vectorstore,
    cleanup="full",
    source_id_key="source"  # Uses metadata
)

Now each document knows where it came from!


🔍 Searching Your Vector Store

Once indexed, searching is magical:

# Simple search
results = vectorstore.similarity_search(
    "What pet is best for kids?",
    k=3  # Return top 3 matches
)

# Search with scores
results = vectorstore.similarity_search_with_score(
    "What pet is best for kids?",
    k=3
)

Filter by Metadata

# Only search dog documents
results = vectorstore.similarity_search(
    "training tips",
    k=3,
    filter={"animal": "dog"}
)

🎉 Putting It All Together

Here’s a complete mini-project:

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

# 1. Create documents
docs = [
    Document(
        page_content="Golden Retrievers are friendly.",
        metadata={"animal": "dog"}
    ),
    Document(
        page_content="Siamese cats are vocal.",
        metadata={"animal": "cat"}
    ),
    Document(
        page_content="Goldfish are easy to care for.",
        metadata={"animal": "fish"}
    )
]

# 2. Create vector store
vectorstore = Chroma.from_documents(
    docs,
    OpenAIEmbeddings()
)

# 3. Search!
results = vectorstore.similarity_search(
    "I want a friendly pet",
    k=2
)

# The Golden Retriever doc should come back first!

🧩 Key Takeaways

  1. Vectors are number-lists that capture meaning
  2. Vector Stores save and search these vectors
  3. Many options: Chroma (easy), FAISS (fast), Pinecone (production)
  4. Add documents with add_texts() or add_documents()
  5. Indexing API prevents duplicates and keeps data fresh
  6. Search finds similar content by meaning, not keywords!

🚀 You’re Ready!

You now understand how to:

  • ✅ Pick the right vector store
  • ✅ Add and index documents properly
  • ✅ Use the Indexing API like a pro
  • ✅ Search by meaning, not just keywords

Next up: Use this with RAG to build AI that answers questions from YOUR data!

Remember: Vector stores are just smart filing cabinets. You put stuff in, they remember the meaning, and find similar things lightning-fast!
