LangSmith Evaluation: Your AI Quality Control Lab
Imagine you're a chef who just created a new recipe. How do you know if it's any good? You taste it, ask friends, write notes, and keep improving. LangSmith Evaluation does the same thing, but for AI!
The Big Picture
LangSmith is like a science lab for your AI. When you build AI apps with LangChain, you need to know: "Is my AI actually good?" LangSmith helps you test, measure, and improve your AI, just like a quality control team in a factory.
Think of it this way:
- Your AI = A student taking tests
- LangSmith = The teacher grading papers and giving feedback
LangSmith Datasets: Your Test Question Bank
What Is It?
A dataset is a collection of test questions for your AI. Just like a teacher keeps a folder of quiz questions, you keep a folder of inputs and expected outputs.
Simple Example
from langsmith import Client

client = Client()  # reads your LANGSMITH_API_KEY from the environment

# Creating a dataset is like
# writing test questions
dataset = client.create_dataset(
    "Customer Support Tests",
    description="Questions customers ask"
)

# Each "example" is one test question
client.create_example(
    inputs={"question": "How do I reset?"},
    outputs={"answer": "Click Settings > Reset"},
    dataset_id=dataset.id
)
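If you have many test questions, you can add them in one call. A small sketch using create_examples (the extra questions and answers here are made up for illustration):

# Add several test questions at once
client.create_examples(
    inputs=[
        {"question": "How do I reset?"},
        {"question": "Where is my invoice?"},
    ],
    outputs=[
        {"answer": "Click Settings > Reset"},
        {"answer": "Open Billing > Invoices"},
    ],
    dataset_id=dataset.id
)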
Why Does It Matter?
- Reusable tests: Run the same tests whenever you change your AI
- Track progress: See if your AI gets better over time
- Catch problems: Find issues before your users do
graph TD A["Create Dataset"] --> B["Add Examples"] B --> C["Input + Expected Output"] C --> D["Run Tests Anytime"] D --> E["Compare Results"]
LangSmith Experiments: Running Your Tests
What Is It?
An experiment is when you actually run your AI against your test questions. It's like exam day for your AI!
Simple Example
from langsmith import evaluate

# Run your AI against all test questions
# (my_ai_function is your app; accuracy_checker is a custom evaluator, sketched below)
results = evaluate(
    my_ai_function,
    data="Customer Support Tests",
    evaluators=[accuracy_checker],
    experiment_prefix="v1-test"
)
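The accuracy_checker above is not built into LangSmith; it is a custom evaluator you write yourself. A minimal sketch, assuming your AI's reply ends up under the run's "output" key and the dataset stores reference answers under "answer":

# A custom evaluator: receives the run and its reference example,
# returns a score between 0 and 1
def accuracy_checker(run, example):
    predicted = str(run.outputs.get("output", ""))
    expected = str(example.outputs.get("answer", ""))
    return {
        "key": "accuracy",
        "score": 1.0 if predicted.strip() == expected.strip() else 0.0,
    }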
What Happens During an Experiment?
- Your AI answers each test question
- LangSmith checks each answer
- You get scores and reports
Why Does It Matter?
- A/B testing: Compare two versions of your AI
- Regression testing: Make sure updates don't break things
- Confidence: Know your AI works before shipping
LLM-as-Judge Evaluation: Let AI Grade AI
What Is It?
Instead of writing complex rules to check answers, you use another AI to grade your AI! It's like having a senior student grade the younger students' work.
Simple Example
from langsmith.evaluation import LangChainStringEvaluator

# Create an AI judge
judge = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "helpful": "Is this helpful?",
            "accurate": "Is this correct?"
        }
    }
)

# The judge scores each answer
results = evaluate(
    my_ai,
    data="my-dataset",
    evaluators=[judge]
)
Why Use AI as a Judge?
| Human Grading | AI Grading |
|---|---|
| Slow | Super fast |
| Expensive | Cheap |
| Inconsistent | Same rules every time |
| Limited scale | Grade thousands in minutes |
Common Evaluation Criteria
- Correctness: Is the answer right?
- Helpfulness: Does it actually help?
- Safety: Is it appropriate?
- Coherence: Does it make sense?
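You can check several of these criteria in one experiment by passing multiple judges. A minimal sketch, reusing my_ai and the dataset from the example above (correctness compares against the reference answer, so it uses the "labeled_criteria" evaluator):

# One judge per criterion, all scored in the same experiment
correctness_judge = LangChainStringEvaluator(
    "labeled_criteria",
    config={"criteria": "correctness"},
)
coherence_judge = LangChainStringEvaluator(
    "criteria",
    config={"criteria": "coherence"},
)

results = evaluate(
    my_ai,
    data="my-dataset",
    evaluators=[correctness_judge, coherence_judge],
)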
Annotation Queues: Human Review Station
What Is It?
Sometimes AI can't judge everything. Annotation queues let humans review AI outputs when you need that human touch.
Think of it like a quality control conveyor belt where humans check products.
Simple Example
# Create a queue for human reviewers
queue = client.create_annotation_queue(
    name="Review Tricky Cases",
    description="Answers that need human check"
)

# Add items for review
# (run1 and run2 are runs you fetched earlier, e.g. with client.list_runs)
client.add_runs_to_annotation_queue(
    queue_id=queue.id,
    run_ids=[run1.id, run2.id]
)
When to Use Human Review?
- ๐จ High-stakes answers โ Medical, legal, financial
- ๐ค Edge cases โ Weird inputs AI struggles with
- ๐ท๏ธ Training data โ Create gold-standard examples
- ๐ Spot checks โ Random quality audits
graph TD A["AI Answers Question"] --> B{Confidence High?} B -->|Yes| C["Auto-Approve"] B -->|No| D["Send to Queue"] D --> E["Human Reviews"] E --> F["Feedback to Dataset"]
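The routing in this diagram can also be scripted. A minimal sketch that sends low-scoring runs to the queue, assuming your app logs a "confidence" feedback score and traces to a project called "my-project" (both names are placeholders):

# Find runs whose confidence feedback is low...
low_confidence_runs = client.list_runs(
    project_name="my-project",
    filter='and(eq(feedback_key, "confidence"), lt(feedback_score, 0.5))',
)

# ...and send them to the human review queue
client.add_runs_to_annotation_queue(
    queue_id=queue.id,
    run_ids=[run.id for run in low_confidence_runs],
)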
LangSmith Client API: Your Control Panel
What Is It?
The Client API is how you talk to LangSmith from your code. It's like a remote control for all LangSmith features.
Setting Up
from langsmith import Client

# Connect to LangSmith
client = Client(
    api_key="your-api-key"  # or set the LANGSMITH_API_KEY environment variable
)
What Can You Do?
Manage Datasets:
# List all datasets
datasets = client.list_datasets()

# Get specific dataset
ds = client.read_dataset(
    dataset_name="My Tests"
)
Work with Runs:
# Get all AI runs
runs = client.list_runs(
    project_name="my-project",
    filter='eq(status, "success")'
)
Get Feedback:
# Add feedback to a run
client.create_feedback(
    run_id=run.id,
    key="quality",
    score=0.9,
    comment="Great answer!"
)
Key API Features
| Feature | What It Does |
|---|---|
| `create_dataset` | Make new test set |
| `create_example` | Add test question |
| `list_runs` | See AI activity |
| `create_feedback` | Add scores/notes |
Prompt Versioning: Track Your Changes
What Is It?
Every time you change your AI's instructions (prompts), LangSmith saves a version. Like "Save As" but automatic!
Why Does It Matter?
Imagine your AI worked great yesterday but is broken today. With versioning, you can:
- Go back to the working version
- Compare old vs new
- Find what change broke things
Simple Example
from langchain_core.prompts import ChatPromptTemplate

# Save a prompt version
prompt = ChatPromptTemplate.from_template("You are a helpful...")
client.push_prompt(
    "support-assistant",
    object=prompt,
    tags=["v2", "improved"]
)

# Later, get old version
old_prompt = client.pull_prompt("support-assistant:v1")
Version Control Best Practices
- Tag meaningful changes: "added-safety-rules"
- Test before promoting: Run experiments first
- Keep notes: Why did you change it?
graph TD A["Prompt v1"] --> B["Test Results: 70%"] B --> C["Edit Prompt"] C --> D["Prompt v2"] D --> E["Test Results: 85%"] E --> F["Promote v2 to Production"]
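The compare-then-promote loop in this diagram can be scripted with the pieces introduced earlier. A rough sketch, assuming a hypothetical build_chain(prompt) helper in your own code that wires a pulled prompt to your model, plus the dataset and accuracy_checker from the Experiments section:

# Pull two versions of the prompt
prompt_v1 = client.pull_prompt("support-assistant:v1")
prompt_v2 = client.pull_prompt("support-assistant:v2")

# Run the same dataset against each version
# (build_chain is a hypothetical helper that combines a prompt with your model)
for tag, prompt in [("v1", prompt_v1), ("v2", prompt_v2)]:
    evaluate(
        build_chain(prompt),
        data="Customer Support Tests",
        evaluators=[accuracy_checker],
        experiment_prefix=f"support-assistant-{tag}",
    )

# Compare the two experiments side by side in LangSmith,
# then promote the better prompt to production.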
Debugging and Logging: Find Problems Fast
What Is It?
When your AI does something weird, logging shows you exactly what happened, step by step. It's like a security camera recording everything.
What Gets Logged?
- Input: What the user asked
- Chain steps: Each part of your AI's thinking
- Output: What your AI said
- Timing: How long each step took
- Tokens: How much it cost
Simple Example
import os

# Enable tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"

# Now all your LangChain code
# is automatically logged!
Reading Logs in LangSmith
graph TD A["User Question"] --> B["Retrieval Step"] B --> C["Context Found"] C --> D["LLM Call"] D --> E["Final Answer"] style A fill:#e3f2fd style E fill:#c8e6c9
Each step shows:
- What went in
- What came out
- How long it took
- Any errors
Debugging Tips
| Problem | Where to Look |
|---|---|
| Wrong answer | Check LLM input |
| Slow response | Check step timings |
| High cost | Check token counts |
| Missing info | Check retrieval step |
Setting Log Detail Levels
import os

# Verbose logging for debugging
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "debug-session"
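Tracing is not limited to LangChain components. You can log plain Python functions too with the traceable decorator; a minimal sketch (format_answer is just an illustrative function, not part of LangSmith):

from langsmith import traceable

# Any decorated function shows up as its own step in the trace
@traceable(name="format_answer")
def format_answer(raw_answer: str) -> str:
    return raw_answer.strip()

print(format_answer("  Click Settings > Reset  "))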
Putting It All Together
Here's how all these pieces work as a team:
graph TD A["Build AI with LangChain"] --> B["Create Dataset"] B --> C["Run Experiment"] C --> D["LLM-as-Judge Scores"] D --> E{Good Enough?} E -->|No| F["Check Logs"] F --> G["Update Prompt"] G --> H["Save Version"] H --> C E -->|Yes| I["Ship It!"] J["Annotation Queue"] --> K["Human Feedback"] K --> B
Quick Reference
| Tool | What It Does | When to Use |
|---|---|---|
| Datasets | Store test cases | Before testing |
| Experiments | Run tests | When changing AI |
| LLM-as-Judge | Auto-grade answers | For scale |
| Annotation Queues | Human review | For accuracy |
| Client API | Control everything | In your code |
| Prompt Versioning | Track changes | Always |
| Debugging/Logging | Find problems | When stuck |
You Did It!
You now understand how to:
- Create test datasets for your AI
- Run experiments to measure quality
- Use AI to grade AI outputs
- Set up human review workflows
- Control LangSmith with code
- Track prompt versions
- Debug with detailed logs
LangSmith Evaluation is your AI's best friend, helping you build AI that's not just smart, but reliably smart!
