LangSmith Evaluation: Your AI Quality Control Lab
Imagine you're a chef who just created a new recipe. How do you know if it's any good? You taste it, ask friends, write notes, and keep improving. LangSmith Evaluation does the same thing, but for AI!
The Big Picture
LangSmith is like a science lab for your AI. When you build AI apps with LangChain, you need to know: "Is my AI actually good?" LangSmith helps you test, measure, and improve your AI, just like a quality control team in a factory.
Think of it this way:
- Your AI = A student taking tests
- LangSmith = The teacher grading papers and giving feedback
LangSmith Datasets: Your Test Question Bank
What Is It?
A dataset is a collection of test questions for your AI. Just like a teacher keeps a folder of quiz questions, you keep a folder of inputs and expected outputs.
Simple Example
from langsmith import Client

client = Client()  # reads your LANGSMITH_API_KEY from the environment

# Creating a dataset is like
# writing test questions
dataset = client.create_dataset(
    "Customer Support Tests",
    description="Questions customers ask"
)

# Each "example" is one test question
client.create_example(
    inputs={"question": "How do I reset?"},
    outputs={"answer": "Click Settings > Reset"},
    dataset_id=dataset.id
)
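If you have many test questions, you can add them in one call. A small sketch using create_examples (the extra questions and answers here are made up for illustration):

# Add several test questions at once
client.create_examples(
    inputs=[
        {"question": "How do I reset?"},
        {"question": "Where is my invoice?"},
    ],
    outputs=[
        {"answer": "Click Settings > Reset"},
        {"answer": "Open Billing > Invoices"},
    ],
    dataset_id=dataset.id
)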
Why Does It Matter?
- Reusable tests: Run the same tests whenever you change your AI
- Track progress: See if your AI gets better over time
- Catch problems: Find issues before your users do
graph TD A["Create Dataset"] --> B["Add Examples"] B --> C["Input + Expected Output"] C --> D["Run Tests Anytime"] D --> E["Compare Results"]
LangSmith Experiments: Running Your Tests
What Is It?
An experiment is when you actually run your AI against your test questions. It's like exam day for your AI!
Simple Example
from langsmith import evaluate

# Run your AI against all test questions
# (my_ai_function is your app; accuracy_checker is a custom evaluator, sketched below)
results = evaluate(
    my_ai_function,
    data="Customer Support Tests",
    evaluators=[accuracy_checker],
    experiment_prefix="v1-test"
)
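The accuracy_checker above is not built into LangSmith; it is a custom evaluator you write yourself. A minimal sketch, assuming your AI's reply ends up under the run's "output" key and the dataset stores reference answers under "answer":

# A custom evaluator: receives the run and its reference example,
# returns a score between 0 and 1
def accuracy_checker(run, example):
    predicted = str(run.outputs.get("output", ""))
    expected = str(example.outputs.get("answer", ""))
    return {
        "key": "accuracy",
        "score": 1.0 if predicted.strip() == expected.strip() else 0.0,
    }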
What Happens During an Experiment?
- Your AI answers each test question
- LangSmith checks each answer
- You get scores and reports
Why Does It Matter?
- A/B testing: Compare two versions of your AI
- Regression testing: Make sure updates don't break things
- Confidence: Know your AI works before shipping
LLM-as-Judge Evaluation: Let AI Grade AI
What Is It?
Instead of writing complex rules to check answers, you use another AI to grade your AI! It's like having a senior student grade the younger students' work.
Simple Example
from langsmith.evaluation import LangChainStringEvaluator

# Create an AI judge
judge = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "helpful": "Is this helpful?",
            "accurate": "Is this correct?"
        }
    }
)

# The judge scores each answer
results = evaluate(
    my_ai,
    data="my-dataset",
    evaluators=[judge]
)
Why Use AI as a Judge?
| Human Grading | AI Grading |
|---|---|
| Slow | Super fast |
| Expensive | Cheap |
| Inconsistent | Same rules every time |
| Limited scale | Grade thousands in minutes |
Common Evaluation Criteria
- Correctness: Is the answer right?
- Helpfulness: Does it actually help?
- Safety: Is it appropriate?
- Coherence: Does it make sense?
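You can check several of these criteria in one experiment by passing multiple judges. A minimal sketch, reusing my_ai and the dataset from the example above (correctness compares against the reference answer, so it uses the "labeled_criteria" evaluator):

# One judge per criterion, all scored in the same experiment
correctness_judge = LangChainStringEvaluator(
    "labeled_criteria",
    config={"criteria": "correctness"},
)
coherence_judge = LangChainStringEvaluator(
    "criteria",
    config={"criteria": "coherence"},
)

results = evaluate(
    my_ai,
    data="my-dataset",
    evaluators=[correctness_judge, coherence_judge],
)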
Annotation Queues: Human Review Station
What Is It?
Sometimes AI can't judge everything. Annotation queues let humans review AI outputs when you need that human touch.
Think of it like a quality control conveyor belt where humans check products.
Simple Example
# Create a queue for human reviewers
queue = client.create_annotation_queue(
    name="Review Tricky Cases",
    description="Answers that need human check"
)

# Add items for review
# (run1 and run2 are runs you fetched earlier, e.g. with client.list_runs)
client.add_runs_to_annotation_queue(
    queue_id=queue.id,
    run_ids=[run1.id, run2.id]
)
When to Use Human Review?
- ๐จ High-stakes answers โ Medical, legal, financial
- ๐ค Edge cases โ Weird inputs AI struggles with
- ๐ท๏ธ Training data โ Create gold-standard examples
- ๐ Spot checks โ Random quality audits
graph TD A["AI Answers Question"] --> B{Confidence High?} B -->|Yes| C["Auto-Approve"] B -->|No| D["Send to Queue"] D --> E["Human Reviews"] E --> F["Feedback to Dataset"]
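The routing in this diagram can also be scripted. A minimal sketch that sends low-scoring runs to the queue, assuming your app logs a "confidence" feedback score and traces to a project called "my-project" (both names are placeholders):

# Find runs whose confidence feedback is low...
low_confidence_runs = client.list_runs(
    project_name="my-project",
    filter='and(eq(feedback_key, "confidence"), lt(feedback_score, 0.5))',
)

# ...and send them to the human review queue
client.add_runs_to_annotation_queue(
    queue_id=queue.id,
    run_ids=[run.id for run in low_confidence_runs],
)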
LangSmith Client API: Your Control Panel
What Is It?
The Client API is how you talk to LangSmith from your code. It's like a remote control for all LangSmith features.
Setting Up
from langsmith import Client

# Connect to LangSmith
client = Client(
    api_key="your-api-key"  # or set the LANGSMITH_API_KEY environment variable
)
What Can You Do?
Manage Datasets:
# List all datasets
datasets = client.list_datasets()

# Get specific dataset
ds = client.read_dataset(
    dataset_name="My Tests"
)
Work with Runs:
# Get all AI runs
runs = client.list_runs(
    project_name="my-project",
    filter='eq(status, "success")'
)
Get Feedback:
# Add feedback to a run
client.create_feedback(
    run_id=run.id,
    key="quality",
    score=0.9,
    comment="Great answer!"
)
Key API Features
| Feature | What It Does |
|---|---|
| `create_dataset` | Make new test set |
| `create_example` | Add test question |
| `list_runs` | See AI activity |
| `create_feedback` | Add scores/notes |
Prompt Versioning: Track Your Changes
What Is It?
Every time you change your AI's instructions (prompts), LangSmith saves a version. Like "Save As" but automatic!
Why Does It Matter?
Imagine your AI worked great yesterday but is broken today. With versioning, you can:
- Go back to the working version
- Compare old vs new
- Find what change broke things
Simple Example
from langchain_core.prompts import ChatPromptTemplate

# Save a prompt version
prompt = ChatPromptTemplate.from_template("You are a helpful...")
client.push_prompt(
    "support-assistant",
    object=prompt,
    tags=["v2", "improved"]
)

# Later, get old version
old_prompt = client.pull_prompt("support-assistant:v1")
Version Control Best Practices
- Tag meaningful changes: "added-safety-rules"
- Test before promoting: Run experiments first
- Keep notes: Why did you change it?
graph TD A["Prompt v1"] --> B["Test Results: 70%"] B --> C["Edit Prompt"] C --> D["Prompt v2"] D --> E["Test Results: 85%"] E --> F["Promote v2 to Production"]
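The compare-then-promote loop in this diagram can be scripted with the pieces introduced earlier. A rough sketch, assuming a hypothetical build_chain(prompt) helper in your own code that wires a pulled prompt to your model, plus the dataset and accuracy_checker from the Experiments section:

# Pull two versions of the prompt
prompt_v1 = client.pull_prompt("support-assistant:v1")
prompt_v2 = client.pull_prompt("support-assistant:v2")

# Run the same dataset against each version
# (build_chain is a hypothetical helper that combines a prompt with your model)
for tag, prompt in [("v1", prompt_v1), ("v2", prompt_v2)]:
    evaluate(
        build_chain(prompt),
        data="Customer Support Tests",
        evaluators=[accuracy_checker],
        experiment_prefix=f"support-assistant-{tag}",
    )

# Compare the two experiments side by side in LangSmith,
# then promote the better prompt to production.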
Debugging and Logging: Find Problems Fast
What Is It?
When your AI does something weird, logging shows you exactly what happened, step by step. It's like a security camera recording everything.
What Gets Logged?
- Input: What the user asked
- Chain steps: Each part of your AI's thinking
- Output: What your AI said
- Timing: How long each step took
- Tokens: How much it cost
Simple Example
import os

# Enable tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"

# Now all your LangChain code
# is automatically logged!
Reading Logs in LangSmith
graph TD A["User Question"] --> B["Retrieval Step"] B --> C["Context Found"] C --> D["LLM Call"] D --> E["Final Answer"] style A fill:#e3f2fd style E fill:#c8e6c9
Each step shows:
- What went in
- What came out
- How long it took
- Any errors
Debugging Tips
| Problem | Where to Look |
|---|---|
| Wrong answer | Check LLM input |
| Slow response | Check step timings |
| High cost | Check token counts |
| Missing info | Check retrieval step |
Setting Log Detail Levels
import os

# Verbose logging for debugging
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "debug-session"
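Tracing is not limited to LangChain components. You can log plain Python functions too with the traceable decorator; a minimal sketch (format_answer is just an illustrative function, not part of LangSmith):

from langsmith import traceable

# Any decorated function shows up as its own step in the trace
@traceable(name="format_answer")
def format_answer(raw_answer: str) -> str:
    return raw_answer.strip()

print(format_answer("  Click Settings > Reset  "))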
Putting It All Together
Here's how all these pieces work as a team:
graph TD A["Build AI with LangChain"] --> B["Create Dataset"] B --> C["Run Experiment"] C --> D["LLM-as-Judge Scores"] D --> E{Good Enough?} E -->|No| F["Check Logs"] F --> G["Update Prompt"] G --> H["Save Version"] H --> C E -->|Yes| I["Ship It!"] J["Annotation Queue"] --> K["Human Feedback"] K --> B
Quick Reference
| Tool | What It Does | When to Use |
|---|---|---|
| Datasets | Store test cases | Before testing |
| Experiments | Run tests | When changing AI |
| LLM-as-Judge | Auto-grade answers | For scale |
| Annotation Queues | Human review | For accuracy |
| Client API | Control everything | In your code |
| Prompt Versioning | Track changes | Always |
| Debugging/Logging | Find problems | When stuck |
You Did It!
You now understand how to:
- Create test datasets for your AI
- Run experiments to measure quality
- Use AI to grade AI outputs
- Set up human review workflows
- Control LangSmith with code
- Track prompt versions
- Debug with detailed logs
LangSmith Evaluation is your AI's best friend, helping you build AI that's not just smart, but reliably smart!
