LangSmith Evaluation

🔬 LangSmith Evaluation: Your AI Quality Control Lab

Imagine you're a chef who just created a new recipe. How do you know if it's any good? You taste it, ask friends, write notes, and keep improving. LangSmith Evaluation does the same thing, but for AI!


🎯 The Big Picture

LangSmith is like a science lab for your AI. When you build AI apps with LangChain, you need to know: "Is my AI actually good?" LangSmith helps you test, measure, and improve your AI, just like a quality control team in a factory.

Think of it this way:

  • Your AI = A student taking tests
  • LangSmith = The teacher grading papers and giving feedback

📦 LangSmith Datasets: Your Test Question Bank

What Is It?

A dataset is a collection of test questions for your AI. Just like a teacher keeps a folder of quiz questions, you keep a folder of inputs and expected outputs.

Simple Example

from langsmith import Client

# Connect to LangSmith (reads the LANGSMITH_API_KEY environment variable)
client = Client()

# Creating a dataset is like
# writing test questions
dataset = client.create_dataset(
    "Customer Support Tests",
    description="Questions customers ask"
)

# Each "example" is one test question
client.create_example(
    inputs={"question": "How do I reset?"},
    outputs={"answer": "Click Settings > Reset"},
    dataset_id=dataset.id
)

Why Does It Matter?

  • ๐Ÿ“ Reusable tests โ€” Run the same tests whenever you change your AI
  • ๐Ÿ“Š Track progress โ€” See if your AI gets better over time
  • ๐ŸŽฏ Catch problems โ€” Find issues before your users do
graph TD A["Create Dataset"] --> B["Add Examples"] B --> C["Input + Expected Output"] C --> D["Run Tests Anytime"] D --> E["Compare Results"]
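
If you have many test questions, you can load them in one call instead of one at a time. A minimal sketch, assuming the client and dataset from the example above (the extra questions are made up for illustration):

# Bulk-load several test questions at once
client.create_examples(
    inputs=[
        {"question": "How do I reset?"},
        {"question": "How do I change my email?"},
    ],
    outputs=[
        {"answer": "Click Settings > Reset"},
        {"answer": "Go to Profile > Email"},
    ],
    dataset_id=dataset.id,
)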

🧪 LangSmith Experiments: Running Your Tests

What Is It?

An experiment is when you actually run your AI against your test questions. It's like exam day for your AI!

Simple Example

from langsmith import evaluate

# Run your AI against all test questions
results = evaluate(
    my_ai_function,
    data="Customer Support Tests",
    evaluators=[accuracy_checker],
    experiment_prefix="v1-test"
)

What Happens During an Experiment?

  1. 🎬 Your AI answers each test question
  2. ✅ LangSmith checks each answer with your evaluators (like the one sketched below)
  3. 📈 You get scores and reports
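
The accuracy_checker used in the snippet above is a custom evaluator. Here is a minimal sketch of one, assuming your AI function returns a dict with an "answer" key; LangSmith calls it once per run, passing the run and its dataset example:

def accuracy_checker(run, example):
    # run.outputs = what your AI actually returned
    # example.outputs = the expected output stored in the dataset
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")
    return {
        "key": "accuracy",
        "score": 1.0 if predicted.strip() == expected.strip() else 0.0,
    }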

Why Does It Matter?

  • A/B testing - Compare two versions of your AI (see the sketch below)
  • Regression testing - Make sure updates don't break things
  • Confidence - Know your AI works before shipping
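
For instance, an A/B test can be two calls to evaluate with different experiment prefixes. This sketch assumes two candidate functions, my_ai_v1 and my_ai_v2, plus the accuracy_checker evaluator from earlier:

# Run the same tests against two candidate versions
for name, target in [("v1", my_ai_v1), ("v2", my_ai_v2)]:
    evaluate(
        target,
        data="Customer Support Tests",
        evaluators=[accuracy_checker],
        experiment_prefix=f"support-bot-{name}",
    )
# Each call becomes its own experiment, so the two versions
# can be compared side by side in the LangSmith UI.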

🤖 LLM-as-Judge Evaluation: Let AI Grade AI

What Is It?

Instead of writing complex rules to check answers, you use another AI to grade your AI! It's like having a senior student grade the younger students' work.

Simple Example

from langsmith.evaluation import LangChainStringEvaluator

# Create an AI judge
judge = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "helpful": "Is this helpful?",
            "accurate": "Is this correct?"
        }
    }
)

# The judge scores each answer
results = evaluate(
    my_ai,
    data="my-dataset",
    evaluators=[judge]
)

Why Use AI as a Judge?

Human Grading      AI Grading
Slow               Super fast
Expensive          Cheap
Inconsistent       Same rules every time
Limited scale      Grade thousands in minutes

Common Evaluation Criteria

  • ✅ Correctness - Is the answer right? (see the judge sketch below)
  • 💡 Helpfulness - Does it actually help?
  • 🛡️ Safety - Is it appropriate?
  • 📖 Coherence - Does it make sense?
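
A hedged sketch of wiring two of these criteria into judges. LangChain ships built-in criteria such as "correctness" (which compares against the reference answer, so it uses the "labeled_criteria" evaluator) and "coherence" (judged from the answer alone); my_ai is the function under test from the earlier example:

from langsmith import evaluate
from langsmith.evaluation import LangChainStringEvaluator

# Correctness needs the reference answer, coherence does not
correctness_judge = LangChainStringEvaluator(
    "labeled_criteria",
    config={"criteria": "correctness"},
)
coherence_judge = LangChainStringEvaluator(
    "criteria",
    config={"criteria": "coherence"},
)

results = evaluate(
    my_ai,
    data="my-dataset",
    evaluators=[correctness_judge, coherence_judge],
)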

📋 Annotation Queues: Human Review Station

What Is It?

Sometimes AI can't judge everything. Annotation queues let humans review AI outputs when you need that human touch.

Think of it like a quality control conveyor belt where humans check products.

Simple Example

# Create a queue for human reviewers
queue = client.create_annotation_queue(
    name="Review Tricky Cases",
    description="Answers that need human check"
)

# Add items for review
client.add_runs_to_annotation_queue(
    queue_id=queue.id,
    run_ids=[run1.id, run2.id]
)

When to Use Human Review?

  • 🚨 High-stakes answers - Medical, legal, financial
  • 🤔 Edge cases - Weird inputs AI struggles with
  • 🏷️ Training data - Create gold-standard examples
  • 🔍 Spot checks - Random quality audits (see the sketch below)

graph TD
    A["AI Answers Question"] --> B{Confidence High?}
    B -->|Yes| C["Auto-Approve"]
    B -->|No| D["Send to Queue"]
    D --> E["Human Reviews"]
    E --> F["Feedback to Dataset"]
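
For the spot-check case, a small sketch that assumes the client and queue from the snippet above; the project name and the limit of 20 runs are placeholders:

# Pull a handful of recent runs and queue them for human review
recent_runs = client.list_runs(
    project_name="my-project",
    limit=20,
)
client.add_runs_to_annotation_queue(
    queue_id=queue.id,
    run_ids=[run.id for run in recent_runs],
)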

🔌 LangSmith Client API: Your Control Panel

What Is It?

The Client API is how you talk to LangSmith from your code. It's like a remote control for all LangSmith features.

Setting Up

from langsmith import Client

# Connect to LangSmith
client = Client(
    api_key="your-api-key"
)

What Can You Do?

Manage Datasets:

# List all datasets
datasets = client.list_datasets()

# Get specific dataset
ds = client.read_dataset(
    dataset_name="My Tests"
)

Work with Runs:

# Get all AI runs
runs = client.list_runs(
    project_name="my-project",
    filter='eq(status, "success")'
)

Get Feedback:

# Add feedback to a run
client.create_feedback(
    run_id=run.id,
    key="quality",
    score=0.9,
    comment="Great answer!"
)

Key API Features

Feature            What It Does
create_dataset     Make new test set
create_example     Add test question
list_runs          See AI activity
create_feedback    Add scores/notes

๐Ÿ“ Prompt Versioning: Track Your Changes

What Is It?

Every time you change your AI's instructions (prompts), LangSmith saves a version. Like "Save As" but automatic!

Why Does It Matter?

Imagine your AI worked great yesterday but is broken today. With versioning, you can:

  • 🔙 Go back to the working version
  • 📊 Compare old vs new
  • 🎯 Find what change broke things

Simple Example

from langchain_core.prompts import ChatPromptTemplate

# Save a prompt version (push_prompt returns the URL of the new commit)
url = client.push_prompt(
    "support-assistant",
    object=ChatPromptTemplate.from_template(
        "You are a helpful..."
    ),
    tags=["v2", "improved"]
)

# Later, get the old version back
old_prompt = client.pull_prompt(
    "support-assistant:v1"
)

Version Control Best Practices

  1. Tag meaningful changes - "added-safety-rules"
  2. Test before promoting - Run experiments first (see the sketch below)
  3. Keep notes - Why did you change it?

graph TD
    A["Prompt v1"] --> B["Test Results: 70%"]
    B --> C["Edit Prompt"]
    C --> D["Prompt v2"]
    D --> E["Test Results: 85%"]
    E --> F["Promote v2 to Production"]

๐Ÿ” Debugging and Logging: Find Problems Fast

What Is It?

When your AI does something weird, logging shows you exactly what happened, step by step. It's like a security camera recording everything.

What Gets Logged?

  • 📥 Input - What the user asked
  • 🧠 Chain steps - Each part of your AI's thinking
  • 📤 Output - What your AI said
  • ⏱️ Timing - How long each step took
  • 💰 Tokens - How much it cost

Simple Example

import os

# Enable tracing before running your chains
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"

# Now all your LangChain code
# is automatically logged!

Reading Logs in LangSmith

graph TD
    A["User Question"] --> B["Retrieval Step"]
    B --> C["Context Found"]
    C --> D["LLM Call"]
    D --> E["Final Answer"]
    style A fill:#e3f2fd
    style E fill:#c8e6c9

Each step shows:

  • What went in
  • What came out
  • How long it took
  • Any errors
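
You can pull the same details from code as well. A small sketch using the client; the run ID is a placeholder, and field names like total_tokens may vary slightly by SDK version:

from langsmith import Client

client = Client()

# Inspect one logged run: inputs, outputs, timing, tokens, errors
run = client.read_run("your-run-id")
print(run.inputs)                     # what went in
print(run.outputs)                    # what came out
print(run.start_time, run.end_time)   # how long it took
print(run.total_tokens)               # token usage (cost signal)
print(run.error)                      # any errors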

Debugging Tips

Problem          Where to Look
Wrong answer     Check LLM input
Slow response    Check step timings
High cost        Check token counts
Missing info     Check retrieval step

Setting Log Detail Levels

import os

# Verbose logging for debugging
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "debug-session"

🎯 Putting It All Together

Here's how all these pieces work as a team:

graph TD
    A["Build AI with LangChain"] --> B["Create Dataset"]
    B --> C["Run Experiment"]
    C --> D["LLM-as-Judge Scores"]
    D --> E{Good Enough?}
    E -->|No| F["Check Logs"]
    F --> G["Update Prompt"]
    G --> H["Save Version"]
    H --> C
    E -->|Yes| I["Ship It!"]
    J["Annotation Queue"] --> K["Human Feedback"]
    K --> B

Quick Reference

Tool                 What It Does          When to Use
Datasets             Store test cases      Before testing
Experiments          Run tests             When changing AI
LLM-as-Judge         Auto-grade answers    For scale
Annotation Queues    Human review          For accuracy
Client API           Control everything    In your code
Prompt Versioning    Track changes         Always
Debugging/Logging    Find problems         When stuck

🚀 You Did It!

You now understand how to:

  • ✅ Create test datasets for your AI
  • ✅ Run experiments to measure quality
  • ✅ Use AI to grade AI outputs
  • ✅ Set up human review workflows
  • ✅ Control LangSmith with code
  • ✅ Track prompt versions
  • ✅ Debug with detailed logs

LangSmith Evaluation is your AI's best friend, helping you build AI that's not just smart, but reliably smart! 🎉
