🍕 Working with Data in Machine Learning
Your Pizza Kitchen Adventure
Imagine you want to become the best pizza chef in town. But you’ve never made pizza before! How do you learn?
You practice with pizzas. Lots of them.
Machine Learning works the same way. The computer is like a chef learning to cook. And data is the ingredients it uses to practice.
Let’s explore how we prepare data for our ML chef!
🎯 The Big Picture
```mermaid
graph TD
    A[📦 All Your Data] --> B[🎓 Training Data]
    A --> C[🧪 Validation Data]
    A --> D[📝 Test Data]
    B --> E[🤖 ML Model Learns]
    C --> F[🔧 Tune & Improve]
    D --> G[✅ Final Grade]
```
Think of it like cooking school:
- Training = Practice making pizzas
- Validation = Taste-test while learning
- Test = Final exam with mystery ingredients
🎓 Training Data
Your Practice Kitchen
What is it? Training data is what the computer uses to learn patterns. It’s like a student practicing with hundreds of example problems.
Pizza Analogy: You make 100 pizzas. Some are good, some are burnt. You learn from EACH one. That’s training!
Simple Example:
Training a spam filter:
📧 "You won $1000!" → SPAM
📧 "Meeting at 3pm" → NOT SPAM
📧 "Click here FREE!" → SPAM
📧 "Lunch tomorrow?" → NOT SPAM
The computer sees these examples and learns: “Hmm, words like ‘FREE’ and ‘won’ often mean spam!”
Key Points:
- Usually 70-80% of all data goes here
- More training data = better learning (usually!)
- Quality matters more than quantity
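Here's a minimal sketch of that spam-filter idea in Python. It assumes scikit-learn, and CountVectorizer plus MultinomialNB are just one reasonable tool choice (not the only one); the four emails are the toy examples above, not real training data.

```python
# A minimal sketch of learning from labeled examples, assuming scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["You won $1000!", "Meeting at 3pm", "Click here FREE!", "Lunch tomorrow?"]
labels = ["spam", "not spam", "spam", "not spam"]

vectorizer = CountVectorizer()            # turn each email into word counts
X = vectorizer.fit_transform(emails)      # features the model can learn from
model = MultinomialNB().fit(X, labels)    # learn which words signal spam

print(model.predict(vectorizer.transform(["FREE money, click now!"])))
```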
🧪 Validation Data
Your Taste-Testing Station
What is it? Validation data helps you check progress while still learning. It’s like having a friend taste your pizza while you’re still in cooking school.
Pizza Analogy: Your friend tries each pizza you make. They say “too salty!” or “perfect!” You adjust your recipe based on their feedback.
Simple Example:
Training: Learn from 800 emails
Validation: Check with 100 emails
"Is the spam filter getting better?"
Week 1: 60% accurate ❌
Week 2: 75% accurate 🔄
Week 3: 90% accurate ✅
Key Points:
- Usually 10-15% of all data
- Used to tune your model
- Helps prevent mistakes before the final test
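In code, "is it getting better?" is just counting correct predictions on the validation set. A hand-rolled sketch (the predictions are made up to match Week 2's 75% score):

```python
# Sketch: measuring validation accuracy by hand (toy labels and predictions).
val_labels  = ["spam", "not spam", "spam", "not spam"]
predictions = ["spam", "not spam", "not spam", "not spam"]  # one mistake

correct = sum(p == t for p, t in zip(predictions, val_labels))
print(f"Validation accuracy: {correct / len(val_labels):.0%}")  # 75%
```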
📝 Test Data
Your Final Exam
What is it? Test data is the final check. The model has NEVER seen this data before. It’s the true test of what it learned.
Pizza Analogy: A food critic comes to your restaurant. They order a pizza you’ve NEVER made before. Can you still make it delicious?
Simple Example:
After training the spam filter:
NEW emails it never saw:
📧 "Claim your prize!" → Model says: SPAM ✅
📧 "Project update" → Model says: NOT SPAM ✅
📧 "FREE gift card" → Model says: SPAM ✅
Test Score: 3/3 = 100% 🎉
Key Points:
- Usually 10-20% of all data
- Never peek at test data during training!
- This gives the TRUE accuracy score
✂️ Data Splitting Strategies
How to Slice Your Pizza Data
There are different ways to divide your data. Let’s explore!
1. Simple Split (Hold-Out)
The easiest method. Just divide once.
```mermaid
graph LR
    A[100% Data] --> B[70% Train]
    A --> C[15% Validation]
    A --> D[15% Test]
```
Example: You have 1000 cat/dog photos:
- 700 for training
- 150 for validation
- 150 for test
Best for: Large datasets
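Here's how that 700/150/150 split might look with scikit-learn's train_test_split. The random arrays are just placeholders standing in for the photos and their labels:

```python
# Sketch of a 70/15/15 hold-out split, assuming scikit-learn.
from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.rand(1000, 4)          # placeholder for 1000 photos' features
y = np.random.randint(0, 2, 1000)    # placeholder labels: 0 = cat, 1 = dog

# Carve off 30% first, then split that 30% evenly into validation and test.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```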
2. K-Fold Cross-Validation
Rotate which data is used for training and testing.
```mermaid
graph TD
    A[Data Split into 5 Parts] --> B[Round 1: Part 1 tests]
    B --> C[Round 2: Part 2 tests]
    C --> D[Round 3: Part 3 tests]
    D --> E[Round 4: Part 4 tests]
    E --> F[Round 5: Part 5 tests]
    F --> G[Average All Scores]
```
Pizza Analogy: Every chef in the kitchen takes turns being the “judge.” Everyone judges AND cooks. Fair for all!
Best for: Small datasets (when you can’t waste any data)
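A sketch of the rotation with scikit-learn's cross_val_score. LogisticRegression is just an illustrative model choice, and the data is random filler:

```python
# Sketch of 5-fold cross-validation, assuming scikit-learn.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.random.rand(100, 4)           # small placeholder dataset
y = np.random.randint(0, 2, 100)

scores = cross_val_score(LogisticRegression(), X, y, cv=5)  # 5 rounds
print(scores, scores.mean())         # one score per round, then the average
```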
3. Stratified Splitting
Keep the same mix in all parts.
Example: Your data has:
- 80 cats
- 20 dogs
Without stratifying:
- Training might get 75 cats, 5 dogs
- Test might get 5 cats, 15 dogs 😰
With stratifying:
- Training: 64 cats, 16 dogs (80/20 ratio ✅)
- Test: 16 cats, 4 dogs (80/20 ratio ✅)
Best for: Imbalanced data
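scikit-learn's train_test_split can keep the mix for you via its stratify argument. A sketch using the 80-cat/20-dog example (the features are placeholders):

```python
# Sketch: a stratified 80/20 split that keeps the 80-cat / 20-dog mix.
from sklearn.model_selection import train_test_split
import numpy as np

y = np.array(["cat"] * 80 + ["dog"] * 20)
X = np.random.rand(100, 4)           # placeholder features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
print((y_train == "dog").sum(), (y_test == "dog").sum())  # 16 4
```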
🧩 Features and Feature Vectors
The Ingredients List
What is a Feature? A feature is one piece of information about something. It’s like one ingredient in a recipe.
Example - Predicting House Prices:
| Feature | Value |
|---|---|
| Bedrooms | 3 |
| Bathrooms | 2 |
| Square feet | 1500 |
| Age (years) | 10 |
Each column is ONE feature.
What is a Feature Vector? A feature vector is ALL features together as a list.
Example:
House 1: [3, 2, 1500, 10]
House 2: [4, 3, 2000, 5]
House 3: [2, 1, 900, 30]
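In Python, these vectors are usually stacked into a single NumPy matrix, one row per example:

```python
# Sketch: the three houses above as rows of a feature matrix.
import numpy as np

# columns: bedrooms, bathrooms, square feet, age (years)
houses = np.array([
    [3, 2, 1500, 10],   # House 1
    [4, 3, 2000,  5],   # House 2
    [2, 1,  900, 30],   # House 3
])
print(houses.shape)     # (3, 4): 3 examples, 4 features each
```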
Pizza Analogy: Each pizza has a feature vector:
Margherita: [tomato, mozzarella, basil, thin_crust]
Pepperoni: [tomato, mozzarella, pepperoni, regular_crust]
The computer looks at these features to find patterns!
🏷️ Labels and Target Variables
The Answer Key
What is a Label? The label is what we want to predict. It’s the “answer” for each example.
Pizza Analogy: You show the computer pictures of food. The LABEL tells it what each food is:
- 🍕 Picture 1 → Label: “Pizza”
- 🍔 Picture 2 → Label: “Burger”
- 🌮 Picture 3 → Label: “Taco”
Features vs Labels:
| Features (Input) | Label (Output) |
|---|---|
| Size, bedrooms, location | House Price |
| Words in email | Spam or Not |
| Patient symptoms | Disease name |
| Weather conditions | Rain tomorrow? |
Example:
Email: "Click here to win FREE money!"
Features: [has_click, has_free, has_money, has_exclamation]
[true, true, true, true]
Label: SPAM
Key Points:
- Features = what we KNOW (input)
- Label = what we want to FIND (output)
- Training data has BOTH
- Test data: we hide the labels to check predictions!
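By convention, the features go in X and the labels in y. A tiny sketch using the spam email above plus a second, hypothetical email with none of the flags:

```python
# Sketch: features (X) vs labels (y). The boolean flags are the
# hypothetical features from the example above.
X = [
    [True,  True,  True,  True],   # "Click here to win FREE money!"
    [False, False, False, False],  # a plain email with none of the flags
]
y = ["spam", "not spam"]           # labels: the answers we want to predict
```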
🔢 Continuous vs Categorical Variables
Numbers vs Categories
Continuous Variables
Numbers that can be ANY value on a scale.
Examples:
- Temperature: 72.5°F, 73.1°F, 68.9°F
- Height: 5.2 feet, 6.1 feet
- Price: $25.99, $100.50
- Time: 3.5 hours
Key trait: You can have values “in between” (like 72.5 degrees)
Categorical Variables
Groups or categories - no “in between” values.
Examples:
- Colors: Red, Blue, Green
- Animal type: Cat, Dog, Bird
- T-shirt size: S, M, L, XL
- Weather: Sunny, Rainy, Cloudy
Key trait: Things either ARE or ARE NOT in a category
Special Case: Ordinal Categorical
Categories with a natural order.
Examples:
- Education: High School → Bachelor’s → Master’s → PhD
- Satisfaction: Unhappy → Neutral → Happy
- Size: Small → Medium → Large
Quick Comparison
| Type | Example | Math Operations |
|---|---|---|
| Continuous | Temperature: 72.5°F | Can add, average |
| Categorical | Color: “Red” | Cannot add |
| Ordinal | Size: “Medium” | Can compare order |
Pizza Analogy:
- Continuous: Pizza diameter = 12.5 inches
- Categorical: Pizza type = “Margherita”
- Ordinal: Spice level = “Mild” < “Medium” < “Hot”
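In pandas, those three pizza variables might look like the sketch below; get_dummies shows one common way (one-hot encoding) to turn a plain category into numbers a model can use:

```python
# Sketch: continuous, categorical, and ordinal columns in pandas.
import pandas as pd

pizzas = pd.DataFrame({
    "diameter_in": [12.5, 14.0, 10.0],                   # continuous
    "type": ["Margherita", "Pepperoni", "Margherita"],   # categorical
    "spice": pd.Categorical(
        ["Mild", "Hot", "Medium"],
        categories=["Mild", "Medium", "Hot"], ordered=True  # ordinal
    ),
})
print(pd.get_dummies(pizzas, columns=["type"]))  # one-hot encode "type"
```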
🎯 Putting It All Together
Let’s see a real example!
Goal: Predict if a student will pass an exam.
Step 1: Collect Data
| Hours Studied | Sleep Hours | Practice Tests | Passed? |
|---|---|---|---|
| 10 | 8 | 3 | Yes |
| 2 | 4 | 0 | No |
| 8 | 7 | 2 | Yes |
| … | … | … | … |
Step 2: Identify Components
- Features: Hours Studied, Sleep Hours, Practice Tests
- Feature Vector: [10, 8, 3]
- Label: Passed? (Yes/No)
- Variable Types:
- Hours Studied = Continuous
- Practice Tests = Discrete count (whole numbers only, no "in between" values; could also be treated as ordinal)
- Passed = Categorical (Yes or No)
Step 3: Split Data
- 80 students total
- Training: 56 students (70%)
- Validation: 12 students (15%)
- Test: 12 students (15%)
Step 4: Train, Validate, Test!
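Here's the whole recipe as one hedged sketch. The student data is synthetic (generated from a made-up pass/fail rule), and LogisticRegression is just one reasonable model choice:

```python
# End-to-end sketch of the student-exam example, assuming scikit-learn.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Step 1: synthetic data for 80 students (a made-up pass/fail rule).
rng = np.random.default_rng(42)
hours = rng.uniform(0, 12, 80)
sleep = rng.uniform(4, 9, 80)
tests = rng.integers(0, 4, 80)
X = np.column_stack([hours, sleep, tests])   # Step 2: feature vectors
y = (hours + 2 * tests > 10).astype(int)     # label: 1 = passed

# Step 3: 70/15/15 split (56 / 12 / 12 students), stratified by label.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=0)

# Step 4: train, validate, test.
model = LogisticRegression().fit(X_train, y_train)
print("validation:", accuracy_score(y_val, model.predict(X_val)))
print("test:      ", accuracy_score(y_test, model.predict(X_test)))
```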
🌟 Key Takeaways
- Training Data = Your practice problems (70-80%)
- Validation Data = Progress checks (10-15%)
- Test Data = Final exam, never peek! (10-20%)
- Features = Input information (ingredients)
- Labels = The answers we want to predict
- Continuous = Numbers on a scale (temperature)
- Categorical = Groups/Categories (colors)
🚀 You’re Ready!
You now understand how to work with data in Machine Learning. Just like a pizza chef needs good ingredients to make great pizza, an ML model needs well-prepared data to make great predictions!
Remember: Good data = Good predictions. Bad data = Bad predictions. It’s that simple!
Now go slice up some data and start cooking with ML! 🍕🤖