What is imputation in data science?

Imputation means filling in missing values with smart guesses. Common methods include mean, median, and mode imputation.

What is an outlier and how do you handle it?

An outlier is a value way different from the rest. Investigate first to see if it's real, then keep, remove, or cap it.

Data Cleaning: Handle Missing Values | Data Science

What is data cleaning?

Data cleaning is finding what's wrong with data and fixing it so we can use it. It handles missing entries, errors, and duplicates.

Data Cleaning: The Art of Making Data Sparkle ✨

The Messy Kitchen Analogy 🍳

Imagine you’re about to cook a delicious meal. But your kitchen is a mess! There are dirty dishes everywhere, some ingredients are spoiled, and others are in the wrong place.

You can’t cook a great meal in a messy kitchen.

Data cleaning works the same way. Before we can learn from data, we need to clean it up first!

What is Data Cleaning?

Data cleaning is like being a data detective and a data doctor at the same time.

Your job:

Find what’s wrong with the data
Fix it so we can use it

Real Life Example: Think about your contact list on your phone. Some contacts might have:

No phone number (missing!)
A wrong number (error!)
Same person saved twice (duplicate!)

Data cleaning fixes all these problems.

Why Does Data Get Dirty?

Data gets messy for many reasons:

graph TD
    A["Data Gets Dirty"] --> B["Human Mistakes"]
    A --> C["System Errors"]
    A --> D["Missing Entries"]
    A --> E["Merging Sources"]
    B --> F["Typos"]
    B --> G["Wrong Format"]
    C --> H["Crashes"]
    D --> I["Forgot to Fill"]
    E --> J["Different Styles"]

Simple Examples:

Someone types “ten” instead of “10”
A form is submitted with blank fields
Two databases store dates differently

Handling Missing Values 🕳️

The Empty Box Problem

Imagine you’re counting your toys. You have 5 boxes. But when you open them:

Box 1: 3 cars
Box 2: EMPTY! (missing)
Box 3: 2 dolls
Box 4: EMPTY! (missing)
Box 5: 4 blocks

What do you do with the empty boxes?

This is exactly what we face with missing data!
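Here is a minimal sketch of the toy-box example in Python. The dictionary and the use of None for an empty box are assumptions made for illustration. It finds the empty boxes and works out what share of the data is missing.

```python
# Toy boxes from the example above; None marks an empty (missing) box.
boxes = {"Box 1": 3, "Box 2": None, "Box 3": 2, "Box 4": None, "Box 5": 4}

# Collect the names of boxes with no value.
missing = [name for name, count in boxes.items() if count is None]

# Share of boxes that are missing (2 out of 5).
missing_share = len(missing) / len(boxes)

print(missing)        # ['Box 2', 'Box 4']
print(missing_share)  # 0.4
```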

Types of Missing Data

1. Missing Completely at Random (MCAR)

Like when a coin falls under the couch. Pure accident. No pattern.

Example: Survey responses lost because the internet crashed randomly.

2. Missing at Random (MAR)

There’s a pattern, but it depends on other information we did record, not on the missing value itself.

Example: Young people skip the “retirement plans” question. Age (which we know) explains the gap, not the retirement plans themselves.

3. Missing Not at Random (MNAR)

The reason it’s missing is tied to the missing value itself.

Example: People don’t report income when they’re embarrassed about how much (or little) they make.

What Can We Do?

graph TD
    A["Missing Value Found!"] --> B{How Much Missing?}
    B -->|A Lot| C["Remove Row/Column"]
    B -->|A Little| D["Fill It In"]
    D --> E["Use Average"]
    D --> F["Use Most Common"]
    D --> G["Smart Prediction"]

Option 1: Delete It

If only a few rows have missing data, sometimes it’s easiest to just remove them.

When to use: As a rule of thumb, less than about 5% of rows have missing data and the gaps look random (MCAR).
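The sketch below shows one way to apply this rule. The function name drop_incomplete_rows, the 5% default, and the sample rows are all hypothetical. It keeps only complete rows, and it stops with an error if deleting would throw away too much.

```python
def drop_incomplete_rows(rows, max_missing_share=0.05):
    """Return only complete rows, refusing if too many rows would be lost."""
    complete = [r for r in rows if all(v is not None for v in r.values())]
    missing_share = 1 - len(complete) / len(rows)
    if missing_share > max_missing_share:
        raise ValueError(
            f"{missing_share:.0%} of rows are incomplete; consider imputation"
        )
    return complete


# Hypothetical data: 40 rows, one of which has a missing score.
rows = [{"id": i, "score": 70 + i % 10} for i in range(40)]
rows[7]["score"] = None

cleaned = drop_incomplete_rows(rows)
print(len(rows), len(cleaned))  # 40 39
```

Raising an error instead of silently deleting is a design choice here: it forces a decision when a lot of data is missing.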

Option 2: Fill It In (Imputation)

Use smart guesses to fill the empty spots.

When to use: You can’t afford to lose any data.

Imputation Techniques 🔧

What is Imputation?

Imputation means filling in the blanks with smart guesses.

Think of it like this: Your friend is telling a story, but they mumble one word. You guess what it was based on the rest of the sentence!

Simple Imputation Methods

1. Mean Imputation (Average)

Fill missing numbers with the average of all other numbers.

Example: Test scores: 80, 90, ?, 70, 85

Step 1: Find average of known scores (80 + 90 + 70 + 85) ÷ 4 = 81.25

Step 2: Fill the blank with 81.25

Good for: Numbers spread fairly symmetrically around the middle, with no extreme values.

Bad for: Data with outliers (very high or low values).
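The test-score example can be sketched with Python’s built-in statistics module. Using None for the missing score is an assumption made for illustration.

```python
from statistics import mean

# Test scores from the example; None is the missing one.
scores = [80, 90, None, 70, 85]

# Step 1: average the known scores.
known = [s for s in scores if s is not None]
fill_value = mean(known)

# Step 2: fill the blank with that average.
imputed = [fill_value if s is None else s for s in scores]

print(fill_value)  # 81.25
print(imputed)     # [80, 90, 81.25, 70, 85]
```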

2. Median Imputation (Middle Value)

Fill missing numbers with the middle value.

Example: Salaries: $30k, $35k, ?, $40k, $200k

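The $200k is an outlier! The mean of the known values is $76.25k, far more than most people on the list earn.

Median of known values: $37.5k (halfway between the two middle values, 35 and 40, of 30, 35, 40, 200)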

Good for: Data with outliers.
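A minimal sketch of the salary example, comparing the mean and the median of the known values. Salaries are written in thousands, and None stands in for the missing one (an assumption for illustration).

```python
from statistics import mean, median

# Salaries in $k from the example; None is the missing one.
salaries_k = [30, 35, None, 40, 200]
known = [s for s in salaries_k if s is not None]

print(mean(known))    # 76.25  (pulled up by the 200 outlier)
print(median(known))  # 37.5   (average of the two middle values, 35 and 40)

# Fill the blank with the median instead of the mean.
fill_value = median(known)
imputed = [fill_value if s is None else s for s in salaries_k]
print(imputed)        # [30, 35, 37.5, 40, 200]
```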

3. Mode Imputation (Most Common)

Fill missing values with the most frequent answer.

Example: Favorite colors: Red, Blue, Red, ?, Red, Blue

Most common = Red. Fill the blank with Red!

Good for: Categories (like colors, yes/no answers).
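The favorite-color example looks like this as a sketch. As before, None marks the missing answer, which is an assumption for illustration.

```python
from statistics import mode

# Favorite colors from the example; None is the missing answer.
colors = ["Red", "Blue", "Red", None, "Red", "Blue"]

# Find the most frequent known answer.
known = [c for c in colors if c is not None]
fill_value = mode(known)

# Fill the blank with it.
imputed = [fill_value if c is None else c for c in colors]

print(fill_value)  # Red
print(imputed)     # ['Red', 'Blue', 'Red', 'Red', 'Red', 'Blue']
```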

Advanced Imputation

K-Nearest Neighbors (KNN)

Look at similar data points. Use their values to guess.

Like this: You don’t know what movie your friend would like. You ask 5 friends with similar taste. 4 say “yes” to the movie. You guess your friend will like it too!
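Here is a small sketch of the KNN idea for a numeric column. It is not a full implementation: the function knn_impute, the height and weight data, and the choice of k=2 are all hypothetical. For each row with a missing value, it finds the k most similar complete rows and averages their values.

```python
from math import dist
from statistics import mean


def knn_impute(rows, target, k=2):
    """Fill missing `target` values with the mean of the k nearest complete rows."""
    features = [key for key in rows[0] if key != target]
    complete = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(dict(row))
            continue
        point = [row[f] for f in features]
        # Sort complete rows by Euclidean distance on the other columns.
        nearest = sorted(
            complete, key=lambda r: dist(point, [r[f] for f in features])
        )[:k]
        filled.append({**row, target: mean(r[target] for r in nearest)})
    return filled


# Hypothetical data: the last person's weight is missing.
people = [
    {"height_cm": 150, "weight_kg": 50},
    {"height_cm": 160, "weight_kg": 58},
    {"height_cm": 170, "weight_kg": 66},
    {"height_cm": 180, "weight_kg": 80},
    {"height_cm": 172, "weight_kg": None},
]

result = knn_impute(people, "weight_kg", k=2)
print(result[-1])  # {'height_cm': 172, 'weight_kg': 73}
```

The two nearest neighbors are the 170 cm and 180 cm rows, so the guess is the average of 66 and 80. With several feature columns on different scales, the columns would normally be scaled first so that no single one dominates the distance.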

Regression Imputation

Use math to predict the missing value based on patterns.

Like this: Taller people usually weigh more. If we know someone’s height, we can guess their weight.
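The height and weight idea can be sketched with a straight-line fit (simple least squares). The numbers below are made up and perfectly linear on purpose, so the result is easy to check by hand. The helper name fit_line is hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


# Hypothetical complete records: weight rises 1 kg per cm of height.
heights_cm = [150, 160, 170, 180]
weights_kg = [50, 60, 70, 80]

slope, intercept = fit_line(heights_cm, weights_kg)

# A person with known height (175 cm) but missing weight.
predicted_weight = slope * 175 + intercept
print(predicted_weight)  # 75.0
```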

Handling Outliers 🚨

What’s an Outlier?

An outlier is a value that’s way different from the rest.

Example: Your class heights: 4ft, 4.2ft, 4.1ft, 4.3ft, 8ft

Wait… 8 feet tall? That’s an outlier! Either:

It’s a mistake (someone typed wrong)
It’s real but unusual (basketball player!)

Finding Outliers

graph TD
    A["Find Outliers"] --> B["Visual Methods"]
    A --> C["Math Methods"]
    B --> D["Box Plots"]
    B --> E["Scatter Plots"]
    C --> F["Z-Score"]
    C --> G["IQR Method"]

The Box Plot Method (IQR)

Imagine putting all numbers in order, then drawing a box around the middle 50%.

Anything far outside the box = outlier!

Rule: If a value is more than 1.5 × IQR below the bottom of the box (Q1) or above the top of the box (Q3), it’s an outlier. The IQR is the height of the box: Q3 − Q1.
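A minimal sketch of the IQR rule on the class heights from earlier. The method="inclusive" setting is a choice made for this sketch. Different quartile conventions can give different answers on a sample this small.

```python
from statistics import quantiles

# Class heights in feet from the earlier example.
heights_ft = [4, 4.2, 4.1, 4.3, 8]

# Quartiles: Q1 is the bottom of the box, Q3 is the top.
q1, _, q3 = quantiles(heights_ft, n=4, method="inclusive")
iqr = q3 - q1

# Fences sit 1.5 x IQR outside the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [h for h in heights_ft if h < lower_fence or h > upper_fence]
print(outliers)  # [8]
```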

The Z-Score Method

Measures how many standard deviations a value is from the average.

Rule: If Z-score > 3 or < -3, it’s probably an outlier.
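Here is a sketch of the Z-score rule. The data is made up: twenty ordinary values near 50 plus one extreme value. A larger sample is used on purpose, because in a very small sample (such as five values) no Z-score can reach 3.

```python
from statistics import mean, stdev

# Hypothetical measurements: twenty values near 50, plus one extreme value.
values = [48, 49, 50, 51, 52] * 4 + [120]

mu = mean(values)
sigma = stdev(values)  # sample standard deviation

# Z-score = how many standard deviations a value is from the mean.
z_scores = [(v - mu) / sigma for v in values]

outliers = [v for v, z in zip(values, z_scores) if abs(z) > 3]
print(outliers)  # [120]
```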

What To Do With Outliers?

1. Investigate First!

Don’t just delete. Ask: “Is this real?”

Example: A $0 sale might be:

Error (should be $100)
Real (a refund or free sample)

2. Options for Handling

Strategy	When to Use
Keep it	It’s real and important
Remove it	It’s clearly an error
Cap it	It’s extreme but the row is worth keeping; replace it with the max acceptable value
Transform	The data is heavily skewed; use a log scale to reduce impact

Capping (Winsorizing)

Replace extreme values with a chosen limit (a maximum for high values, a minimum for low ones).

Example: Ages: 25, 30, 28, 32, 150

Cap at 100: Ages become 25, 30, 28, 32, 100
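The age example as a short sketch. The cap of 100 comes from the example above. The lower limit of 0 is an extra assumption added to show capping at both ends.

```python
# Ages from the example; 150 is clearly too high.
ages = [25, 30, 28, 32, 150]

LOWER_LIMIT = 0    # assumed minimum, not part of the original example
UPPER_LIMIT = 100  # cap from the example

# Clamp every value into the range [LOWER_LIMIT, UPPER_LIMIT].
capped = [min(max(age, LOWER_LIMIT), UPPER_LIMIT) for age in ages]
print(capped)  # [25, 30, 28, 32, 100]
```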

Data Wrangling 🤠

What is Data Wrangling?

Data wrangling is the cowboy work of data science!

Like a cowboy wrangles horses into the corral, we wrangle messy data into a clean, organized format.

The Four Steps of Wrangling

graph TD
    A["Raw Data"] --> B["1. Discover"]
    B --> C["2. Structure"]
    C --> D["3. Clean"]
    D --> E["4. Enrich"]
    E --> F["Ready to Use!"]

1. Discover

Look at your data. Understand what you have.

Questions to ask:

How many rows and columns?
What types of data? (numbers, text, dates)
What’s missing?

2. Structure

Organize data into the right shape.

Example: You might need to:

Split one column into two (“John Smith” → “John” + “Smith”)
Combine columns (“City” + “Country” → “Location”)
Reshape from wide to long format
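The three structuring tasks above can be sketched in plain Python. The field names and the sample values (Paris, France, the yearly numbers) are hypothetical.

```python
# One hypothetical record.
record = {"full_name": "John Smith", "city": "Paris", "country": "France"}

# Split one column into two.
first_name, last_name = record["full_name"].split(" ", 1)
print(first_name, last_name)  # John Smith

# Combine two columns into one.
location = f"{record['city']}, {record['country']}"
print(location)  # Paris, France

# Reshape wide to long: one column per year becomes one row per year.
wide = {"name": "John", "2022": 10, "2023": 12}
long_rows = [
    {"name": wide["name"], "year": year, "value": value}
    for year, value in wide.items()
    if year != "name"
]
print(long_rows)
```

Splitting on the first space only is a simplification. Real names with middle names or several surnames need more care.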

3. Clean

Fix all the problems we discussed:

Handle missing values
Fix outliers
Correct errors

4. Enrich

Add extra value:

Calculate new columns (age from birthdate)
Add external data (weather, holidays)
Create categories (group ages into “young”, “middle”, “old”)
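A sketch of two of these enrichment steps: working out age from a birthdate and grouping ages. The function names and the cut-offs (30 and 60) are assumptions for illustration, not fixed rules.

```python
from datetime import date


def age_on(birthdate, today):
    """Whole years between birthdate and today."""
    years = today.year - birthdate.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years


def age_group(age):
    """Bucket an age into a category (cut-offs are assumed)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle"
    return "old"


age = age_on(date(1990, 6, 15), date(2024, 6, 14))
print(age, age_group(age))  # 33 middle
```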

Common Wrangling Tasks

Removing Duplicates

Same record appearing twice? Delete the extra!

Example:

Name	Email
John	j@mail.com
John	j@mail.com
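A minimal sketch of removing the duplicate from the table above. It keeps the first copy of each record and preserves the original order. Treating name plus email as the identity of a record is an assumption.

```python
rows = [
    {"name": "John", "email": "j@mail.com"},
    {"name": "John", "email": "j@mail.com"},
]

seen = set()
unique_rows = []
for row in rows:
    key = (row["name"], row["email"])  # what counts as "the same record"
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

print(unique_rows)  # [{'name': 'John', 'email': 'j@mail.com'}]
```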

Fixing Data Types

Numbers stored as text? Dates in wrong format?

Example:

“25” (text) → 25 (number)
“12/31/2023” → 2023-12-31
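Both conversions can be sketched with built-in Python tools. The date format string assumes the text is in month/day/year order, which the example suggests but real data may not follow.

```python
from datetime import datetime

# Text to number.
age = int("25")
print(age + 1)  # 26 (it now behaves like a number)

# Text date (assumed month/day/year) to ISO format.
parsed = datetime.strptime("12/31/2023", "%m/%d/%Y").date()
iso_date = parsed.isoformat()
print(iso_date)  # 2023-12-31
```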

Standardizing Values

Same thing written differently?

Example:

“USA”, “U.S.A.”, “United States” → “USA”
“Male”, “M”, “male” → “Male”
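One simple way to standardize values is a lookup table of known spellings. The tables and the function name standardize below are hypothetical. Values that are not in the table are left unchanged so nothing is lost.

```python
# Known spellings (lowercase, dots removed) mapped to the standard form.
COUNTRY_ALIASES = {"usa": "USA", "united states": "USA"}
GENDER_ALIASES = {"male": "Male", "m": "Male"}


def standardize(value, aliases):
    """Map a value to its standard form, or return it unchanged if unknown."""
    key = value.replace(".", "").strip().lower()
    return aliases.get(key, value)


countries = [standardize(v, COUNTRY_ALIASES) for v in ["USA", "U.S.A.", "United States"]]
genders = [standardize(v, GENDER_ALIASES) for v in ["Male", "M", "male"]]

print(countries)  # ['USA', 'USA', 'USA']
print(genders)    # ['Male', 'Male', 'Male']
```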

The Wrangling Toolkit

Task	What It Does	Example
Filter	Keep only certain rows	Only adults
Sort	Order by a column	By date
Group	Combine similar items	By country
Join	Combine two tables	Add weather to sales
Pivot	Reshape data	Rows to columns
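Here is a sketch of four of these tools (filter, sort, group, join) in plain Python. The sales and weather data are made up. Pivot is left out to keep the sketch short.

```python
from collections import defaultdict

# Hypothetical sales rows and a weather lookup keyed by date.
sales = [
    {"date": "2024-01-02", "country": "US", "amount": 30},
    {"date": "2024-01-01", "country": "FR", "amount": 20},
    {"date": "2024-01-01", "country": "US", "amount": 50},
]
weather = {"2024-01-01": "rain", "2024-01-02": "sun"}

# Filter: keep only certain rows.
big_sales = [s for s in sales if s["amount"] >= 30]

# Sort: order by a column.
by_date = sorted(sales, key=lambda s: s["date"])

# Group: total amount per country.
totals = defaultdict(int)
for s in sales:
    totals[s["country"]] += s["amount"]

# Join: add the weather for each sale's date.
joined = [{**s, "weather": weather.get(s["date"])} for s in sales]

print(dict(totals))  # {'US': 80, 'FR': 20}
print(joined[0])     # {'date': '2024-01-02', 'country': 'US', 'amount': 30, 'weather': 'sun'}
```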

Putting It All Together 🎯

The Data Cleaning Workflow

graph TD
    A["Get Raw Data"] --> B["Explore & Understand"]
    B --> C["Find Missing Values"]
    C --> D["Handle Missing Values"]
    D --> E["Detect Outliers"]
    E --> F["Handle Outliers"]
    F --> G["Wrangle & Transform"]
    G --> H["Validate Results"]
    H --> I["Clean Data Ready!"]

Remember These Golden Rules

Always explore first - Look before you clean
Document everything - Write down what you changed
Never destroy original data - Keep a backup!
Question outliers - Don’t auto-delete
Validate after cleaning - Check your work

You Did It! 🎉

You now understand the fundamentals of data cleaning:

Data Cleaning = Making data usable
Missing Values = Empty spots we fill smartly
Imputation = Smart guessing techniques
Outliers = Unusual values to investigate
Data Wrangling = Organizing messy data

Remember: Clean data = Better insights = Smarter decisions!

Like a chef with a clean kitchen, you’re now ready to cook up some amazing data insights! 🍳📊
