🧹 Data Cleaning in R: Cutting, Binning & Missing Values

The Messy Room Story

Imagine your room is super messy. Toys everywhere! Books scattered! Some things are even missing (where did that sock go?). Before you can play, you need to clean up and organize.

Data cleaning is the same! Real-world data is messy. Numbers are all over the place, and some values just… disappear. Today, we’ll learn two superpowers:

  1. Cutting & Binning → Organizing scattered numbers into neat groups
  2. Missing Value Handling → Finding and fixing the “lost socks” in your data

🎯 Part 1: Cutting and Binning

What’s the Big Idea?

Think of a toy sorting box with different compartments labeled “Small,” “Medium,” and “Large.” Instead of having 100 different toy sizes, you just sort them into 3 groups!

graph TD
    A["Messy Numbers: 5, 23, 67, 12, 89, 45"] --> B["Sorting Box"]
    B --> C["Low: 5, 12"]
    B --> D["Medium: 23, 45"]
    B --> E["High: 67, 89"]

Why do this?

  • Makes patterns easier to see
  • Simplifies analysis
  • Groups similar things together

The cut() Function - Your Sorting Tool

The cut() function is like having a magical sorting machine!

Basic Recipe:

cut(x, breaks, labels)

| Part   | What It Does           |
|--------|------------------------|
| x      | Your messy numbers     |
| breaks | Where to make the cuts |
| labels | Names for each group   |
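As a quick instance of the recipe, here is a minimal sketch that sorts the numbers from the sorting-box diagram. The break points (0, 20, 50, 100) are an assumption chosen so the groups match the diagram; the original does not state them.

```r
# Numbers from the sorting-box diagram
x <- c(5, 23, 67, 12, 89, 45)

# Break points are an assumption picked to reproduce the diagram's groups
size_group <- cut(
  x,
  breaks = c(0, 20, 50, 100),
  labels = c("Low", "Medium", "High")
)

# Collect the original numbers under each group name
boxes <- split(x, size_group)
print(boxes)
# $Low: 5, 12   $Medium: 23, 45   $High: 67, 89
```

split() is used here only to show which numbers landed in which box; cut() itself returns one group label per number.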

Example 1: Sorting Ages into Groups

Imagine you have ages of kids at a party:

# Kids' ages
ages <- c(5, 8, 12, 7, 15, 3, 10)

# Sort into groups
age_groups <- cut(
  ages,
  breaks = c(0, 6, 12, 18),
  labels = c("Little", "Medium", "Teen")
)

print(age_groups)
# Little, Medium, Medium,
# Medium, Teen, Little, Medium

What happened?

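  • Ages above 0 and up to 6 → “Little”
  • Ages above 6 and up to 12 → “Medium”
  • Ages above 12 and up to 18 → “Teen”

Each group includes its upper break but not its lower one, so a 12-year-old lands in “Medium,” not “Teen.”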
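To see exactly where the edges of each group fall, one option is to run cut() without labels. A small sketch using the same ages; R then names each group with interval notation, where "(" means "not included" and "]" means "included":

```r
ages <- c(5, 8, 12, 7, 15, 3, 10)

# No labels: R names each group after its interval
raw_groups <- cut(ages, breaks = c(0, 6, 12, 18))

levels(raw_groups)
# "(0,6]"  "(6,12]"  "(12,18]"

# Count how many kids fall into each interval
group_counts <- table(raw_groups)
print(group_counts)
# (0,6]: 2, (6,12]: 4, (12,18]: 1
```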

Example 2: Test Scores to Grades

scores <- c(92, 78, 65, 88, 45, 95)

grades <- cut(
  scores,
  breaks = c(0, 60, 70, 80, 90, 100),
  labels = c("F", "D", "C", "B", "A")
)

print(grades)
# A, C, D, B, F, A

Now messy numbers become clear letter grades!
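One thing to watch: with the default settings, a score of exactly 90 falls in the "B" group and a score of 0 becomes NA, because each group includes its upper break only. If your grading scale says "90 and above is an A," a possible variation (an assumption about the scale, not part of the example above) is to flip the closed side with right = FALSE:

```r
# Scores that sit exactly on the edges
edge_scores <- c(0, 60, 90, 100)

# Default: groups include their upper break
default_grades <- cut(
  edge_scores,
  breaks = c(0, 60, 70, 80, 90, 100),
  labels = c("F", "D", "C", "B", "A")
)
print(default_grades)
# NA, F, B, A

# right = FALSE: groups include their lower break instead;
# include.lowest = TRUE then keeps the top score (100) in the last group
left_closed_grades <- cut(
  edge_scores,
  breaks = c(0, 60, 70, 80, 90, 100),
  labels = c("F", "D", "C", "B", "A"),
  right = FALSE,
  include.lowest = TRUE
)
print(left_closed_grades)
# F, D, A, A
```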


The include.lowest Secret

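By default, each group excludes its lower break and includes its upper one, so a value exactly equal to the lowest break (here, an age of 0) becomes NA. Setting include.lowest = TRUE fixes this: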

# Include the minimum value
cut(ages,
    breaks = c(0, 6, 12, 18),
    labels = c("Little", "Medium", "Teen"),
    include.lowest = TRUE)
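A side-by-side sketch makes the difference visible. The vector below is a made-up example that includes an age of exactly 0:

```r
# Hypothetical ages, including one that equals the lowest break
ages_with_zero <- c(0, 5, 12)

without_lowest <- cut(
  ages_with_zero,
  breaks = c(0, 6, 12, 18),
  labels = c("Little", "Medium", "Teen")
)
print(without_lowest)
# NA, Little, Medium

with_lowest <- cut(
  ages_with_zero,
  breaks = c(0, 6, 12, 18),
  labels = c("Little", "Medium", "Teen"),
  include.lowest = TRUE
)
print(with_lowest)
# Little, Little, Medium
```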

Quick Binning with ntile()

Need equal-sized groups? Use ntile() from dplyr:

library(dplyr)

# Split into 3 equal groups
ages <- c(5, 8, 12, 7, 15, 3, 10)
ntile(ages, 3)
# 1, 2, 3, 1, 3, 1, 2

Each group gets roughly the same number of items!
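"Equal-sized" here means equal counts, not equal ranges. The sketch below contrasts ntile() with cut(x, breaks = 2), which splits the range of the values into two equally wide pieces. The vector and the label names are made up for illustration:

```r
library(dplyr)

# Hypothetical data with one very large value
x <- c(1, 2, 3, 4, 5, 100)

# ntile(): same number of items per group
equal_count <- ntile(x, 2)
print(equal_count)
# 1, 1, 1, 2, 2, 2

# cut() with a single number: same width per group
equal_width <- cut(
  x,
  breaks = 2,
  labels = c("lower range", "upper range")
)
print(table(equal_width))
# lower range: 5, upper range: 1
```

When the data has outliers, equal-width groups can end up nearly empty, while ntile() keeps the groups balanced.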


🕳️ Part 2: Missing Value Handling

The Mystery of NA

In R, missing values are shown as NA (Not Available). It’s like a blank space where a number should be.

Where do NAs come from?

  • Someone forgot to fill in a form
  • A sensor broke
  • Data got lost during transfer

# A vector with missing values
temps <- c(72, NA, 68, 75, NA, 70)

Finding Missing Values

Question: “Do I have any missing socks… I mean, values?”

# Check each value
is.na(temps)
# FALSE, TRUE, FALSE, FALSE, TRUE, FALSE

# Count the missing ones
sum(is.na(temps))
# 2

# Are ANY missing?
any(is.na(temps))
# TRUE

graph TD
    A["Your Data"] --> B{is.na?}
    B -->|TRUE| C["Missing! 🕳️"]
    B -->|FALSE| D["Got it! ✓"]
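A common slip is to test for missing values with == NA. Comparing anything to NA gives NA back, which is why is.na() is needed. The sketch below also shows how to locate the gaps and count them per column; the humidity column is a hypothetical addition for illustration:

```r
temps <- c(72, NA, 68, 75, NA, 70)

# Comparing with NA never gives TRUE or FALSE, only NA
compare_result <- temps == NA
print(compare_result)
# NA, NA, NA, NA, NA, NA

# Positions of the missing values
missing_positions <- which(is.na(temps))
print(missing_positions)
# 2, 5

# Missing values per column of a data frame
readings <- data.frame(
  temp = temps,
  humidity = c(40, 42, NA, 45, 41, 39)  # hypothetical column
)
missing_per_column <- colSums(is.na(readings))
print(missing_per_column)
# temp: 2, humidity: 1
```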

Strategy 1: Remove Missing Values

Sometimes the easiest fix is to just skip the NAs.

temps <- c(72, NA, 68, 75, NA, 70)

# Remove NAs completely
clean_temps <- na.omit(temps)
print(clean_temps)
# 72, 68, 75, 70

# Or use complete.cases for data frames
df <- data.frame(
  name = c("Ana", "Bob", "Cat"),
  age = c(10, NA, 8)
)
df[complete.cases(df), ]
# Keeps only the rows with no NA: Ana and Cat

When to use: You have lots of data and losing a few rows is okay.
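complete.cases() drops a row if any column is missing, which can throw away more than you intend. A minimal sketch, assuming you only care about one column; the email column is a hypothetical addition:

```r
df <- data.frame(
  name  = c("Ana", "Bob", "Cat"),
  age   = c(10, NA, 8),
  email = c(NA, "bob@example.com", "cat@example.com")  # hypothetical column
)

# Strict: keep rows with no NA in any column
all_complete <- df[complete.cases(df), ]
print(all_complete)
# Only Cat survives

# Targeted: keep rows where age is known, whatever the other columns hold
age_known <- df[!is.na(df$age), ]
print(age_known)
# Ana and Cat survive

# How many rows did the strict version cost?
rows_lost <- nrow(df) - nrow(all_complete)
print(rows_lost)
# 2
```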


Strategy 2: Replace with a Fixed Value

Fill the gaps with a specific number:

temps <- c(72, NA, 68, 75, NA, 70)

# Replace NA with 0
temps[is.na(temps)] <- 0

# Or use tidyr's replace_na
library(tidyr)
temps <- c(72, NA, 68, 75, NA, 70)
replace_na(temps, 0)
# 72, 0, 68, 75, 0, 70
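Be careful which fixed value you pick. The sketch below shows how filling temperatures with 0 drags the average down, since 0 is treated as a real reading:

```r
temps <- c(72, NA, 68, 75, NA, 70)

# Copy the data, then fill the gaps with 0
zero_filled <- temps
zero_filled[is.na(zero_filled)] <- 0

# Average of the four real readings
real_mean <- mean(temps, na.rm = TRUE)
print(real_mean)
# 71.25

# Average after pretending the gaps were 0 degrees
zero_mean <- mean(zero_filled)
print(zero_mean)
# 47.5
```

A fixed value tends to work best when it has a real meaning, such as 0 for "no purchases recorded."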

Strategy 3: Replace with Mean/Median

Fill gaps with the average value - a smart guess!

temps <- c(72, NA, 68, 75, NA, 70)

# Calculate mean (ignoring NAs)
avg_temp <- mean(temps, na.rm = TRUE)
# avg_temp = 71.25

# Fill NAs with average
temps[is.na(temps)] <- avg_temp
# 72, 71.25, 68, 75, 71.25, 70

The na.rm = TRUE trick: Many summary functions, such as mean(), sum(), and max(), return NA if any input value is NA. Add na.rm = TRUE to ignore the missing values!

mean(c(1, NA, 3))        # NA (broken!)
mean(c(1, NA, 3), na.rm=TRUE)  # 2 (works!)
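The same pattern works with the median, which is less affected by extreme values than the mean. A short sketch, using ifelse() as one possible way to fill the gaps without overwriting the original vector:

```r
temps <- c(72, NA, 68, 75, NA, 70)

# Middle value of the known readings
mid_temp <- median(temps, na.rm = TRUE)
print(mid_temp)
# 71

# Build a filled copy: use the median where a value is missing
filled_temps <- ifelse(is.na(temps), mid_temp, temps)
print(filled_temps)
# 72, 71, 68, 75, 71, 70

# na.rm = TRUE works the same way in other summary functions
total_temp <- sum(temps, na.rm = TRUE)
print(total_temp)
# 285
```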

Strategy 4: Forward/Backward Fill

Use the previous (or next) value to fill gaps:

library(tidyr)

temps <- c(72, NA, NA, 75, NA, 70)

# Fill with previous value
fill(data.frame(t=temps), t,
     .direction = "down")
# 72, 72, 72, 75, 75, 70

# Fill with next value
fill(data.frame(t=temps), t,
     .direction = "up")
# 72, 75, 75, 75, 70, 70

When to use: Time series data where values flow naturally.
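One gap that a downward fill cannot close is a missing value at the very start, because there is no earlier value to copy. A sketch with made-up readings, using the "downup" direction that fill() also accepts (fill down first, then up for whatever is left):

```r
library(tidyr)

# Hypothetical readings where the first value is missing
readings <- data.frame(t = c(NA, 72, NA, 75))

# Down only: the leading NA has nothing above it to copy
down_only <- fill(readings, t, .direction = "down")
print(down_only$t)
# NA, 72, 72, 75

# Down, then up: the leading NA borrows the next known value
both_ways <- fill(readings, t, .direction = "downup")
print(both_ways$t)
# 72, 72, 72, 75
```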


🧩 Putting It All Together

Here’s a real example combining both skills:

library(dplyr)
library(tidyr)

# Messy student data
students <- data.frame(
  name = c("Ava", "Ben", "Cara", "Dan"),
  score = c(85, NA, 72, 91)
)

# Step 1: Handle missing scores
students <- students %>%
  mutate(score = replace_na(
    score,
    mean(score, na.rm = TRUE)
  ))

# Step 2: Bin into grade groups
students <- students %>%
  mutate(grade = cut(
    score,
    breaks = c(0, 60, 70, 80, 90, 100),
    labels = c("F","D","C","B","A")
  ))

print(students)
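To check the result by hand, here is a base R sketch of the same two steps on the score column alone. It is an equivalent walk-through under the same breaks and labels, not a replacement for the pipeline above:

```r
scores <- c(85, NA, 72, 91)

# Step 1: fill the gap with the mean of the known scores (85, 72, 91)
scores[is.na(scores)] <- mean(scores, na.rm = TRUE)
print(round(scores, 2))
# 85, 82.67, 72, 91

# Step 2: bin into grade groups
grades <- cut(
  scores,
  breaks = c(0, 60, 70, 80, 90, 100),
  labels = c("F", "D", "C", "B", "A")
)
print(grades)
# B, B, C, A

# Confirm nothing is missing any more
print(anyNA(scores))
# FALSE
```

Ben's filled-in score of about 82.67 puts him in the "B" group, the same as Ava.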

🎯 Quick Reference

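| Task         | Function      | Example                       |
|--------------|---------------|-------------------------------|
| Create bins  | cut()         | cut(x, breaks=c(0,50,100))    |
| Equal groups | ntile()       | ntile(x, 4)                   |
| Find NAs     | is.na()       | is.na(x)                      |
| Remove NAs   | na.omit()     | na.omit(x)                    |
| Replace NAs  | replace_na()  | replace_na(x, 0)              |
| Ignore NAs   | na.rm=TRUE    | mean(x, na.rm=TRUE)           |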

🏆 You Did It!

You now have two powerful data cleaning superpowers:

✅ Cutting & Binning - Transform messy numbers into organized groups
✅ Missing Value Handling - Find and fix the gaps in your data

Remember: Clean data = Happy analysis! 🎉

Just like cleaning your room makes it easier to find your toys, cleaning your data makes it easier to find insights!
