🧹 Data Cleaning in R: Cutting, Binning & Missing Values
The Messy Room Story
Imagine your room is super messy. Toys everywhere! Books scattered! Some things are even missing (where did that sock go?). Before you can play, you need to clean up and organize.
Data cleaning is the same! Real-world data is messy. Numbers are all over the place, and some values just… disappear. Today, we’ll learn two superpowers:
- Cutting & Binning → Organizing scattered numbers into neat groups
- Missing Value Handling → Finding and fixing the “lost socks” in your data
🎯 Part 1: Cutting and Binning
What’s the Big Idea?
Think of a toy sorting box with different compartments labeled “Small,” “Medium,” and “Large.” Instead of having 100 different toy sizes, you just sort them into 3 groups!
graph TD
  A["Messy Numbers: 5, 23, 67, 12, 89, 45"] --> B["Sorting Box"]
  B --> C["Low: 5, 12"]
  B --> D["Medium: 23, 45"]
  B --> E["High: 67, 89"]
Why do this?
- Makes patterns easier to see
- Simplifies analysis
- Groups similar things together
The cut() Function - Your Sorting Tool
The cut() function is like having a magical sorting machine!
Basic Recipe:
cut(x, breaks, labels)
| Part | What It Does |
|---|---|
| x | Your messy numbers |
| breaks | Where to make the cuts |
| labels | Names for each group |
Example 1: Sorting Ages into Groups
Imagine you have ages of kids at a party:
# Kids' ages
ages <- c(5, 8, 12, 7, 15, 3, 10)
# Sort into groups
age_groups <- cut(
  ages,
  breaks = c(0, 6, 12, 18),
  labels = c("Little", "Medium", "Teen")
)
print(age_groups)
# Little, Medium, Medium,
# Medium, Teen, Little, Medium
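What happened? Each group starts just above one break and runs up to (and includes) the next one:
- Above 0, up to 6 → “Little”
- Above 6, up to 12 → “Medium”
- Above 12, up to 18 → “Teen”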
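You can check where the edges fall yourself. Here is a minimal sketch, assuming the same breaks as above. When labels are left out, cut() names each group by its interval, and the optional right = FALSE argument moves the included edge to the left side. The vector edge_ages is made up for illustration.

```r
# Made-up ages that sit right on (or just past) the break points
edge_ages <- c(6, 7, 12, 13)

# Without labels, cut() names each group by its interval
default_groups <- cut(edge_ages, breaks = c(0, 6, 12, 18))
print(default_groups)
# 6 lands in (0,6]; 7 and 12 land in (6,12]; 13 lands in (12,18]

# right = FALSE includes the left edge instead: [0,6), [6,12), [12,18)
left_closed <- cut(edge_ages, breaks = c(0, 6, 12, 18), right = FALSE)
print(left_closed)
# 6 and 7 land in [6,12); 12 and 13 land in [12,18)
```

The square bracket marks the edge that is included, and the round bracket marks the edge that is left out.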
Example 2: Test Scores to Grades
scores <- c(92, 78, 65, 88, 45, 95)
grades <- cut(
  scores,
  breaks = c(0, 60, 70, 80, 90, 100),
  labels = c("F", "D", "C", "B", "A")
)
print(grades)
# A, C, D, B, F, A
Now messy numbers become clear letter grades!
The include.lowest Secret
By default, each group leaves out its lower edge, so a value exactly equal to the smallest break (here, an age of 0) lands in no group and becomes NA. Fix this:
# Include the minimum value
cut(ages,
    breaks = c(0, 6, 12, 18),
    labels = c("Little", "Medium", "Teen"),
    include.lowest = TRUE)
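To see the difference, here is a small sketch that uses a made-up vector (ages_with_zero) containing a value exactly on the lowest break. The breaks and labels are the same as in the example above.

```r
# Made-up ages: 0 sits exactly on the lowest break, 18 on the highest
ages_with_zero <- c(0, 5, 18)
age_breaks <- c(0, 6, 12, 18)
age_labels <- c("Little", "Medium", "Teen")

# Default behaviour: the 0 falls outside every group
default_cut <- cut(ages_with_zero, breaks = age_breaks, labels = age_labels)
print(default_cut)
# NA, Little, Teen

# With include.lowest = TRUE the 0 joins the first group
inclusive_cut <- cut(ages_with_zero, breaks = age_breaks,
                     labels = age_labels, include.lowest = TRUE)
print(inclusive_cut)
# Little, Little, Teen
```

The highest break (18) is already included by default, so only the lowest edge needs the extra argument.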
Quick Binning with ntile()
Need equal-sized groups? Use ntile() from dplyr:
library(dplyr)
# Split into 3 equal groups
ages <- c(5, 8, 12, 7, 15, 3, 10)
ntile(ages, 3)
# 1, 2, 3, 1, 3, 1, 2
Each group gets roughly the same number of items!
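Seven ages cannot split evenly into three groups, so the group sizes differ by at most one. A quick sketch to check this, assuming dplyr is installed (the names groups and group_sizes are just for illustration):

```r
library(dplyr)

ages <- c(5, 8, 12, 7, 15, 3, 10)
groups <- ntile(ages, 3)

# How many ages landed in each group?
group_sizes <- table(groups)
print(group_sizes)

# Which ages share a group?
print(split(ages, groups))
```

Note that ntile() sorts by rank, not by value ranges, so the group edges depend on the data you give it.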
🕳️ Part 2: Missing Value Handling
The Mystery of NA
In R, missing values are shown as NA (Not Available). It’s like a blank space where a number should be.
Where do NAs come from?
- Someone forgot to fill in a form
- A sensor broke
- Data got lost during transfer
# A vector with missing values
temps <- c(72, NA, 68, 75, NA, 70)
Finding Missing Values
Question: “Do I have any missing socks… I mean, values?”
# Check each value
is.na(temps)
# FALSE, TRUE, FALSE, FALSE, TRUE, FALSE
# Count the missing ones
sum(is.na(temps))
# 2
# Are ANY missing?
any(is.na(temps))
# TRUE
graph TD
  A["Your Data"] --> B{is.na?}
  B -->|TRUE| C["Missing! 🕳️"]
  B -->|FALSE| D["Got it! ✓"]
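The same checks work on a whole data frame. This is a minimal sketch using a made-up data frame called survey. It relies only on base R: is.na() returns a TRUE/FALSE table, and colSums() counts the TRUEs in each column.

```r
# Made-up survey data with gaps in two columns
survey <- data.frame(
  name   = c("Ana", "Bob", "Cat", "Dee"),
  age    = c(10, NA, 8, NA),
  height = c(140, 135, NA, 128)
)

# Count the missing values in every column at once
na_per_column <- colSums(is.na(survey))
print(na_per_column)
# name: 0, age: 2, height: 1

# which() tells you the positions of the missing values in one column
missing_age_rows <- which(is.na(survey$age))
print(missing_age_rows)
# 2, 4
```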
Strategy 1: Remove Missing Values
Sometimes the easiest fix is to just skip the NAs.
temps <- c(72, NA, 68, 75, NA, 70)
# Remove NAs completely
clean_temps <- na.omit(temps)
print(clean_temps)
# 72, 68, 75, 70
# Or use complete.cases for data frames
df <- data.frame(
  name = c("Ana", "Bob", "Cat"),
  age = c(10, NA, 8)
)
df[complete.cases(df), ]
When to use: You have lots of data and losing a few rows is okay.
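Before dropping rows, it helps to count how many you would lose. A small sketch with the same df as above (the names keep, clean_df, and rows_dropped are just for illustration):

```r
df <- data.frame(
  name = c("Ana", "Bob", "Cat"),
  age  = c(10, NA, 8)
)

# complete.cases() gives one TRUE/FALSE per row
keep <- complete.cases(df)
print(keep)
# TRUE, FALSE, TRUE

# na.omit() also works on data frames and drops the same rows
clean_df <- na.omit(df)

# How many rows did we lose?
rows_dropped <- nrow(df) - nrow(clean_df)
print(rows_dropped)
# 1
```

If rows_dropped is a big share of your data, one of the replacement strategies may be a better fit.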
Strategy 2: Replace with a Fixed Value
Fill the gaps with a specific number:
temps <- c(72, NA, 68, 75, NA, 70)
# Replace NA with 0
temps[is.na(temps)] <- 0
# Or use tidyr's replace_na
library(tidyr)
temps <- c(72, NA, 68, 75, NA, 70)
replace_na(temps, 0)
# 72, 0, 68, 75, 0, 70
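A fixed value changes your numbers, so pick one that makes sense for the data. As a rough illustration with the same temps vector, filling with 0 pulls the average far below the temperatures that were actually recorded:

```r
temps <- c(72, NA, 68, 75, NA, 70)

# Average of the values that were actually recorded
mean_observed <- mean(temps, na.rm = TRUE)
print(mean_observed)
# 71.25

# Average after treating every gap as a reading of 0
temps_zero <- temps
temps_zero[is.na(temps_zero)] <- 0
mean_zero_filled <- mean(temps_zero)
print(mean_zero_filled)
# 47.5
```

Zero is usually a better fit for counts (for example, "no purchases") than for measurements like temperature.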
Strategy 3: Replace with Mean/Median
Fill gaps with the average value - a smart guess!
temps <- c(72, NA, 68, 75, NA, 70)
# Calculate mean (ignoring NAs)
avg_temp <- mean(temps, na.rm = TRUE)
# avg_temp = 71.25
# Fill NAs with average
temps[is.na(temps)] <- avg_temp
# 72, 71.25, 68, 75, 71.25, 70
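The median works the same way. Here is a short sketch with the same temps vector (mid_temp is just an illustrative name). The median is the middle value, so one unusually high or low reading moves it less than it moves the mean.

```r
temps <- c(72, NA, 68, 75, NA, 70)

# Calculate the median (ignoring NAs)
mid_temp <- median(temps, na.rm = TRUE)
print(mid_temp)
# 71

# Fill NAs with the median
temps[is.na(temps)] <- mid_temp
print(temps)
# 72, 71, 68, 75, 71, 70
```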
The na.rm = TRUE trick:
Many R functions (like mean(), sum(), and max()) don’t crash on NAs. They quietly return NA instead. Add na.rm = TRUE to ignore the missing values!
mean(c(1, NA, 3)) # NA (not helpful!)
mean(c(1, NA, 3), na.rm = TRUE) # 2 (works!)
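The same argument works in several other base R summary functions. A quick sketch (the variable names are just for illustration):

```r
x <- c(1, NA, 3)

total_with_na <- sum(x)               # NA
total <- sum(x, na.rm = TRUE)         # 4
biggest <- max(x, na.rm = TRUE)       # 3

# Not every function has na.rm: length() counts the NA too
n_all <- length(x)                    # 3

# Count only the values that are present
n_present <- sum(!is.na(x))           # 2

print(c(total, biggest, n_all, n_present))
```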
Strategy 4: Forward/Backward Fill
Use the previous (or next) value to fill gaps:
library(tidyr)
temps <- c(72, NA, NA, 75, NA, 70)
# Fill with previous value
fill(data.frame(t = temps), t,
     .direction = "down")
# 72, 72, 72, 75, 75, 70
# Fill with next value
fill(data.frame(t = temps), t,
     .direction = "up")
# 72, 75, 75, 75, 70, 70
When to use: Time series data where values flow naturally.
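If you want to see how a forward fill works under the hood, here is a base R sketch. The helper fill_down() is a made-up name for illustration, not a function from tidyr or any other package.

```r
# Hypothetical helper: carry the last seen value forward over NAs
fill_down <- function(x) {
  # For each position, count how many real values we have seen so far
  seen_so_far <- cumsum(!is.na(x))
  # Put an NA in front so positions before the first real value stay NA
  known_values <- c(NA, x[!is.na(x)])
  known_values[seen_so_far + 1]
}

temps <- c(72, NA, NA, 75, NA, 70)
filled <- fill_down(temps)
print(filled)
# 72, 72, 72, 75, 75, 70

# A gap at the very start has no previous value, so it stays NA
leading_gap <- fill_down(c(NA, 1, NA))
print(leading_gap)
# NA, 1, 1
```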
🧩 Putting It All Together
Here’s a real example combining both skills:
library(dplyr)
library(tidyr)
# Messy student data
students <- data.frame(
  name = c("Ava", "Ben", "Cara", "Dan"),
  score = c(85, NA, 72, 91)
)
# Step 1: Handle missing scores
students <- students %>%
  mutate(score = replace_na(
    score,
    mean(score, na.rm = TRUE)
  ))
# Step 2: Bin into grade groups
students <- students %>%
  mutate(grade = cut(
    score,
    breaks = c(0, 60, 70, 80, 90, 100),
    labels = c("F","D","C","B","A")
  ))
print(students)
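To double-check the result by hand, here are the same two steps as a base R sketch (no packages needed). It is one possible way to write them, and known_mean is just an illustrative name.

```r
students <- data.frame(
  name  = c("Ava", "Ben", "Cara", "Dan"),
  score = c(85, NA, 72, 91)
)

# Step 1: fill the missing score with the mean of the known scores
known_mean <- mean(students$score, na.rm = TRUE)
students$score[is.na(students$score)] <- known_mean

# Step 2: bin the scores into letter grades
students$grade <- cut(
  students$score,
  breaks = c(0, 60, 70, 80, 90, 100),
  labels = c("F", "D", "C", "B", "A")
)

print(students)
# Ben's score becomes about 82.67 (the mean of 85, 72, and 91)
# Grades: Ava B, Ben B, Cara C, Dan A
```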
🎯 Quick Reference
| Task | Function | Example |
|---|---|---|
| Create bins | cut() | cut(x, breaks=c(0,50,100)) |
| Equal groups | ntile() | ntile(x, 4) |
| Find NAs | is.na() | is.na(x) |
| Remove NAs | na.omit() | na.omit(x) |
| Replace NAs | replace_na() | replace_na(x, 0) |
| Ignore NAs | na.rm=TRUE | mean(x, na.rm=TRUE) |
🏆 You Did It!
You now have two powerful data cleaning superpowers:
✅ Cutting & Binning - Transform messy numbers into organized groups
✅ Missing Value Handling - Find and fix the gaps in your data
Remember: Clean data = Happy analysis! 🎉
Just like cleaning your room makes it easier to find your toys, cleaning your data makes it easier to find insights!
