๐งน The Data Kitchen: Mastering dplyr
Imagine you have a messy kitchen full of ingredients. dplyr is your magical chefโs toolkit that helps you pick, clean, organize, and combine ingredients to cook amazing data dishes!
๐ฏ What is dplyr?
Think of dplyr like a super-organized chef in a kitchen:
- The kitchen is your data table (rows and columns)
- Each row is one dish youโre preparing
- Each column is an ingredient or detail about the dish
The chef (dplyr) has special tools to:
- Pick only the ingredients you need
- Keep only the dishes that match your order
- Add new ingredients to dishes
- Sort dishes in order
- Count how many dishes you made
- Combine ingredients from different shelves
library(tidyverse) # Load dplyr and friends
๐ง SELECT: Pick Your Columns
The Ingredient Picker ๐ฅ
Imagine a table with 100 columns. You only need 3. SELECT is like saying: โGive me just the carrots, onions, and tomatoes!โ
Basic Select
# Original: name, age, city, job, salary
data %>%
select(name, age, city)
# Result: just name, age, city
Select Multiple at Once
# Pick a range of columns
data %>%
select(name:city)
# Pick all EXCEPT some
data %>%
select(-salary, -job)
Smart Selection Helpers
# Columns that START with something
select(starts_with("date"))
# Columns that END with something
select(ends_with("_id"))
# Columns that CONTAIN something
select(contains("price"))
# Pick by position
select(1:3) # First 3 columns
graph TD A["Original Table: 10 columns"] --> B["select name, age"] B --> C["Result: 2 columns"] style C fill:#90EE90
๐ FILTER: Keep Only What You Want
The Order Checker ๐
A restaurant gets many orders. FILTER is like saying: โShow me only the pizza orders!โ or โOnly orders over $20!โ
Basic Filter
# Keep rows where age is above 18
data %>%
filter(age > 18)
# Keep rows where city is exactly "Paris"
data %>%
filter(city == "Paris")
Multiple Conditions
# AND: both must be true
data %>%
filter(age > 18, city == "Paris")
# Same thing with &
data %>%
filter(age > 18 & city == "Paris")
# OR: either can be true
data %>%
filter(city == "Paris" | city == "London")
Handy Filter Tricks
# Check if value is in a list
data %>%
filter(city %in% c("Paris", "London", "Tokyo"))
# Remove missing values
data %>%
filter(!is.na(age))
# Between two values
data %>%
filter(between(age, 18, 65))
graph TD A["100 rows of data"] --> B{filter age > 18} B -->|Keep| C["45 rows match"] B -->|Remove| D["55 rows gone"] style C fill:#90EE90 style D fill:#FFB6C1
โจ MUTATE: Create New Columns
The Recipe Inventor ๐จโ๐ณ
You have flour and water. MUTATE lets you create something new: dough! It takes existing columns and makes new ones.
Basic Mutate
# Create a new column
data %>%
mutate(age_in_months = age * 12)
# Create multiple at once
data %>%
mutate(
age_in_months = age * 12,
is_adult = age >= 18
)
Mutate with Conditions
# Use if_else for simple yes/no
data %>%
mutate(
status = if_else(age >= 18, "Adult", "Minor")
)
# Use case_when for multiple options
data %>%
mutate(
age_group = case_when(
age < 13 ~ "Child",
age < 20 ~ "Teen",
age < 60 ~ "Adult",
TRUE ~ "Senior"
)
)
Modify Existing Columns
# Overwrite a column
data %>%
mutate(name = toupper(name))
# Round numbers
data %>%
mutate(price = round(price, 2))
๐ ARRANGE: Sort Your Data
The Organizer ๐
Like sorting books on a shelf! ARRANGE puts your rows in order.
Basic Arrange
# Sort A to Z (ascending)
data %>%
arrange(name)
# Sort Z to A (descending)
data %>%
arrange(desc(name))
Multiple Sort Levels
# Sort by city, then by age within each city
data %>%
arrange(city, age)
# City A-Z, but age highest first
data %>%
arrange(city, desc(age))
graph TD A["Unsorted Data"] --> B["arrange by age"] B --> C["10, 15, 22, 35, 48"] A --> D["arrange desc age"] D --> E["48, 35, 22, 15, 10"] style C fill:#90EE90 style E fill:#87CEEB
๐ท๏ธ RENAME: Give Columns New Names
The Label Maker โ๏ธ
Sometimes column names are confusing. RENAME lets you give them better names!
Basic Rename
# New name = old name
data %>%
rename(customer_name = name)
# Rename multiple
data %>%
rename(
customer_name = name,
customer_age = age
)
Rename with Patterns
# Add prefix to all columns
data %>%
rename_with(~ paste0("col_", .))
# Make all lowercase
data %>%
rename_with(tolower)
# Replace spaces with underscores
data %>%
rename_with(~ gsub(" ", "_", .))
๐ข COUNT: How Many of Each?
The Tally Keeper ๐
How many pizzas did we sell? How many customers from each city? COUNT tells you!
Basic Count
# Count rows for each city
data %>%
count(city)
# Result: city, n (number)
# Count combinations
data %>%
count(city, is_adult)
Count with Sorting
# Most common first
data %>%
count(city, sort = TRUE)
# Custom column name
data %>%
count(city, name = "total")
Add Count to Data
# Keep all rows, add count column
data %>%
add_count(city, name = "city_total")
๐ SUMMARIZE: Calculate Totals
The Accountant ๐ฐ
โWhatโs the total sales?โ โWhatโs the average age?โ SUMMARIZE crunches many rows into one answer.
Basic Summarize
# One number from many rows
data %>%
summarize(
total_sales = sum(sales),
avg_age = mean(age),
count = n()
)
Common Summary Functions
data %>%
summarize(
total = sum(x), # Add up
average = mean(x), # Average
middle = median(x), # Middle value
smallest = min(x), # Minimum
biggest = max(x), # Maximum
spread = sd(x), # Spread out?
how_many = n(), # Count rows
unique = n_distinct(x) # Unique values
)
๐ฅ GROUP BY: Divide and Conquer
The Team Captain ๐
Instead of one total for everyone, get totals for each GROUP! Like: โSales per cityโ or โAverage age per country.โ
Basic Group By
# Total sales by city
data %>%
group_by(city) %>%
summarize(total = sum(sales))
Multiple Groups
# Sales by city AND year
data %>%
group_by(city, year) %>%
summarize(
total = sum(sales),
avg = mean(sales)
)
Donโt Forget to Ungroup!
# After grouping, ungroup for safety
data %>%
group_by(city) %>%
summarize(total = sum(sales)) %>%
ungroup()
graph TD A["All Data"] --> B["group_by city"] B --> C["Paris group"] B --> D["London group"] B --> E["Tokyo group"] C --> F["summarize each"] D --> F E --> F F --> G["One row per city"] style G fill:#90EE90
๐ ACROSS: Apply to Many Columns
The Multi-Tool ๐ ๏ธ
Want to do the same thing to many columns at once? ACROSS is your friend!
Basic Across
# Round all numeric columns
data %>%
mutate(across(where(is.numeric), round))
# Convert selected columns to uppercase
data %>%
mutate(across(c(name, city), toupper))
Across with Selection Helpers
# All columns starting with "price"
data %>%
mutate(across(starts_with("price"), ~ . * 1.1))
# All character columns
data %>%
mutate(across(where(is.character), tolower))
Across in Summarize
# Mean of all numeric columns
data %>%
summarize(across(where(is.numeric), mean))
# Multiple summaries per column
data %>%
summarize(across(
c(sales, profit),
list(total = sum, avg = mean)
))
๐ TABLE JOINS: Combine Two Tables
The Matchmaker ๐
You have customers in one table and orders in another. JOINS connect them!
Left Join (Most Common)
# Keep ALL from left, match from right
customers %>%
left_join(orders, by = "customer_id")
Right Join
# Keep ALL from right, match from left
customers %>%
right_join(orders, by = "customer_id")
Inner Join
# Keep ONLY matching rows
customers %>%
inner_join(orders, by = "customer_id")
Full Join
# Keep EVERYTHING from both
customers %>%
full_join(orders, by = "customer_id")
Join by Different Column Names
# When names don't match
data1 %>%
left_join(data2, by = c("id" = "customer_id"))
graph TD A["Customers Table"] --> E["LEFT JOIN"] B["Orders Table"] --> E E --> F["All customers + their orders"] A --> G["INNER JOIN"] B --> G G --> H["Only customers with orders"] style F fill:#90EE90 style H fill:#87CEEB
๐ ROW OPERATIONS: Work with Rows
The Row Specialist ๐
Sometimes you need to slice, number, or work with specific rows.
Slice: Pick Specific Rows
# First 5 rows
data %>%
slice(1:5)
# Last 3 rows
data %>%
slice_tail(n = 3)
# Top 10 by sales
data %>%
slice_max(sales, n = 10)
# Bottom 5 by price
data %>%
slice_min(price, n = 5)
# Random sample
data %>%
slice_sample(n = 100)
Row Numbers
# Add row number column
data %>%
mutate(row_num = row_number())
# Number within groups
data %>%
group_by(city) %>%
mutate(rank = row_number())
Distinct Rows
# Remove duplicate rows
data %>%
distinct()
# Distinct by specific columns
data %>%
distinct(city, .keep_all = TRUE)
Row-wise Operations
# Calculate across columns per row
data %>%
rowwise() %>%
mutate(row_sum = sum(c(col1, col2, col3)))
๐ช Putting It All Together
Hereโs a real example using everything:
library(tidyverse)
# Start with raw sales data
sales_data %>%
# Pick columns we need
select(date, store, product, quantity, price) %>%
# Keep only 2024 data
filter(year(date) == 2024) %>%
# Calculate total per row
mutate(total = quantity * price) %>%
# Group by store
group_by(store) %>%
# Get summaries
summarize(
orders = n(),
revenue = sum(total),
avg_order = mean(total)
) %>%
# Sort by revenue
arrange(desc(revenue)) %>%
# Clean column names
rename(
store_name = store,
total_revenue = revenue
)
๐ Quick Reference
| Task | Function | Example |
|---|---|---|
| Pick columns | select() |
select(name, age) |
| Filter rows | filter() |
filter(age > 18) |
| Create columns | mutate() |
mutate(new = a + b) |
| Sort rows | arrange() |
arrange(desc(date)) |
| Rename columns | rename() |
rename(new = old) |
| Count groups | count() |
count(city) |
| Summarize | summarize() |
summarize(total = sum(x)) |
| Group data | group_by() |
group_by(city) |
| Apply to many | across() |
across(where(is.numeric), mean) |
| Join tables | left_join() |
left_join(b, by = "id") |
| Slice rows | slice() |
slice(1:10) |
๐ You Did It!
Youโve learned dplyrโs core tools. Now you can:
- โ Pick exactly the columns you need
- โ Filter to find the rows that matter
- โ Create new columns with calculations
- โ Sort data any way you want
- โ Rename confusing columns
- โ Count and summarize groups
- โ Apply operations to many columns
- โ Join multiple tables together
- โ Work with specific rows
Remember: The pipe %>% is your best friend. It reads like a recipe: โTake this, THEN do that, THEN do thisโฆโ
Happy data wrangling! ๐
