dplyr

Back

Loading concept...

๐Ÿงน The Data Kitchen: Mastering dplyr

Imagine you have a messy kitchen full of ingredients. dplyr is your magical chefโ€™s toolkit that helps you pick, clean, organize, and combine ingredients to cook amazing data dishes!


๐ŸŽฏ What is dplyr?

Think of dplyr like a super-organized chef in a kitchen:

  • The kitchen is your data table (rows and columns)
  • Each row is one dish youโ€™re preparing
  • Each column is an ingredient or detail about the dish

The chef (dplyr) has special tools to:

  • Pick only the ingredients you need
  • Keep only the dishes that match your order
  • Add new ingredients to dishes
  • Sort dishes in order
  • Count how many dishes you made
  • Combine ingredients from different shelves
library(tidyverse)  # Load dplyr and friends

๐Ÿ”ง SELECT: Pick Your Columns

The Ingredient Picker ๐Ÿฅ•

Imagine a table with 100 columns. You only need 3. SELECT is like saying: โ€œGive me just the carrots, onions, and tomatoes!โ€

Basic Select

# Original: name, age, city, job, salary
data %>%
  select(name, age, city)
# Result: just name, age, city

Select Multiple at Once

# Pick a range of columns
data %>%
  select(name:city)

# Pick all EXCEPT some
data %>%
  select(-salary, -job)

Smart Selection Helpers

# Columns that START with something
select(starts_with("date"))

# Columns that END with something
select(ends_with("_id"))

# Columns that CONTAIN something
select(contains("price"))

# Pick by position
select(1:3)  # First 3 columns
graph TD A["Original Table: 10 columns"] --> B["select name, age"] B --> C["Result: 2 columns"] style C fill:#90EE90

๐Ÿ” FILTER: Keep Only What You Want

The Order Checker ๐Ÿ“‹

A restaurant gets many orders. FILTER is like saying: โ€œShow me only the pizza orders!โ€ or โ€œOnly orders over $20!โ€

Basic Filter

# Keep rows where age is above 18
data %>%
  filter(age > 18)

# Keep rows where city is exactly "Paris"
data %>%
  filter(city == "Paris")

Multiple Conditions

# AND: both must be true
data %>%
  filter(age > 18, city == "Paris")

# Same thing with &
data %>%
  filter(age > 18 & city == "Paris")

# OR: either can be true
data %>%
  filter(city == "Paris" | city == "London")

Handy Filter Tricks

# Check if value is in a list
data %>%
  filter(city %in% c("Paris", "London", "Tokyo"))

# Remove missing values
data %>%
  filter(!is.na(age))

# Between two values
data %>%
  filter(between(age, 18, 65))
graph TD A["100 rows of data"] --> B{filter age > 18} B -->|Keep| C["45 rows match"] B -->|Remove| D["55 rows gone"] style C fill:#90EE90 style D fill:#FFB6C1

โœจ MUTATE: Create New Columns

The Recipe Inventor ๐Ÿ‘จโ€๐Ÿณ

You have flour and water. MUTATE lets you create something new: dough! It takes existing columns and makes new ones.

Basic Mutate

# Create a new column
data %>%
  mutate(age_in_months = age * 12)

# Create multiple at once
data %>%
  mutate(
    age_in_months = age * 12,
    is_adult = age >= 18
  )

Mutate with Conditions

# Use if_else for simple yes/no
data %>%
  mutate(
    status = if_else(age >= 18, "Adult", "Minor")
  )

# Use case_when for multiple options
data %>%
  mutate(
    age_group = case_when(
      age < 13 ~ "Child",
      age < 20 ~ "Teen",
      age < 60 ~ "Adult",
      TRUE ~ "Senior"
    )
  )

Modify Existing Columns

# Overwrite a column
data %>%
  mutate(name = toupper(name))

# Round numbers
data %>%
  mutate(price = round(price, 2))

๐Ÿ“Š ARRANGE: Sort Your Data

The Organizer ๐Ÿ“š

Like sorting books on a shelf! ARRANGE puts your rows in order.

Basic Arrange

# Sort A to Z (ascending)
data %>%
  arrange(name)

# Sort Z to A (descending)
data %>%
  arrange(desc(name))

Multiple Sort Levels

# Sort by city, then by age within each city
data %>%
  arrange(city, age)

# City A-Z, but age highest first
data %>%
  arrange(city, desc(age))
graph TD A["Unsorted Data"] --> B["arrange by age"] B --> C["10, 15, 22, 35, 48"] A --> D["arrange desc age"] D --> E["48, 35, 22, 15, 10"] style C fill:#90EE90 style E fill:#87CEEB

๐Ÿท๏ธ RENAME: Give Columns New Names

The Label Maker โœ๏ธ

Sometimes column names are confusing. RENAME lets you give them better names!

Basic Rename

# New name = old name
data %>%
  rename(customer_name = name)

# Rename multiple
data %>%
  rename(
    customer_name = name,
    customer_age = age
  )

Rename with Patterns

# Add prefix to all columns
data %>%
  rename_with(~ paste0("col_", .))

# Make all lowercase
data %>%
  rename_with(tolower)

# Replace spaces with underscores
data %>%
  rename_with(~ gsub(" ", "_", .))

๐Ÿ”ข COUNT: How Many of Each?

The Tally Keeper ๐Ÿ“

How many pizzas did we sell? How many customers from each city? COUNT tells you!

Basic Count

# Count rows for each city
data %>%
  count(city)
# Result: city, n (number)

# Count combinations
data %>%
  count(city, is_adult)

Count with Sorting

# Most common first
data %>%
  count(city, sort = TRUE)

# Custom column name
data %>%
  count(city, name = "total")

Add Count to Data

# Keep all rows, add count column
data %>%
  add_count(city, name = "city_total")

๐Ÿ“ˆ SUMMARIZE: Calculate Totals

The Accountant ๐Ÿ’ฐ

โ€œWhatโ€™s the total sales?โ€ โ€œWhatโ€™s the average age?โ€ SUMMARIZE crunches many rows into one answer.

Basic Summarize

# One number from many rows
data %>%
  summarize(
    total_sales = sum(sales),
    avg_age = mean(age),
    count = n()
  )

Common Summary Functions

data %>%
  summarize(
    total = sum(x),      # Add up
    average = mean(x),   # Average
    middle = median(x),  # Middle value
    smallest = min(x),   # Minimum
    biggest = max(x),    # Maximum
    spread = sd(x),      # Spread out?
    how_many = n(),      # Count rows
    unique = n_distinct(x)  # Unique values
  )

๐Ÿ‘ฅ GROUP BY: Divide and Conquer

The Team Captain ๐Ÿ†

Instead of one total for everyone, get totals for each GROUP! Like: โ€œSales per cityโ€ or โ€œAverage age per country.โ€

Basic Group By

# Total sales by city
data %>%
  group_by(city) %>%
  summarize(total = sum(sales))

Multiple Groups

# Sales by city AND year
data %>%
  group_by(city, year) %>%
  summarize(
    total = sum(sales),
    avg = mean(sales)
  )

Donโ€™t Forget to Ungroup!

# After grouping, ungroup for safety
data %>%
  group_by(city) %>%
  summarize(total = sum(sales)) %>%
  ungroup()
graph TD A["All Data"] --> B["group_by city"] B --> C["Paris group"] B --> D["London group"] B --> E["Tokyo group"] C --> F["summarize each"] D --> F E --> F F --> G["One row per city"] style G fill:#90EE90

๐Ÿ”„ ACROSS: Apply to Many Columns

The Multi-Tool ๐Ÿ› ๏ธ

Want to do the same thing to many columns at once? ACROSS is your friend!

Basic Across

# Round all numeric columns
data %>%
  mutate(across(where(is.numeric), round))

# Convert selected columns to uppercase
data %>%
  mutate(across(c(name, city), toupper))

Across with Selection Helpers

# All columns starting with "price"
data %>%
  mutate(across(starts_with("price"), ~ . * 1.1))

# All character columns
data %>%
  mutate(across(where(is.character), tolower))

Across in Summarize

# Mean of all numeric columns
data %>%
  summarize(across(where(is.numeric), mean))

# Multiple summaries per column
data %>%
  summarize(across(
    c(sales, profit),
    list(total = sum, avg = mean)
  ))

๐Ÿ”— TABLE JOINS: Combine Two Tables

The Matchmaker ๐Ÿ’‘

You have customers in one table and orders in another. JOINS connect them!

Left Join (Most Common)

# Keep ALL from left, match from right
customers %>%
  left_join(orders, by = "customer_id")

Right Join

# Keep ALL from right, match from left
customers %>%
  right_join(orders, by = "customer_id")

Inner Join

# Keep ONLY matching rows
customers %>%
  inner_join(orders, by = "customer_id")

Full Join

# Keep EVERYTHING from both
customers %>%
  full_join(orders, by = "customer_id")

Join by Different Column Names

# When names don't match
data1 %>%
  left_join(data2, by = c("id" = "customer_id"))
graph TD A["Customers Table"] --> E["LEFT JOIN"] B["Orders Table"] --> E E --> F["All customers + their orders"] A --> G["INNER JOIN"] B --> G G --> H["Only customers with orders"] style F fill:#90EE90 style H fill:#87CEEB

๐Ÿ“ ROW OPERATIONS: Work with Rows

The Row Specialist ๐Ÿ“‹

Sometimes you need to slice, number, or work with specific rows.

Slice: Pick Specific Rows

# First 5 rows
data %>%
  slice(1:5)

# Last 3 rows
data %>%
  slice_tail(n = 3)

# Top 10 by sales
data %>%
  slice_max(sales, n = 10)

# Bottom 5 by price
data %>%
  slice_min(price, n = 5)

# Random sample
data %>%
  slice_sample(n = 100)

Row Numbers

# Add row number column
data %>%
  mutate(row_num = row_number())

# Number within groups
data %>%
  group_by(city) %>%
  mutate(rank = row_number())

Distinct Rows

# Remove duplicate rows
data %>%
  distinct()

# Distinct by specific columns
data %>%
  distinct(city, .keep_all = TRUE)

Row-wise Operations

# Calculate across columns per row
data %>%
  rowwise() %>%
  mutate(row_sum = sum(c(col1, col2, col3)))

๐ŸŽช Putting It All Together

Hereโ€™s a real example using everything:

library(tidyverse)

# Start with raw sales data
sales_data %>%
  # Pick columns we need
  select(date, store, product, quantity, price) %>%
  # Keep only 2024 data
  filter(year(date) == 2024) %>%
  # Calculate total per row
  mutate(total = quantity * price) %>%
  # Group by store
  group_by(store) %>%
  # Get summaries
  summarize(
    orders = n(),
    revenue = sum(total),
    avg_order = mean(total)
  ) %>%
  # Sort by revenue
  arrange(desc(revenue)) %>%
  # Clean column names
  rename(
    store_name = store,
    total_revenue = revenue
  )

๐ŸŒŸ Quick Reference

Task Function Example
Pick columns select() select(name, age)
Filter rows filter() filter(age > 18)
Create columns mutate() mutate(new = a + b)
Sort rows arrange() arrange(desc(date))
Rename columns rename() rename(new = old)
Count groups count() count(city)
Summarize summarize() summarize(total = sum(x))
Group data group_by() group_by(city)
Apply to many across() across(where(is.numeric), mean)
Join tables left_join() left_join(b, by = "id")
Slice rows slice() slice(1:10)

๐ŸŽ‰ You Did It!

Youโ€™ve learned dplyrโ€™s core tools. Now you can:

  • โœ… Pick exactly the columns you need
  • โœ… Filter to find the rows that matter
  • โœ… Create new columns with calculations
  • โœ… Sort data any way you want
  • โœ… Rename confusing columns
  • โœ… Count and summarize groups
  • โœ… Apply operations to many columns
  • โœ… Join multiple tables together
  • โœ… Work with specific rows

Remember: The pipe %>% is your best friend. It reads like a recipe: โ€œTake this, THEN do that, THEN do thisโ€ฆโ€

Happy data wrangling! ๐Ÿš€

Loading story...

Story - Premium Content

Please sign in to view this story and start learning.

Upgrade to Premium to unlock full access to all stories.

Stay Tuned!

Story is coming soon.

Story Preview

Story - Premium Content

Please sign in to view this concept and start learning.

Upgrade to Premium to unlock full access to all content.