dplyr is an R package for data manipulation. It provides verbs like select, filter, and mutate to pick columns, filter rows, and create new data.

How does the pipe operator work in dplyr?

The pipe %>% chains operations together like a recipe. It passes output from one function as input to the next, making code readable.

What's the difference between select and filter in dplyr?

Select picks columns (like choosing ingredients), while filter keeps rows matching conditions (like finding orders over $20).

How do table joins work in dplyr?

Joins combine two tables by matching columns. Left join keeps all left rows, inner join keeps only matches, full join keeps everything.

dplyr in R | Data Manipulation Guide

🧹 The Data Kitchen: Mastering dplyr

Imagine you have a messy kitchen full of ingredients. dplyr is your magical chef’s toolkit that helps you pick, clean, organize, and combine ingredients to cook amazing data dishes!

🎯 What is dplyr?

Think of dplyr like a super-organized chef in a kitchen:

The kitchen is your data table (rows and columns)
Each row is one dish you’re preparing
Each column is an ingredient or detail about the dish

The chef (dplyr) has special tools to:

Pick only the ingredients you need
Keep only the dishes that match your order
Add new ingredients to dishes
Sort dishes in order
Count how many dishes you made
Combine ingredients from different shelves

library(tidyverse)  # Load dplyr and friends

🔧 SELECT: Pick Your Columns

The Ingredient Picker 🥕

Imagine a table with 100 columns. You only need 3. SELECT is like saying: “Give me just the carrots, onions, and tomatoes!”

Basic Select

# Original: name, age, city, job, salary
data %>%
  select(name, age, city)
# Result: just name, age, city

Select Multiple at Once

# Pick a range of columns
data %>%
  select(name:city)

# Pick all EXCEPT some
data %>%
  select(-salary, -job)

Smart Selection Helpers

# Columns that START with something
select(starts_with("date"))

# Columns that END with something
select(ends_with("_id"))

# Columns that CONTAIN something
select(contains("price"))

# Pick by position
select(1:3)  # First 3 columns

graph TD
    A["Original Table: 10 columns"] --> B["select name, age"]
    B --> C["Result: 2 columns"]
    style C fill:#90EE90

🔍 FILTER: Keep Only What You Want

The Order Checker 📋

A restaurant gets many orders. FILTER is like saying: “Show me only the pizza orders!” or “Only orders over $20!”

Basic Filter

# Keep rows where age is above 18
data %>%
  filter(age > 18)

# Keep rows where city is exactly "Paris"
data %>%
  filter(city == "Paris")

Multiple Conditions

# AND: both must be true
data %>%
  filter(age > 18, city == "Paris")

# Same thing with &
data %>%
  filter(age > 18 & city == "Paris")

# OR: either can be true
data %>%
  filter(city == "Paris" | city == "London")

Handy Filter Tricks

# Check if value is in a list
data %>%
  filter(city %in% c("Paris", "London", "Tokyo"))

# Remove missing values
data %>%
  filter(!is.na(age))

# Between two values
data %>%
  filter(between(age, 18, 65))

graph TD
    A["100 rows of data"] --> B{filter age > 18}
    B -->|Keep| C["45 rows match"]
    B -->|Remove| D["55 rows gone"]
    style C fill:#90EE90
    style D fill:#FFB6C1

✨ MUTATE: Create New Columns

The Recipe Inventor 👨‍🍳

You have flour and water. MUTATE lets you create something new: dough! It takes existing columns and makes new ones.

Basic Mutate

# Create a new column
data %>%
  mutate(age_in_months = age * 12)

# Create multiple at once
data %>%
  mutate(
    age_in_months = age * 12,
    is_adult = age >= 18
  )

Mutate with Conditions

# Use if_else for simple yes/no
data %>%
  mutate(
    status = if_else(age >= 18, "Adult", "Minor")
  )

# Use case_when for multiple options
data %>%
  mutate(
    age_group = case_when(
      age < 13 ~ "Child",
      age < 20 ~ "Teen",
      age < 60 ~ "Adult",
      TRUE ~ "Senior"
    )
  )

Modify Existing Columns

# Overwrite a column
data %>%
  mutate(name = toupper(name))

# Round numbers
data %>%
  mutate(price = round(price, 2))

📊 ARRANGE: Sort Your Data

The Organizer 📚

Like sorting books on a shelf! ARRANGE puts your rows in order.

Basic Arrange

# Sort A to Z (ascending)
data %>%
  arrange(name)

# Sort Z to A (descending)
data %>%
  arrange(desc(name))

Multiple Sort Levels

# Sort by city, then by age within each city
data %>%
  arrange(city, age)

# City A-Z, but age highest first
data %>%
  arrange(city, desc(age))

graph TD
    A["Unsorted Data"] --> B["arrange by age"]
    B --> C["10, 15, 22, 35, 48"]
    A --> D["arrange desc age"]
    D --> E["48, 35, 22, 15, 10"]
    style C fill:#90EE90
    style E fill:#87CEEB

🏷️ RENAME: Give Columns New Names

The Label Maker ✏️

Sometimes column names are confusing. RENAME lets you give them better names!

Basic Rename

# New name = old name
data %>%
  rename(customer_name = name)

# Rename multiple
data %>%
  rename(
    customer_name = name,
    customer_age = age
  )

Rename with Patterns

# Add prefix to all columns
data %>%
  rename_with(~ paste0("col_", .))

# Make all lowercase
data %>%
  rename_with(tolower)

# Replace spaces with underscores
data %>%
  rename_with(~ gsub(" ", "_", .))

🔢 COUNT: How Many of Each?

The Tally Keeper 📝

How many pizzas did we sell? How many customers from each city? COUNT tells you!

Basic Count

# Count rows for each city
data %>%
  count(city)
# Result: city, n (number)

# Count combinations
data %>%
  count(city, is_adult)

Count with Sorting

# Most common first
data %>%
  count(city, sort = TRUE)

# Custom column name
data %>%
  count(city, name = "total")

Add Count to Data

# Keep all rows, add count column
data %>%
  add_count(city, name = "city_total")

📈 SUMMARIZE: Calculate Totals

The Accountant 💰

“What’s the total sales?” “What’s the average age?” SUMMARIZE crunches many rows into one answer.

Basic Summarize

# One number from many rows
data %>%
  summarize(
    total_sales = sum(sales),
    avg_age = mean(age),
    count = n()
  )

Common Summary Functions

data %>%
  summarize(
    total = sum(x),      # Add up
    average = mean(x),   # Average
    middle = median(x),  # Middle value
    smallest = min(x),   # Minimum
    biggest = max(x),    # Maximum
    spread = sd(x),      # Spread out?
    how_many = n(),      # Count rows
    unique = n_distinct(x)  # Unique values
  )

👥 GROUP BY: Divide and Conquer

The Team Captain 🏆

Instead of one total for everyone, get totals for each GROUP! Like: “Sales per city” or “Average age per country.”

Basic Group By

# Total sales by city
data %>%
  group_by(city) %>%
  summarize(total = sum(sales))

Multiple Groups

# Sales by city AND year
data %>%
  group_by(city, year) %>%
  summarize(
    total = sum(sales),
    avg = mean(sales)
  )

Don’t Forget to Ungroup!

# After grouping, ungroup for safety
data %>%
  group_by(city) %>%
  summarize(total = sum(sales)) %>%
  ungroup()

graph TD
    A["All Data"] --> B["group_by city"]
    B --> C["Paris group"]
    B --> D["London group"]
    B --> E["Tokyo group"]
    C --> F["summarize each"]
    D --> F
    E --> F
    F --> G["One row per city"]
    style G fill:#90EE90

🔄 ACROSS: Apply to Many Columns

The Multi-Tool 🛠️

Want to do the same thing to many columns at once? ACROSS is your friend!

Basic Across

# Round all numeric columns
data %>%
  mutate(across(where(is.numeric), round))

# Convert selected columns to uppercase
data %>%
  mutate(across(c(name, city), toupper))

Across with Selection Helpers

# All columns starting with "price"
data %>%
  mutate(across(starts_with("price"), ~ . * 1.1))

# All character columns
data %>%
  mutate(across(where(is.character), tolower))

Across in Summarize

# Mean of all numeric columns
data %>%
  summarize(across(where(is.numeric), mean))

# Multiple summaries per column
data %>%
  summarize(across(
    c(sales, profit),
    list(total = sum, avg = mean)
  ))

🔗 TABLE JOINS: Combine Two Tables

The Matchmaker 💑

You have customers in one table and orders in another. JOINS connect them!

Left Join (Most Common)

# Keep ALL from left, match from right
customers %>%
  left_join(orders, by = "customer_id")

Right Join

# Keep ALL from right, match from left
customers %>%
  right_join(orders, by = "customer_id")

Inner Join

# Keep ONLY matching rows
customers %>%
  inner_join(orders, by = "customer_id")

Full Join

# Keep EVERYTHING from both
customers %>%
  full_join(orders, by = "customer_id")

Join by Different Column Names

# When names don't match
data1 %>%
  left_join(data2, by = c("id" = "customer_id"))

graph TD
    A["Customers Table"] --> E["LEFT JOIN"]
    B["Orders Table"] --> E
    E --> F["All customers + their orders"]

    A --> G["INNER JOIN"]
    B --> G
    G --> H["Only customers with orders"]

    style F fill:#90EE90
    style H fill:#87CEEB

📍 ROW OPERATIONS: Work with Rows

The Row Specialist 📋

Sometimes you need to slice, number, or work with specific rows.

Slice: Pick Specific Rows

# First 5 rows
data %>%
  slice(1:5)

# Last 3 rows
data %>%
  slice_tail(n = 3)

# Top 10 by sales
data %>%
  slice_max(sales, n = 10)

# Bottom 5 by price
data %>%
  slice_min(price, n = 5)

# Random sample
data %>%
  slice_sample(n = 100)

Row Numbers

# Add row number column
data %>%
  mutate(row_num = row_number())

# Number within groups
data %>%
  group_by(city) %>%
  mutate(rank = row_number())

Distinct Rows

# Remove duplicate rows
data %>%
  distinct()

# Distinct by specific columns
data %>%
  distinct(city, .keep_all = TRUE)

Row-wise Operations

# Calculate across columns per row
data %>%
  rowwise() %>%
  mutate(row_sum = sum(c(col1, col2, col3)))

🎪 Putting It All Together

Here’s a real example using everything:

library(tidyverse)

# Start with raw sales data
sales_data %>%
  # Pick columns we need
  select(date, store, product, quantity, price) %>%
  # Keep only 2024 data
  filter(year(date) == 2024) %>%
  # Calculate total per row
  mutate(total = quantity * price) %>%
  # Group by store
  group_by(store) %>%
  # Get summaries
  summarize(
    orders = n(),
    revenue = sum(total),
    avg_order = mean(total)
  ) %>%
  # Sort by revenue
  arrange(desc(revenue)) %>%
  # Clean column names
  rename(
    store_name = store,
    total_revenue = revenue
  )

🌟 Quick Reference

Task	Function	Example
Pick columns	`select()`	`select(name, age)`
Filter rows	`filter()`	`filter(age > 18)`
Create columns	`mutate()`	`mutate(new = a + b)`
Sort rows	`arrange()`	`arrange(desc(date))`
Rename columns	`rename()`	`rename(new = old)`
Count groups	`count()`	`count(city)`
Summarize	`summarize()`	`summarize(total = sum(x))`
Group data	`group_by()`	`group_by(city)`
Apply to many	`across()`	`across(where(is.numeric), mean)`
Join tables	`left_join()`	`left_join(b, by = "id")`
Slice rows	`slice()`	`slice(1:10)`

🎉 You Did It!

You’ve learned dplyr’s core tools. Now you can:

✅ Pick exactly the columns you need
✅ Filter to find the rows that matter
✅ Create new columns with calculations
✅ Sort data any way you want
✅ Rename confusing columns
✅ Count and summarize groups
✅ Apply operations to many columns
✅ Join multiple tables together
✅ Work with specific rows

Remember: The pipe %>% is your best friend. It reads like a recipe: “Take this, THEN do that, THEN do this…”

Happy data wrangling! 🚀

dplyr

Unable to load concept

Coming Soon...

🧹 The Data Kitchen: Mastering dplyr

🎯 What is dplyr?

🔧 SELECT: Pick Your Columns

Basic Select

Select Multiple at Once

Smart Selection Helpers

🔍 FILTER: Keep Only What You Want

Basic Filter

Multiple Conditions

Handy Filter Tricks

✨ MUTATE: Create New Columns

Basic Mutate

Mutate with Conditions

Modify Existing Columns

📊 ARRANGE: Sort Your Data

Basic Arrange

Multiple Sort Levels

🏷️ RENAME: Give Columns New Names

Basic Rename

Rename with Patterns

🔢 COUNT: How Many of Each?

Basic Count

Count with Sorting

Add Count to Data

📈 SUMMARIZE: Calculate Totals

Basic Summarize

Common Summary Functions

👥 GROUP BY: Divide and Conquer

Basic Group By

Multiple Groups

Don’t Forget to Ungroup!

🔄 ACROSS: Apply to Many Columns

Basic Across

Across with Selection Helpers

Across in Summarize

🔗 TABLE JOINS: Combine Two Tables

Left Join (Most Common)

Right Join

Inner Join

Full Join

Join by Different Column Names

📍 ROW OPERATIONS: Work with Rows

Slice: Pick Specific Rows

Row Numbers

Distinct Rows

Row-wise Operations

🎪 Putting It All Together

🌟 Quick Reference

🎉 You Did It!

Story - Premium Content

Stay Tuned!

Story - Premium Content

Interactive - Premium Content

Interactive - Premium Content

Stay Tuned!

Cheatsheet - Premium Content

Cheatsheet - Premium Content

Stay Tuned!

Quiz - Premium Content

Quiz - Premium Content

Stay Tuned!

Flashcard - Premium Content

Flashcard - Premium Content

Stay Tuned!

Sign in Required

Report an Issue