Unsupervised Learning


🎭 The Sorting Hat for Data: Unsupervised Learning in R

Imagine you have a giant box of LEGO pieces: all different colors, shapes, and sizes. Nobody tells you how to sort them. But somehow, you figure out which pieces belong together. That’s exactly what unsupervised learning does with data!


🌟 The Big Picture

Unsupervised learning is like being a detective with no clues about what you’re looking for. You just look at the data and find hidden patterns all by yourself.

No labels. No answers. Just discovery.

┌──────────────────────────────────────┐
│  Supervised Learning:                │
│  "Here are cats and dogs. Learn!"    │
├──────────────────────────────────────┤
│  Unsupervised Learning:              │
│  "Here's data. Find patterns!"       │
└──────────────────────────────────────┘

Our Analogy: Think of data as guests at a party. Unsupervised learning figures out who naturally belongs together, without anyone wearing name tags!


πŸ“ Distance Measures: How Close Are Two Points?

Before we group anything, we need to answer: How do we measure "closeness"?

The Ruler for Data

Imagine you’re in a city. How far is the coffee shop?

Euclidean Distance: The bird flies straight there (shortest path)

# Two friends' locations
alice <- c(2, 3)   # x=2, y=3
bob <- c(5, 7)     # x=5, y=7

# How far apart?
dist_euclidean <- sqrt(
  (5-2)^2 + (7-3)^2
)
# Result: 5 units

Manhattan Distance: Walk on streets (like NYC blocks)

# Same friends
dist_manhattan <- abs(5-2) + abs(7-3)
# Result: 7 units (3 blocks + 4 blocks)
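
Both hand calculations can be double-checked with R's built-in dist() function, which accepts a method argument:

# Stack the two points into a matrix, one row per friend
both <- rbind(alice, bob)
dist(both, method = "euclidean")   # 5
dist(both, method = "manhattan")   # 7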

Quick Comparison

| Distance Type | How It Works  | Best For            |
|---------------|---------------|---------------------|
| Euclidean     | Straight line | Continuous data     |
| Manhattan     | Grid path     | City-block problems |
| Cosine        | Angle between | Text similarity     |

R’s Built-in Help

# Create sample data
points <- matrix(
  c(0,0, 3,4, 1,1),
  nrow = 3, byrow = TRUE
)

# Calculate all distances at once
dist(points, method = "euclidean")
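
The cosine measure from the table above isn't one of dist()'s built-in methods, but it's easy to write by hand. A minimal sketch (the helper name cosine_similarity is just for illustration):

# Cosine similarity: 1 = same direction, 0 = unrelated
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Cosine distance is simply 1 - similarity
1 - cosine_similarity(c(1, 2, 0), c(2, 4, 1))  # close to 0: nearly the same direction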

🧭 Principal Component Analysis (PCA)

The Problem: Too Many Ingredients!

Imagine a recipe with 100 ingredients. Most of them barely change the taste. PCA finds the ingredients that matter most.

The Magic Trick

PCA is like photographing a 3D sculpture from its best angle, capturing most of the beauty in just 2D.

graph TD
  A["100 Features"] --> B["PCA Magic"]
  B --> C["2-3 Important Features"]
  C --> D["Same Story, Simpler!"]

How It Works (Simple Version)

  1. Find the direction where data spreads the most
  2. That’s PC1: your first "super-ingredient"
  3. Find the next direction (perpendicular to PC1)
  4. That’s PC2: your second "super-ingredient"
  5. Keep going until you’ve captured enough variation

PCA in R

# Sample: Student scores in 5 subjects
scores <- data.frame(
  math = c(90,85,75,95,70),
  physics = c(88,82,78,92,68),
  chemistry = c(85,80,72,90,65),
  english = c(70,75,85,65,90),
  history = c(72,78,88,60,92)
)

# Apply PCA
pca_result <- prcomp(
  scores,
  scale. = TRUE  # Standardize!
)

# See importance
summary(pca_result)

What You Get

# PC1 might be "Science ability"
# PC2 might be "Humanities ability"
# Two numbers tell the whole story!
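
One way to check that story is to look at the loadings (pca_result$rotation), which show how strongly each subject contributes to each component, and the scores (pca_result$x), which place each student on the new axes:

# Loadings: contribution of each subject to PC1 and PC2
round(pca_result$rotation[, 1:2], 2)

# Scores: each student's position on PC1 and PC2
round(pca_result$x[, 1:2], 2)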

🎯 Key Insight

If PC1 + PC2 explain 90% of variation, you can ignore the other components. That’s the power of dimensionality reduction!
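
You can read that percentage straight off the prcomp() result:

# Share of variance explained by each PC, then the running total
var_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
round(cumsum(var_explained), 3)  # stop once the running total is high enough (e.g. ~0.9)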


🎯 K-Means Clustering

The Party Organizer

Imagine you’re organizing a party with 50 guests. You want to seat them at K tables so similar people sit together.

K-Means does exactly this!

The Algorithm Dance

graph TD
  A["Step 1: Pick K random centers"] --> B["Step 2: Assign each point to nearest center"]
  B --> C["Step 3: Move center to middle of its group"]
  C --> D{Points moved?}
  D -->|Yes| B
  D -->|No| E["Done! K clusters found"]

K-Means in R

# Customer data: age and spending
customers <- data.frame(
  age = c(25,30,35,50,55,60,22,28),
  spending = c(200,220,180,400,450,380,190,210)
)

# Create 2 groups
km <- kmeans(customers, centers = 2)

# See who's in which group
km$cluster
# Maybe: Young Savers vs. Mature Spenders

Choosing K: The Elbow Method

How many groups? Look for the "elbow" in the plot!

# Try different K values
wss <- sapply(1:6, function(k) {
  kmeans(customers, k)$tot.withinss
})

# Plot it
plot(1:6, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Total Within Sum of Squares")
# Look for the bend!

⚠️ Watch Out!

| Problem                 | Solution           |
|-------------------------|--------------------|
| Results change each run | Set set.seed(123)  |
| Different scales        | Standardize first! |
| Weird shapes            | Try other methods  |
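
A minimal sketch that handles the first two pitfalls: fix the random seed, standardize the columns, and let kmeans() try several random starting points via nstart:

set.seed(123)                          # reproducible cluster assignments
customers_scaled <- scale(customers)   # put age and spending on the same scale
km <- kmeans(customers_scaled, centers = 2, nstart = 25)
km$cluster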

🌳 Hierarchical Clustering

Building a Family Tree for Data

Remember the K-Means party? Hierarchical clustering takes a different approach: instead of deciding the number of tables up front, it builds a family tree that shows how everyone is related.

Two Approaches

Bottom-Up (Agglomerative): start with individuals, merge into families

Top-Down (Divisive): start with everyone, split into groups
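
R's built-in hclust() (used below) takes the bottom-up route. For the top-down route, one option is diana() from the cluster package, if it is installed. A small sketch, reusing the customers data frame from the K-Means example:

# Divisive (top-down) clustering with cluster::diana
library(cluster)

dc <- diana(dist(customers))           # customers: the age/spending data from the K-Means section
pltree(dc, main = "Divisive (DIANA)")  # draw the resulting tree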

The Merging Process

graph TD
  A["Each point is its own cluster"] --> B["Find two closest clusters"]
  B --> C["Merge them into one"]
  C --> D{One cluster left?}
  D -->|No| B
  D -->|Yes| E["Tree complete!"]

Hierarchical Clustering in R

# Same customer data
customers <- data.frame(
  age = c(25,30,35,50,55,60,22,28),
  spending = c(200,220,180,400,450,380,190,210)
)

# Calculate distances
d <- dist(customers)

# Build the tree
hc <- hclust(d, method = "complete")

# Draw it!
plot(hc)

Cutting the Tree

# Want 2 groups? Cut at height that gives 2
groups <- cutree(hc, k = 2)

# Or cut at specific height
groups <- cutree(hc, h = 150)
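
To see where a cut lands on the tree itself, rect.hclust() boxes the resulting clusters on the dendrogram, and table() shows how big each group is:

# Redraw the tree and outline the 2 clusters
plot(hc)
rect.hclust(hc, k = 2, border = "red")

# Count how many customers ended up in each group
table(groups)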

Linkage Methods

| Method   | How It Measures Distance |
|----------|--------------------------|
| Single   | Closest pair             |
| Complete | Farthest pair            |
| Average  | Mean of all pairs        |
| Ward’s   | Minimizes variance       |
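
The linkage choice can change the tree noticeably. One quick way to compare them on the same distance matrix (ward.D2 is the name hclust() uses for Ward's method):

# Fit and plot the tree for four linkage methods side by side
op <- par(mfrow = c(2, 2))
for (m in c("single", "complete", "average", "ward.D2")) {
  plot(hclust(d, method = m), main = m, xlab = "", sub = "")
}
par(op)  # restore the plotting layout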

When to Use What?

K-Means:
✓ Know how many groups
✓ Large datasets
✓ Spherical clusters

Hierarchical:
✓ Don't know group count
✓ Want to see relationships
✓ Smaller datasets

🎨 Putting It All Together

The Complete Workflow

# 1. Load and prepare data
data <- scale(mydata)  # Standardize!

# 2. Maybe reduce dimensions first
pca <- prcomp(data)
data_reduced <- pca$x[, 1:2]  # Use PC1 & PC2

# 3. Choose clustering method
# For unknown K:
hc <- hclust(dist(data_reduced))
plot(hc)  # Look at tree

# For known K:
km <- kmeans(data_reduced, centers = 3)

Visualization

# Color by cluster
plot(
  data_reduced,
  col = km$cluster,
  pch = 19,
  main = "My Clusters!"
)

# Add cluster centers
points(
  km$centers,
  col = 1:3, pch = 8, cex = 2
)

🎁 Quick Summary

| Technique         | What It Does           | Think Of It As            |
|-------------------|------------------------|---------------------------|
| Distance Measures | Measures closeness     | A ruler for data          |
| PCA               | Reduces dimensions     | Finding best camera angle |
| K-Means           | Groups into K clusters | Seating party guests      |
| Hierarchical      | Builds cluster tree    | Creating family tree      |

πŸš€ You’re Ready!

You now understand the four pillars of unsupervised learning in R:

  1. Measure distances between points
  2. Simplify with PCA when needed
  3. Group with K-Means or Hierarchical clustering
  4. Visualize and interpret your discoveries

Remember: There’s no "right answer" in unsupervised learning. You’re an explorer discovering patterns in data. Sometimes the most interesting findings are the ones nobody expected!

"The goal is to turn data into information, and information into insight." – Carly Fiorina

Happy clustering! πŸŽ‰
