
Linear Regression in R: Building Your First Prediction Machine

The Story of the Fortune Teller

Imagine you have a magical crystal ball. But this crystal ball is special - it learns from the past to predict the future. You tell it: “When I study 2 hours, I score 70 marks. When I study 4 hours, I score 85 marks.” The crystal ball notices a pattern and says: “Ah! More study hours = higher scores. Let me draw a line through your data!”

That’s Linear Regression - drawing the best possible straight line through your data points to make predictions.


Formula Objects: Teaching R What to Predict

Before we can predict anything, we need to tell R what we want to predict and what we’ll use to predict it. We do this with a formula.

The Magic Recipe

A formula in R looks like this:

y ~ x

Think of it as saying: “I want to predict y using x”

The ~ symbol (called tilde) means “depends on” or “is predicted by”.

Real Examples

# Predict score based on hours studied
score ~ hours

# Predict house price based on size
price ~ size

# Multiple predictors? No problem!
salary ~ experience + education

# Everything as a predictor
mpg ~ .

Quick Reference

Formula        Meaning
y ~ x          y depends on x
y ~ x1 + x2    y depends on x1 AND x2
y ~ .          y depends on ALL other columns
y ~ x - 1      No intercept (line through origin)
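
A formula is a real R object, not just syntax: you can store it in a variable, inspect it, and even build it from strings. A quick sketch:

# A formula is a first-class R object
f <- score ~ hours

class(f)      # "formula"
all.vars(f)   # "score" "hours"

# Build one from strings (handy in loops)
as.formula(paste("score", "~", "hours"))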

Linear Regression: Drawing the Best Line

Now the exciting part! We use lm() - which stands for Linear Model.

Your First Regression

# Create some data
study_data <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(55, 60, 70, 75, 85)
)

# Build the model
my_model <- lm(score ~ hours,
               data = study_data)

That’s it! R just drew the best possible line through your data.

What Just Happened?

Your data points → the lm() function → finds the best line → y = intercept + slope × x → ready to predict!

See Your Line

# What's the equation?
my_model

# Output:
# (Intercept)        hours
#        46.5          7.5

This tells us: score = 46.5 + 7.5 × hours

So if you study 6 hours: score = 46.5 + 7.5 × 6 = 91.5 marks!
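
Those two numbers live inside the model object, so you can let R do this arithmetic for you. A small sketch using coef():

# Extract the coefficients, then predict by hand
b <- coef(my_model)   # named vector: (Intercept), hours
b["(Intercept)"] + b["hours"] * 6
# 91.5 -- matches the hand calculation above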


Regression Summary: The Full Report Card

Want to know how good your prediction line is? Use summary().

Getting the Summary

summary(my_model)

Understanding the Output

The summary shows you several important things:

1. Coefficients Table

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   46.500      1.658   28.04 0.000100 ***
hours          7.500      0.500   15.00 0.000643 ***

  • Estimate: The actual numbers in your equation
  • Std. Error: How uncertain we are about each number
  • t value: How confident we are (bigger = more confident)
  • Pr(>|t|): The p-value (smaller = more significant)
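
If you want these numbers in a script rather than on screen, the same table is available as a plain numeric matrix:

# Coefficient table as a matrix
coef(summary(my_model))

# Just the p-values
coef(summary(my_model))[, "Pr(>|t|)"]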

2. R-squared: Your Model’s Grade

Multiple R-squared: 0.987

Think of R-squared as a percentage score:

  • 0.987 = 98.7% of the variation in scores is explained by study hours
  • Higher is better (but not always 1.0!)
  • R² = 0: the line explains nothing
  • R² = 0.5: the line explains 50% of the variation
  • R² = 1.0: perfect prediction

Quick Summary Cheatsheet

Metric        Good Sign
p-value       < 0.05
R-squared     > 0.7
Residual SE   Low number
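
All three metrics can be pulled straight out of the summary object instead of read off the printout:

# Grab the metrics programmatically
s <- summary(my_model)

s$r.squared       # Multiple R-squared
s$adj.r.squared   # Adjusted R-squared
s$sigma           # Residual standard error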

Prediction from Models: Using Your Crystal Ball

Now the fun part - making predictions!

The predict() Function

# New data to predict
new_students <- data.frame(
  hours = c(6, 7, 8)
)

# Make predictions
predict(my_model,
        newdata = new_students)

# Output: 91.5  99.0  106.5

Predictions with Confidence

Not 100% sure about your predictions? Get a range!

# Confidence interval
predict(my_model,
        newdata = new_students,
        interval = "confidence")

#      fit    lwr    upr
# 1  91.5   86.2   96.8
# 2  99.0   92.2  105.8
# 3 106.5   98.2  114.8

  • fit: Your prediction
  • lwr: Lower bound (95% confident the true mean is above this)
  • upr: Upper bound (95% confident the true mean is below this)

Prediction vs Confidence Intervals

# For the mean response
interval = "confidence"

# For a single new observation
interval = "prediction"

Prediction intervals are wider because individual values vary more than averages.
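
You can see this for yourself by computing both intervals for the same new data and comparing their widths. A quick sketch:

# Confidence vs prediction intervals side by side
conf_int <- predict(my_model, newdata = new_students,
                    interval = "confidence")
pred_int <- predict(my_model, newdata = new_students,
                    interval = "prediction")

# Widths -- the prediction interval is always wider
conf_int[, "upr"] - conf_int[, "lwr"]
pred_int[, "upr"] - pred_int[, "lwr"]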


Regression Diagnostics: Is Your Model Healthy?

Just like a doctor checks your health, we need to check our model’s health.

The 4 Key Checks

# Create 4 diagnostic plots
par(mfrow = c(2, 2))
plot(my_model)

This gives you 4 important plots:

1. Residuals vs Fitted

  • Should look like random scatter
  • No patterns allowed!

2. Normal Q-Q

  • Points should follow the diagonal line
  • Checks if errors are normally distributed

3. Scale-Location

  • Should be a flat horizontal band
  • Checks for constant variance

4. Residuals vs Leverage

  • Identifies influential outliers
  • Watch for points outside dashed lines
Run diagnostics → any patterns? No pattern: the model is healthy! Pattern found: the model needs help → transform data, add variables, or remove outliers.
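
You don't have to draw all four at once. The which argument of plot() selects individual plots:

# Pick diagnostic plots one at a time
plot(my_model, which = 1)  # Residuals vs Fitted
plot(my_model, which = 2)  # Normal Q-Q
plot(my_model, which = 3)  # Scale-Location
plot(my_model, which = 5)  # Residuals vs Leverage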

Residual Analysis: Learning from Mistakes

A residual is the difference between what actually happened and what your model predicted.

Calculate Residuals

# Get residuals
residuals(my_model)

# Or equivalently
my_model$residuals

Residual = Actual - Predicted

# If actual score = 70
# And predicted = 67
# Residual = 70 - 67 = 3
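
You can verify this identity yourself: fitted() returns the model's predictions for your original data, and subtracting them from the actual scores reproduces the residuals:

# Residuals are actual minus fitted, by definition
study_data$score - fitted(my_model)

# Same values as residuals(my_model)
all.equal(study_data$score - fitted(my_model),
          residuals(my_model))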

What Residuals Tell Us

Pattern                   What It Means
Random scatter around 0   Model is good!
Curve pattern             Need a polynomial term
Funnel shape              Variance is not constant
Outliers                  Some unusual data points

Visualizing Residuals

# Simple residual plot
plot(my_model$fitted.values,
     my_model$residuals)
abline(h = 0, col = "red")

# Should look like random dots
# around the red line

Checking Normality

# Histogram of residuals
hist(residuals(my_model))

# Should look like a bell curve

# Shapiro-Wilk test
shapiro.test(residuals(my_model))
# p-value > 0.05: no evidence against normality

Influence Measures: Finding the Troublemakers

Some data points have more power than others. One weird point can pull your whole line in the wrong direction!

Three Key Measures

1. Leverage (Hat Values)

How far a point is from the center of your x values.

hatvalues(my_model)

High leverage points are at the edges of your data.

2. Cook’s Distance

The overall influence of each point on your model.

cooks.distance(my_model)

Rule of thumb: Watch points where Cook’s D > 4/n
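
That rule of thumb is easy to apply in code. A small sketch:

# Flag points with Cook's distance above 4/n
d <- cooks.distance(my_model)
n <- nobs(my_model)
which(d > 4 / n)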

3. DFBETAS

How much each point changes each coefficient.

dfbetas(my_model)

Visual Detection

# Plot Cook's distance
plot(my_model, which = 4)

# Points with high bars are
# influential!

The influence.measures() Function

Get everything at once:

influence.measures(my_model)

# Shows:
# - dfb.1_: Change in intercept
# - dfb.hour: Change in slope
# - dffit: Overall fit change
# - cov.r: Covariance ratio
# - cook.d: Cook's distance
# - hat: Leverage
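
Rather than scanning that table by eye, ask R to flag the suspects for you: calling summary() on the result prints only the rows judged potentially influential:

# Show only the flagged observations
infl <- influence.measures(my_model)
summary(infl)

# Logical matrix: which measure flagged which point
infl$is.inf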

What To Do With Influential Points?

Find an influential point → is it valid data? Valid data: keep it and report it. Data entry error: fix or remove it. Genuine outlier: run the model with and without it, then compare the results.

Putting It All Together: The Complete Workflow

# 1. FORMULA: Define what to predict
my_formula <- score ~ hours

# 2. FIT: Build the model
# (using the study_data frame from earlier)
model <- lm(my_formula, data = study_data)

# 3. SUMMARY: Check performance
summary(model)

# 4. DIAGNOSE: Check assumptions
par(mfrow = c(2,2))
plot(model)

# 5. RESIDUALS: Analyze errors
hist(residuals(model))

# 6. INFLUENCE: Find outliers
influence.measures(model)

# 7. PREDICT: Make predictions!
predict(model, newdata = new_students)

Key Takeaways

  1. Formula Objects (y ~ x) tell R what to predict
  2. lm() builds the linear model
  3. summary() shows how well it works
  4. predict() makes new predictions
  5. Diagnostics check if assumptions are met
  6. Residuals show where the model makes mistakes
  7. Influence Measures find powerful outliers

Your Model is Your Crystal Ball

You’ve learned to:

  • Write formulas that tell R your prediction goal
  • Build models that learn patterns from data
  • Read summaries to know how good your predictions are
  • Predict new values with confidence intervals
  • Check your model’s health with diagnostics
  • Analyze residuals to spot problems
  • Find influential points that might cause trouble

Now go predict the future! Just remember: your predictions are only as good as the pattern in your data. If the future breaks the pattern, even the best model won’t see it coming.

Happy Modeling!
