
Reinforcement Learning Fundamentals 🤖

The Story of Max the Robot Dog

Imagine you have a smart robot dog named Max. Max doesn’t know anything when you first turn him on. But here’s the magic—Max can learn from experience!

Every time Max does something good (like sitting when you say “sit”), you give him a treat. Every time he does something wrong (like chewing your shoes), you say “No!” Max remembers what works and what doesn’t. Over time, Max becomes the best-behaved robot dog ever!

This is exactly how Reinforcement Learning works.


What is Reinforcement Learning?

Reinforcement Learning (RL) is teaching a computer to learn by trying things and seeing what happens.

graph TD A["🤖 Agent tries something"] --> B{What happened?} B -->|Good result| C["✅ Got a reward!"] B -->|Bad result| D["❌ Got punished"] C --> E["Do this more!"] D --> F["Avoid this next time"] E --> A F --> A

Real-Life Examples:

  • A game AI learning to beat you at chess
  • A robot learning to walk without falling
  • YouTube learning what videos you like

RL Problem Formulation

Think of RL as a simple loop:

  1. Look at what’s happening
  2. Do something
  3. Get feedback (reward or punishment)
  4. Learn from what happened
  5. Repeat!

Example: Teaching a toddler to walk

  • Toddler looks around (sees the room)
  • Toddler takes a step (action)
  • Either stays standing (reward!) or falls (oops!)
  • Learns what movements work
  • Tries again and again
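
Here is that same loop as a tiny Python sketch. Everything in it (the ToyEnvironment class, the "left"/"right" actions, the random agent) is made up purely for illustration, not taken from any particular library:

```python
import random

class ToyEnvironment:
    """A made-up one-dimensional world: start at 0, win by reaching position 3."""

    def reset(self):
        self.position = 0
        return self.position                      # the starting state

    def step(self, action):
        self.position += 1 if action == "right" else -1
        reward = 1 if self.position == 3 else 0   # a treat for reaching the goal
        done = self.position in (3, -3)           # the episode ends at either edge
        return self.position, reward, done

env = ToyEnvironment()
state = env.reset()
done = False
while not done:                                   # look -> act -> get feedback -> repeat
    action = random.choice(["left", "right"])     # a naive agent that hasn't learned anything yet
    state, reward, done = env.step(action)
    print(f"moved {action}, now at {state}, reward {reward}")
```

This agent just wanders randomly; learning is what turns that random wandering into a good strategy, as the rest of this page explains.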

Agent and Environment

The Agent 🤖

The Agent is the learner—the one making decisions.

Think of the agent as a player in a video game. The player doesn’t control the world, but they can choose what to do in it.

Examples of Agents:

  • A robot vacuum deciding where to clean
  • A chess program deciding which piece to move
  • Max the robot dog deciding whether to sit or run

The Environment 🌍

The Environment is everything the agent interacts with.

It’s the world around the agent—the game board, the room, the maze.

Examples of Environments:

  • The chess board and pieces
  • Your house (for the robot vacuum)
  • The park (for Max the robot dog)
graph LR A["🤖 Agent"] -->|takes action| B["🌍 Environment"] B -->|sends back| C["📊 State + Reward"] C --> A

State and Observation

What is State? 📍

State = Everything about the world right now.

Imagine taking a photo of a chess game. That photo shows:

  • Where every piece is
  • Whose turn it is
  • Whether anyone has castled

That complete picture is the state.

What is Observation? 👀

Sometimes the agent can’t see everything. What it CAN see is called an observation.

Example: In a card game:

  • Full state = All cards (yours, opponent’s, deck)
  • Observation = Only your cards + cards on table

Simple Analogy:

  • State = The whole room with lights on
  • Observation = What you see with a flashlight
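
A rough sketch of that difference in code, using the card-game example (all the field names are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class FullState:                     # everything the environment knows
    my_hand: list
    opponent_hand: list              # hidden from the agent
    deck: list                       # also hidden
    table: list

def observe(state: FullState) -> dict:
    """The agent's partial view: only what it is allowed to see."""
    return {"my_hand": state.my_hand, "table": state.table}
```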

Actions

Actions are things the agent can do.

In any situation, the agent picks ONE action from its list of possible moves.

Examples:

| Agent | Possible Actions |
| --- | --- |
| Chess AI | Move pawn, move knight, castle… |
| Robot vacuum | Go forward, turn left, turn right, dock |
| Max the dog | Sit, bark, fetch, lie down |

Key Point: The agent chooses actions. The environment responds to those actions.

graph TD A["Agent sees state"] --> B["Thinks about options"] B --> C["Picks best action"] C --> D["Does the action"] D --> E["Environment changes"]

Rewards

Rewards tell the agent if it did well or poorly.

Think of rewards like points in a video game:

  • +10 points for eating a cherry 🍒
  • -5 points for hitting a wall 🧱
  • +100 points for winning! 🏆

The Goal

The agent’s only goal: Get as many reward points as possible over time.

Examples:

| Situation | Reward |
| --- | --- |
| Robot vacuum cleans spot | +1 |
| Robot vacuum hits furniture | -2 |
| Game AI wins the game | +100 |
| Game AI loses the game | -100 |

Important: Rewards can be:

  • Positive (good job, do more of this!)
  • Negative (bad move, avoid this!)
  • Zero (nothing special happened)
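
A reward is just a number the environment hands back after each action. A sketch of a reward function for the robot-vacuum example above (the event names are invented for illustration):

```python
def reward_for(event):
    """Map what just happened to a single number: positive, negative, or zero."""
    rewards = {
        "cleaned_spot": +1,      # good job, do more of this
        "hit_furniture": -2,     # bad move, avoid this
        "idle": 0,               # nothing special happened
    }
    return rewards.get(event, 0)
```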

Policy

Policy = The agent’s strategy or game plan.

It answers: “When I see THIS situation, what should I DO?”

Simple Example:

Max the robot dog has a policy:

  • See “sit” command → Action: Sit down
  • See food bowl → Action: Walk to bowl
  • See stranger → Action: Bark

Written as Math (but simple!)

Policy is often written as π (the Greek letter “pi”).

π(state) = action

English: “My policy tells me what action to take in each state.”

graph LR A["📍 Current State"] --> B["🧠 Policy π"] B --> C["✋ Action to take"]

Goal: Find the BEST policy—the one that gets the most rewards!
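
The simplest possible policy is just a lookup table from states to actions. A sketch of Max's policy from the example above (the state and action names are made up for illustration):

```python
policy = {
    "hear_sit_command": "sit",
    "see_food_bowl": "walk_to_bowl",
    "see_stranger": "bark",
}

def pi(state):
    """π(state) = action: the policy tells Max what to do in each state."""
    return policy.get(state, "lie_down")    # a default action for states not in the table

print(pi("see_food_bowl"))                  # -> walk_to_bowl
```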


Value Function

Value Function answers: “How good is it to be HERE?”

Think of it like this:

  • You’re in a maze looking for treasure
  • Some spots are close to treasure (HIGH value)
  • Some spots are dead ends (LOW value)

Why It Matters

The value function helps the agent make smart choices.

Example:

Two paths in a video game:

  • Path A: Leads to a room with coins (Value = HIGH)
  • Path B: Leads to a monster (Value = LOW)

The value function says: “Go to Path A!”

Written Simply

V(state) = Expected total future rewards from this state

English: “How many points can I probably get from here?”
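
For the curious, the textbook version of that sentence adds one detail the story skips: a discount factor γ (a number between 0 and 1) that makes rewards far in the future count a little less. With that, the value of a state s under a policy π is written:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} \, r_{t} \;\middle|\; s_{0} = s \right]
```

Read aloud, it says exactly the English line above: the expected sum of (slightly discounted) future rewards, starting from state s and following policy π.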


Q-Function (Action-Value Function)

Q-Function answers: “How good is it to take THIS action in THIS situation?”

It’s like the value function, but more specific.

The Difference

| Function | Question |
| --- | --- |
| Value (V) | “How good is this place?” |
| Q-Function | “How good is doing THIS action in this place?” |

Example

Max is in the living room. He can:

  • Sit: Q = 10 (owner gives treat!)
  • Bark: Q = -5 (owner says “No!”)
  • Fetch ball: Q = 15 (owner plays with him!)

The Q-function tells Max: “Fetching the ball is the best choice here!”

Written Simply

Q(state, action) = Expected total rewards if I do this action here

graph TD A["📍 State: Living Room"] --> B["Sit → Q=10"] A --> C["Bark → Q=-5"] A --> D["Fetch → Q=15"] D --> E["🏆 Best Choice!"]

Putting It All Together

Let’s see how all pieces work together with Max the robot dog:

| Concept | Example |
| --- | --- |
| Agent | Max the robot dog |
| Environment | Your house |
| State | Where Max is, what he sees |
| Observation | What Max can actually sense |
| Actions | Sit, bark, fetch, run |
| Rewards | +5 for good behavior, -3 for bad |
| Policy | Max’s strategy for getting treats |
| Value Function | “The kitchen is great!” (often gets food) |
| Q-Function | “Sitting when told = high reward” |

The RL Learning Loop

Here’s how learning happens:

graph TD A["1. Agent sees State"] --> B["2. Policy picks Action"] B --> C["3. Agent does Action"] C --> D["4. Environment responds"] D --> E["5. Agent gets Reward"] E --> F["6. Agent updates Q/Value"] F --> G["7. Policy improves"] G --> A

Over many tries, the agent gets better and better!
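
One classic way to run this loop is tabular Q-learning, where step 6 is a small update to a Q-value after every action. A minimal sketch, reusing the toy environment idea from earlier; the learning rate, discount, and exploration numbers are just reasonable defaults chosen for illustration:

```python
import random
from collections import defaultdict

ACTIONS = ["left", "right"]
alpha, gamma, epsilon = 0.1, 0.9, 0.2      # learning rate, discount, exploration rate
Q = defaultdict(float)                      # Q[(state, action)], starts at 0 for everything

def learn(env, episodes=500):
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # steps 1-3: see the state, let the (mostly greedy) policy pick an action, do it
            if random.random() < epsilon:
                action = random.choice(ACTIONS)                      # explore: try something new
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])   # exploit: best known action
            next_state, reward, done = env.step(action)
            # steps 4-6: use the reward to nudge the Q-value toward a better estimate
            best_next = max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state                                       # step 7: repeat from here

# learn(ToyEnvironment())   # e.g. with the toy environment sketched earlier
```

Because the policy always reads from Q (step 7), improving the Q-values automatically improves the behavior.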


Why This Matters 🌟

Reinforcement Learning is everywhere:

  • Self-driving cars learn to navigate roads
  • Game AIs like AlphaGo beat world champions
  • Robots learn to walk, grab, and dance
  • Recommendation systems learn what you like

You now understand the foundation! Every RL system uses:

  • An Agent making decisions
  • An Environment responding
  • States showing what’s happening
  • Actions the agent can take
  • Rewards guiding learning
  • Policies encoding strategies
  • Value/Q-Functions measuring goodness

You’ve just learned how machines learn to think! 🧠✨
