Markov Decision Process
Reinforcement learning (RL) provides a computational approach for agents to learn optimal behaviors through interactions with their environment, aiming to maximize cumulative rewards. The Markov Decision Process (MDP) offers a formal mathematical structure to represent and analyze reinforcement learning scenarios.
An MDP consists of these essential elements:
- State and observation: A state $s \in \mathcal{S}$ represents a complete description of the environment at a given moment, containing all relevant information. In contrast, an observation $o$ provides only partial information about the current state. The collection of all possible states forms the state space, denoted as $\mathcal{S}$, which can be either discrete (like a chess configuration) or continuous (such as a robot's position and velocity).
- Action: An action $a \in \mathcal{A}$ represents a choice the agent makes that can be observed. The collection of all possible actions constitutes the action space, denoted as $\mathcal{A}$. This space may be discrete (such as directional movements) or continuous (like velocity vectors in a physical system).
- Transition probability: The transition probability $P(s' \mid s, a)$ defines the likelihood of reaching state $s'$ after executing action $a$ in state $s$. For continuous state spaces, this becomes a probability density function.
- Reward function: The reward function $r(s, a)$ specifies the immediate reward value that the agent receives after performing action $a$ in state $s$.
- Policy: The policy $\pi(a \mid s)$ is the probability distribution (or density) over the action space $\mathcal{A}$ given state $s$.
As the agent interacts with the environment within an MDP framework using a policy $\pi$:
- At time $t = 0$, the agent observes the initial state $s_0$.
- The agent selects an action $a_t$ according to the policy $\pi$, i.e., $a_t \sim \pi(\cdot \mid s_t)$.
- The agent receives a reward $r_t = r(s_t, a_t)$.
- The environment transitions from state $s_t$ to state $s_{t+1}$ with probability $P(s_{t+1} \mid s_t, a_t)$, i.e., $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.
- The process repeats for $t=1, 2, \cdots $.
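As an illustration of this loop, here is a minimal sketch, not part of the original text, that builds a small random tabular MDP and rolls the interaction forward for a few steps. The sizes, random seed, and names (`P`, `r`, `pi`) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy tabular MDP with 3 states and 2 actions (all numbers are made up).
n_states, n_actions = 3, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] = P(s' | s, a)
r = rng.normal(size=(n_states, n_actions))                        # r[s, a] = r(s, a)
pi = np.full((n_states, n_actions), 1.0 / n_actions)              # pi[s, a] = pi(a | s), uniform here

s = 0  # initial state s_0
for t in range(5):
    a = rng.choice(n_actions, p=pi[s])        # a_t ~ pi(. | s_t)
    reward = r[s, a]                          # r_t = r(s_t, a_t)
    s_next = rng.choice(n_states, p=P[s, a])  # s_{t+1} ~ P(. | s_t, a_t)
    print(f"t={t}: s={s}, a={a}, r={reward:.2f}, s'={s_next}")
    s = s_next
```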
Value Function
In reinforcement learning, our objective is to find a policy $\pi$ that maximizes the expected cumulative discounted reward

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right],$$

where $\gamma \in (0, 1)$ is the discount factor and $\tau = (s_0, a_0, s_1, a_1, \cdots)$ denotes a trajectory generated by following the policy $\pi$.
Four functions are fundamental to reinforcement learning:
- On-Policy Value Function ($V^{\pi}(s)$): This function calculates the expected return when starting in state $s$ and following policy $\pi$ thereafter:

  $$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s\right]$$

- On-Policy Action-Value Function ($Q^{\pi}(s, a)$): This function calculates the expected return when starting in state $s$, taking action $a$ (which might not be according to the policy), and then following policy $\pi$ afterward:

  $$Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s, a_0 = a\right]$$

- Optimal Value Function ($V^{*}(s)$): This function calculates the expected return when starting in state $s$ and following the optimal policy in the environment:

  $$V^{*}(s) = \max_{\pi}\, \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s\right]$$

- Optimal Action-Value Function ($Q^{*}(s, a)$): This function calculates the expected return when starting in state $s$, taking action $a$, and then following the optimal policy in the environment:

  $$Q^{*}(s, a) = \max_{\pi}\, \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s, a_0 = a\right]$$
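The on-policy definitions can be checked numerically. Below is a minimal sketch, not from the original text, that estimates $V^{\pi}$ and $Q^{\pi}$ for a small randomly generated tabular MDP by averaging sampled discounted returns, exactly as the definitions prescribe. The MDP sizes, seed, rollout counts, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy tabular MDP (illustrative numbers only) and a fixed policy pi to evaluate.
n_states, n_actions, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.normal(size=(n_states, n_actions))                        # r[s, a]
pi = np.full((n_states, n_actions), 1.0 / n_actions)              # pi[s, a] = pi(a | s)

def sampled_return(s, a=None, horizon=100):
    """Sample one trajectory from s (optionally forcing the first action a)
    and return its discounted return; the infinite sum is truncated at `horizon`."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        if a is None:
            a = rng.choice(n_actions, p=pi[s])
        total += discount * r[s, a]
        discount *= gamma
        s = rng.choice(n_states, p=P[s, a])
        a = None  # only the first action is forced; afterwards follow pi
    return total

n_rollouts = 500
V = np.array([np.mean([sampled_return(s) for _ in range(n_rollouts)])
              for s in range(n_states)])
Q = np.array([[np.mean([sampled_return(s, a) for _ in range(n_rollouts)])
               for a in range(n_actions)] for s in range(n_states)])
print("V^pi ~", np.round(V, 2))
print("Q^pi ~", np.round(Q, 2))
```

Estimating $V^{*}$ and $Q^{*}$ this way would additionally require searching over policies; a dynamic-programming sketch for that appears in the Bellman equation section below.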
Advantage Functions
In reinforcement learning, we often need to measure how much better one action is compared to the average action in a given state. This is captured by the advantage function:

$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$$

The advantage function $A^{\pi}(s, a)$ is positive when action $a$ is better than the policy's average behavior in state $s$ and negative when it is worse.
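As a quick illustration (not from the original text), the advantage can be computed from a $Q^{\pi}$ table and the policy, using the identity $V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q^{\pi}(s, a)\big]$ that follows from the definitions above; the numbers below are made up.

```python
import numpy as np

# Made-up Q^pi values for 3 states x 2 actions, and a policy pi[s, a].
Q = np.array([[1.0, 2.0],
              [0.5, 0.0],
              [3.0, 3.0]])
pi = np.array([[0.5, 0.5],
               [0.8, 0.2],
               [0.5, 0.5]])

V = (pi * Q).sum(axis=1)     # V^pi(s) = E_{a ~ pi}[Q^pi(s, a)]
A = Q - V[:, None]           # A^pi(s, a) = Q^pi(s, a) - V^pi(s)
print(A)
print((pi * A).sum(axis=1))  # E_{a ~ pi}[A^pi(s, a)] = 0 for every state
```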
Bellman Equation
All four value functions satisfy recursive self-consistency equations known as Bellman equations.
For the on-policy value functions, the Bellman equations are:

$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim P(\cdot \mid s, a)}\left[r(s, a) + \gamma V^{\pi}(s')\right]$$

$$Q^{\pi}(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\left[Q^{\pi}(s', a')\right]\right]$$

For the optimal value functions, the Bellman equations are:

$$V^{*}(s) = \max_{a}\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[r(s, a) + \gamma V^{*}(s')\right]$$

$$Q^{*}(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[r(s, a) + \gamma \max_{a'} Q^{*}(s', a')\right]$$
These equations are essentially applications of dynamic programming principles that we covered earlier.
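For a tabular MDP these equations can be solved by fixed-point iteration, which is the dynamic-programming view mentioned above. The following sketch, not from the original text, applies the on-policy Bellman backup (policy evaluation) and the optimal Bellman backup (value iteration) to a small random MDP; all names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy tabular MDP (illustrative numbers only).
n_states, n_actions, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.normal(size=(n_states, n_actions))                        # r[s, a]
pi = np.full((n_states, n_actions), 1.0 / n_actions)              # pi[s, a]

# Policy evaluation: repeatedly apply the on-policy Bellman backup until V^pi stops changing.
V = np.zeros(n_states)
for _ in range(1000):
    Q = r + gamma * P @ V             # Q[s, a] = r(s, a) + gamma * E_{s'}[V(s')]
    V_new = (pi * Q).sum(axis=1)      # V(s)    = E_{a ~ pi}[Q(s, a)]
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print("V^pi =", np.round(V, 3))

# Value iteration: the same idea with the optimal Bellman backup (max over actions).
V_star = np.zeros(n_states)
for _ in range(1000):
    Q_star = r + gamma * P @ V_star
    V_new = Q_star.max(axis=1)        # V*(s) = max_a Q*(s, a)
    if np.max(np.abs(V_new - V_star)) < 1e-10:
        break
    V_star = V_new
print("V* =", np.round(V_star, 3))
```

Because both backups are contractions, the iterates converge to the unique fixed points $V^{\pi}$ and $V^{*}$.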
We only prove the Bellman equation for the on-policy value function $V^{\pi}(s)$. Separating the first reward from the rest of the sum and using the Markov property,

$$
\begin{aligned}
V^{\pi}(s) &= \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s\right] \\
&= \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim P(\cdot \mid s, a)}\left[r(s, a) + \gamma\, \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s'\right]\right] \\
&= \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim P(\cdot \mid s, a)}\left[r(s, a) + \gamma V^{\pi}(s')\right].
\end{aligned}
$$
The proofs for the other three value functions are similar.
Temporal Difference
The Bellman equation provides a recursive relationship that allows us to compute value functions iteratively. Given a trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \cdots)$ and a current value estimate $V$, the temporal difference (TD) error at time step $t$ is

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),$$

which measures how far the current estimate $V(s_t)$ is from the one-step bootstrapped target $r_t + \gamma V(s_{t+1})$ suggested by the Bellman equation.
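For instance, given a recorded trajectory and a current value table, the TD errors can be computed directly. The short sketch below (not from the original text, with made-up numbers) does exactly that.

```python
import numpy as np

gamma = 0.9
# An illustrative trajectory (states and rewards) and a current value table V.
states  = [0, 2, 1, 1, 2]          # s_0, ..., s_4
rewards = [1.0, 0.0, -0.5, 2.0]    # r_0, ..., r_3
V = np.array([0.2, -0.1, 0.5])     # current estimate V(s) for 3 states

# TD error at each step: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
deltas = [rewards[t] + gamma * V[states[t + 1]] - V[states[t]]
          for t in range(len(rewards))]
print(np.round(deltas, 3))
```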
Value Function Estimation
To estimate the value function $V^{\pi}(s)$, we can parameterize it as $V_{\phi}(s)$ (for example, a value network with parameters $\phi$) and train it using one of the following approaches:
- Monte Carlo Sampling: Let $\hat{R}_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ be the return of the trajectory from time step $t$ onward, also called the reward-to-go. We directly use the definition of the value function and fit $V_{\phi}$ by regressing it onto the sampled reward-to-go, e.g., by minimizing $\mathbb{E}_t\big[\big(V_{\phi}(s_t) - \hat{R}_t\big)^2\big]$. Notice that the reward-to-go $\hat{R}_t$ is a single-sample Monte Carlo estimate of $V^{\pi}(s_t)$: its expectation given $s_t$ is $V^{\pi}(s_t)$, but an individual sample can have high variance. You can also estimate the action-value function $Q^{\pi}(s, a)$ in the same way, by regressing $Q_{\phi}(s_t, a_t)$ onto $\hat{R}_t$ (a tabular sketch comparing both estimation approaches appears at the end of this section).
- Temporal Difference: We use the Bellman equation and the temporal difference error to learn the value network, e.g., by minimizing the squared TD error $\big(V_{\phi}(s_t) - (r_t + \gamma V_{\phi}(s_{t+1}))\big)^2$. Similarly, you can estimate the action-value function $Q^{\pi}(s, a)$ with the TD target $r_t + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s_{t+1})}\big[Q_{\phi}(s_{t+1}, a')\big]$ given by its Bellman equation. You can even estimate the optimal action-value function $Q^{*}(s, a)$ by replacing this expectation with a maximum, using the TD target $r_t + \gamma \max_{a'} Q_{\phi}(s_{t+1}, a')$ from the Bellman equation for $Q^{*}$.
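To make the two estimation strategies concrete, here is a minimal tabular sketch, not taken from the original text: it evaluates a fixed policy on a small randomly generated MDP, once by moving each state's value toward the sampled reward-to-go (Monte Carlo) and once by moving it toward the bootstrapped TD target. The sizes, seed, and names (`P`, `r`, `pi`, `alpha`) are illustrative assumptions; a real implementation would typically use a neural value network instead of a table.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy tabular MDP (illustrative numbers only) and a fixed policy to evaluate.
n_states, n_actions, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.normal(size=(n_states, n_actions))                        # r[s, a]
pi = np.full((n_states, n_actions), 1.0 / n_actions)              # pi[s, a]

def sample_episode(horizon=100):
    """Roll out the policy for a fixed horizon; returns lists of states and rewards."""
    s, states, rewards = rng.integers(n_states), [], []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi[s])
        states.append(s); rewards.append(r[s, a])
        s = rng.choice(n_states, p=P[s, a])
    states.append(s)
    return states, rewards

alpha = 0.05  # step size for the incremental updates

# --- Monte Carlo: move V(s_t) toward the reward-to-go R_t ---
V_mc = np.zeros(n_states)
for _ in range(2000):
    states, rewards = sample_episode()
    R = 0.0
    for t in reversed(range(len(rewards))):   # accumulate reward-to-go backwards
        R = rewards[t] + gamma * R
        V_mc[states[t]] += alpha * (R - V_mc[states[t]])

# --- TD(0): move V(s_t) toward the bootstrapped target r_t + gamma * V(s_{t+1}) ---
V_td = np.zeros(n_states)
for _ in range(2000):
    states, rewards = sample_episode()
    for t in range(len(rewards)):
        target = rewards[t] + gamma * V_td[states[t + 1]]
        V_td[states[t]] += alpha * (target - V_td[states[t]])

print("Monte Carlo V ~", np.round(V_mc, 2))
print("TD(0)       V ~", np.round(V_td, 2))
```

Both tables converge to (noisy) approximations of $V^{\pi}$: the Monte Carlo targets are unbiased but high-variance, while the TD targets have lower variance but are biased by the current estimate.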