Markov Decision Process

Reinforcement learning (RL) provides a computational approach for agents to learn optimal behaviors through interactions with their environment, aiming to maximize cumulative rewards. The Markov Decision Process (MDP) offers a formal mathematical structure to represent and analyze reinforcement learning scenarios.

An MDP consists of these essential elements:

  • State and observation: A state s represents a complete description of the environment at a given moment, containing all relevant information. In contrast, an observation o provides only partial information about the current state. The collection of all possible states forms the state space, denoted as S, which can be either discrete (like a chess configuration) or continuous (such as a robot's position and velocity).
  • Action: An action a represents a choice the agent makes that can be observed. The collection of all possible actions constitutes the action space, denoted as A. This space may be discrete (such as directional movements) or continuous (like velocity vectors in a physical system).
  • Transition probability: The transition probability P(s′|s,a) defines the likelihood of reaching state s′ after executing action a in state s. For continuous state spaces, this becomes a probability density function.
  • Reward function: The reward function R(s,a) specifies the immediate reward value that the agent receives after performing action a in state s.
  • Policy: The policy π(a|s) is the probability distribution (or density) over actions a ∈ A given a state s ∈ S.

[Figure: MDP]

As the agent interacts with the environment within an MDP framework using a policy π, we observe a sequence of states, actions, and rewards τ = (s0, a0, r0, s1, a1, r1, …, st, at, rt), generated by the following process:

  • At time t=0, the agent observes the initial state s0.
  • The agent selects an action a0 according to the policy π, i.e., a0 ∼ π(⋅|s0).
  • The agent receives a reward r0=R(s0,a0).
  • The environment transitions from state s0 to state s1 with probability P(s1|s0,a0), i.e., s1 ∼ P(⋅|s0,a0).
  • The process repeats for t = 1, 2, ….
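
A minimal sketch of this interaction loop, using a small made-up tabular MDP in NumPy; the transition table `P`, reward table `R`, and the uniformly random policy are illustrative assumptions rather than any particular environment.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# Illustrative MDP: P[s, a] is a distribution over next states, R[s, a] is the reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # shape (S, A, S)
R = rng.normal(size=(n_states, n_actions))                        # shape (S, A)

def policy(s):
    """A uniformly random policy pi(a|s); a real agent would learn this."""
    return rng.integers(n_actions)

def sample_trajectory(s0=0, horizon=5):
    """Roll out (s_t, a_t, r_t) tuples by following the MDP interaction loop."""
    traj, s = [], s0
    for t in range(horizon):
        a = policy(s)                              # a_t ~ pi(.|s_t)
        r = R[s, a]                                # r_t = R(s_t, a_t)
        s_next = rng.choice(n_states, p=P[s, a])   # s_{t+1} ~ P(.|s_t, a_t)
        traj.append((s, a, r))
        s = s_next
    return traj

print(sample_trajectory())
```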

Value Function

In reinforcement learning, our objective is to find a policy πθ that maximizes the expected cumulative reward

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big], \quad \text{where } R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t,$$

where γ represents the discount factor. We usually use a neural network to parameterize the policy πθ, which is then called a policy network.

[Figure: policy network]
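
For a discrete action space, a policy network can be a small multilayer perceptron whose output defines a categorical distribution over actions. The sketch below is one possible PyTorch implementation; the layer sizes and state dimension are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    """pi_theta(a|s): maps a state vector to a categorical distribution over actions."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        logits = self.net(s)
        return Categorical(logits=logits)

pi = PolicyNetwork(state_dim=4, n_actions=2)
s = torch.randn(4)           # a dummy state observation
dist = pi(s)
a = dist.sample()            # a ~ pi_theta(.|s)
print(a.item(), dist.log_prob(a).item())
```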

There are four key functions that are fundamental to reinforcement learning:

  • On-Policy Value Function (Vπ(s)): This function calculates the expected return when starting in state s and following policy π thereafter:
$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0 = s\big]$$
  • On-Policy Action-Value Function (Qπ(s,a)): This function calculates the expected return when starting in state s, taking action a (which might not be according to the policy), and then following policy π afterward:
$$Q^\pi(s,a) = \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0 = s, a_0 = a\big]$$
  • Optimal Value Function (V*(s)): This function calculates the expected return when starting in state s and following the optimal policy in the environment:
$$V^*(s) = \max_{\pi} \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0 = s\big]$$
  • Optimal Action-Value Function (Q*(s,a)): This function calculates the expected return when starting in state s, taking action a, and then following the optimal policy in the environment:
$$Q^*(s,a) = \max_{\pi} \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0 = s, a_0 = a\big]$$
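
To make these definitions concrete, the sketch below estimates Vπ(s) for a single state by brute-force Monte Carlo: sample many trajectories that start in s, compute the discounted return R(τ) of each, and average. The tiny random MDP and the uniformly random policy are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] over next states
R = rng.normal(size=(n_states, n_actions))                        # r(s, a)

def discounted_return(rewards, gamma):
    """R(tau) = sum_t gamma^t r_t."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def estimate_V(s0, n_episodes=5000, horizon=50):
    """Monte Carlo estimate of V^pi(s0) under a uniformly random policy pi."""
    returns = []
    for _ in range(n_episodes):
        s, rewards = s0, []
        for _ in range(horizon):
            a = rng.integers(n_actions)              # a ~ pi(.|s)
            rewards.append(R[s, a])
            s = rng.choice(n_states, p=P[s, a])      # s' ~ P(.|s, a)
        returns.append(discounted_return(rewards, gamma))
    return np.mean(returns)

print(estimate_V(s0=0))   # approx V^pi(0); Q^pi(s, a) is the same with a fixed first action
```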

Advantage Functions

In reinforcement learning, we often need to measure how much better one action is compared to the average action in a given state. This is captured by the advantage function:

$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$

The advantage function Aπ(s,a) quantifies the relative benefit of taking a specific action a in state s, compared to selecting an action according to the policy π(⋅|s), assuming the agent follows policy π thereafter.
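
As a quick sanity check on this definition, the snippet below computes the advantages for a single state from a hypothetical Q-table and policy distribution; since Vπ(s) is the expectation of Qπ(s,a) under π, the advantage averages to zero under the policy.

```python
import numpy as np

# Hypothetical values for one state s: Q^pi(s, a) for 3 actions and the policy pi(.|s).
Q_s = np.array([1.0, 2.0, 0.5])
pi_s = np.array([0.2, 0.5, 0.3])

V_s = np.dot(pi_s, Q_s)        # V^pi(s) = E_{a~pi}[Q^pi(s, a)]
A_s = Q_s - V_s                # A^pi(s, a) = Q^pi(s, a) - V^pi(s)

print(A_s)                     # positive entries: better-than-average actions under pi
print(np.dot(pi_s, A_s))       # ~0: the advantage averages to zero under pi
```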

Bellman Equation

All four value functions satisfy recursive self-consistency equations known as Bellman equations.

For the on-policy value functions, the Bellman equations are:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi,\, s' \sim P}\big[r(s,a) + \gamma V^\pi(s')\big],$$
$$Q^\pi(s,a) = \mathbb{E}_{s' \sim P}\big[r(s,a) + \gamma V^\pi(s')\big] = \mathbb{E}_{s' \sim P,\, a' \sim \pi}\big[r(s,a) + \gamma Q^\pi(s',a')\big].$$

For the optimal value functions, the Bellman equations are:

$$V^*(s) = \max_{a} \mathbb{E}_{s' \sim P}\big[r(s,a) + \gamma V^*(s')\big],$$
$$Q^*(s,a) = \mathbb{E}_{s' \sim P}\big[r(s,a) + \gamma \max_{a'} Q^*(s',a')\big].$$

These equations are essentially applications of dynamic programming principles that we covered earlier.
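For a small tabular MDP, both sets of Bellman equations can be solved by dynamic programming. The sketch below runs iterative policy evaluation for Vπ (expectation backup) and value iteration for V* (max backup) on an illustrative random MDP.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # r(s, a)
pi = np.full((n_states, n_actions), 1.0 / n_actions)              # uniform policy pi(a|s)

# Policy evaluation: V^pi(s) = E_{a~pi, s'~P}[r(s,a) + gamma * V^pi(s')]
V_pi = np.zeros(n_states)
for _ in range(500):
    Q = R + gamma * P @ V_pi          # Q[s, a] = r(s,a) + gamma * sum_s' P[s,a,s'] V[s']
    V_pi = (pi * Q).sum(axis=1)       # expectation over a ~ pi

# Value iteration: V*(s) = max_a E_{s'~P}[r(s,a) + gamma * V*(s')]
V_star = np.zeros(n_states)
for _ in range(500):
    V_star = (R + gamma * P @ V_star).max(axis=1)

print(V_pi, V_star)
```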

We only prove the Bellman equation for the on-policy value function Vπ(s).

$$\begin{aligned}
V^\pi(s) &= \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0 = s\big] \\
&= \mathbb{E}_{\tau \sim \pi}\big[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s\big] \\
&= \mathbb{E}_{\tau \sim \pi}\big[r_0 + \gamma (r_1 + \gamma r_2 + \cdots) \mid s_0 = s\big] \\
&= \mathbb{E}_{\tau \sim \pi}\big[r_0 + \gamma\, \mathbb{E}_{\tau' \sim \pi}\big[R(\tau') \mid s'_0 = s_1\big] \mid s_0 = s\big] \\
&= \mathbb{E}_{\tau \sim \pi}\big[r_0 + \gamma V^\pi(s_1) \mid s_0 = s\big]
\end{aligned}$$

The proofs for the other three value functions are similar.

Temporal Difference

The Bellman equation provides a recursive relationship that allows us to compute value functions iteratively. Given a trajectory τ = (s0, a0, r0, s1, a1, r1, …, st, at, rt), and using the Bellman equation Vπ(st) = E[rt + γVπ(st+1)], we define the temporal difference (TD) error as:

$$\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$$
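
A minimal sketch of computing the TD error for one transition, assuming the value estimates live in a simple lookup table; a value network would play exactly the same role as the table here.

```python
import numpy as np

gamma = 0.9
V = np.array([0.5, 1.2, -0.3])   # current estimates of V^pi for 3 states

# One transition (s_t, r_t, s_{t+1}) from a sampled trajectory (illustrative numbers).
s_t, r_t, s_next = 0, 1.0, 2

delta_t = r_t + gamma * V[s_next] - V[s_t]   # TD error
print(delta_t)
```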

Value Function Estimation

To estimate Vπ, we often approximate it with a neural network Vϕπ, typically called a value network. There are two main methods to train the value network.

  • Monte Carlo Sampling: Let $\hat{R}_t = \sum_{l=t}^{T} \gamma^{l-t} r_l$ be the return of the trajectory from time step t onward, also called the reward-to-go. We directly use the definition of the value function, $V^\pi(s_t) = \mathbb{E}[\hat{R}_t \mid s_t]$, to fit the value network (both losses are sketched in code after this list):
$$L_{\mathrm{MC}}(\phi) = \sum_{t=0}^{T} \big(V^\pi_\phi(s_t) - \hat{R}_t\big)^2,$$

Notice that the reward-to-go $\hat{R}_t$ can only be computed by sampling the trajectory from st using the policy π.

You can also estimate the Q function by Monte Carlo sampling.

$$L_{\mathrm{MC}}(\varphi) = \sum_{t=0}^{T} \big(Q^\pi_\varphi(s_t, a_t) - \hat{R}_t\big)^2,$$
  • Temporal Difference: We use the Bellman equation and the temporal difference error to learn the value network.
$$L_{\mathrm{TD}}(\phi) = \sum_{t=0}^{T} \Big(V^\pi_\phi(s_t) - \big(r_t + \gamma V^\pi_\phi(s_{t+1})\big)\Big)^2,$$

Similarly, you can estimate the Qπ function by temporal difference.

$$L_{\mathrm{TD}}(\varphi) = \sum_{t=0}^{T} \Big(Q^\pi_\varphi(s_t, a_t) - \big(r_t + \gamma Q^\pi_\varphi(s_{t+1}, a_{t+1})\big)\Big)^2,$$

You can even estimate the optimal Q function by temporal difference.

$$L_{\mathrm{TD}}(\psi) = \sum_{t=0}^{T} \Big(Q^*_\psi(s_t, a_t) - \big(r_t + \gamma \max_{a'} Q^*_\psi(s_{t+1}, a')\big)\Big)^2.$$
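
Putting the two estimation schemes side by side, the sketch below fits a small value network Vϕ on one synthetic trajectory, computing both the Monte Carlo reward-to-go targets and the TD targets. The network architecture, optimizer, and made-up data are illustrative assumptions, not a prescribed setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
gamma = 0.99

# A synthetic trajectory: state vectors s_0..s_T and rewards r_0..r_{T-1} (illustrative data).
T, state_dim = 8, 4
states = torch.randn(T + 1, state_dim)
rewards = torch.randn(T)

# Monte Carlo targets: reward-to-go R_hat_t = sum_{l >= t} gamma^(l-t) * r_l
rtg = torch.zeros(T)
running = 0.0
for t in reversed(range(T)):
    running = rewards[t] + gamma * running
    rtg[t] = running

V = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))  # value network V_phi
opt = torch.optim.Adam(V.parameters(), lr=1e-3)

for step in range(200):
    v = V(states).squeeze(-1)                       # V_phi(s_0), ..., V_phi(s_T)

    # Monte Carlo loss: (V_phi(s_t) - R_hat_t)^2
    loss_mc = ((v[:-1] - rtg) ** 2).mean()

    # TD loss: (V_phi(s_t) - (r_t + gamma * V_phi(s_{t+1})))^2, with a detached target
    td_target = rewards + gamma * v[1:].detach()
    loss_td = ((v[:-1] - td_target) ** 2).mean()

    # Train on either loss; here we take gradient steps on the TD loss.
    opt.zero_grad()
    loss_td.backward()
    opt.step()

print(loss_mc.item(), loss_td.item())
```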