In Reinforcement Learning (RL), policies and transition functions play central roles in guiding an agent's decision-making across states to maximize reward. This blog post digs into the mechanics of stochastic and deterministic policies, the nature of transition functions, and how together they shape an agent's strategy. We will explore the foundational mathematics behind these concepts, focusing in particular on the Bellman equations, which provide a recursive decomposition of the decision-making process.
Policies in Reinforcement Learning
A policy, denoted as π, is a strategy that defines the action a an agent takes in a given state s. There are two types of policies:
- Stochastic Policy: Here, the policy π(a∣s) is a probability distribution over actions given the state, so the action the agent takes is sampled at random from this distribution.
- Deterministic Policy: In contrast, a deterministic policy always selects the same action for a given state, i.e., a = π(s). The action is fully determined by the state, with no randomness involved. A short sketch contrasting the two follows this list.
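To make the distinction concrete, here is a minimal Python sketch with invented state and action names: the stochastic policy samples an action from a distribution, while the deterministic policy always returns the same action for a given state.

```python
import random

# Stochastic policy: a probability distribution over actions for each state.
stochastic_policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(policy, state):
    """Sample an action a ~ π(a|s)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

# Deterministic policy: a single fixed action per state, a = π(s).
deterministic_policy = {"s0": "right", "s1": "left"}

print(sample_action(stochastic_policy, "s0"))  # "left" or "right", chosen at random
print(deterministic_policy["s0"])              # always "right"
```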
Transition Functions in Reinforcement Learning
The transition function describes how the environment moves from one state to the next when the agent takes an action. It comes in two forms:
- Deterministic Transition: Here, the next state s′ is uniquely determined by the current state s and the action a.
- Stochastic Transition: In this case, the next state s′ is drawn from a probability distribution that depends on the current state and action, reflecting inherent randomness in the environment. A minimal representation of both forms is sketched below.
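As a quick sketch (again with made-up states and numbers), a deterministic transition can be stored as a plain mapping from (state, action) to the next state, while a stochastic transition maps each (state, action) pair to a distribution P(s′∣s,a):

```python
import random

# Deterministic transition: (state, action) uniquely determines the next state.
T_det = {("s0", "right"): "s1", ("s1", "right"): "s2"}

# Stochastic transition: (state, action) gives a distribution over next states, P(s'|s, a).
T_stoch = {
    ("s0", "right"): {"s1": 0.9, "s0": 0.1},  # the intended move succeeds 90% of the time
    ("s1", "right"): {"s2": 0.8, "s1": 0.2},
}

def sample_next(state, action):
    """Draw s' ~ P(s'|s, a) from the stochastic model."""
    dist = T_stoch[(state, action)]
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]
```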
Putting Them Together Using a Grid Example
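Picture a tiny 2×2 grid world in which the agent starts in the top-left cell and wants to reach the goal in the bottom-right cell. The policy decides which way the agent tries to move in each cell; the transition function decides where that move actually lands. With deterministic transitions the agent always ends up where it intended, while with stochastic ("slippery") transitions it occasionally slides to a different cell. The sketch below puts these pieces together; the grid layout, slip probability, and reward values are assumptions chosen purely for illustration.

```python
import random

# A 2x2 grid world: states are (row, col); the goal is the bottom-right cell.
GOAL = (1, 1)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def move(state, action):
    """Deterministic transition: apply the action, staying put when hitting a wall."""
    r, c = state
    dr, dc = ACTIONS[action]
    return (max(0, min(1, r + dr)), max(0, min(1, c + dc)))

def slippery_move(state, action, slip=0.2):
    """Stochastic transition: with probability `slip`, a random action is executed instead."""
    if random.random() < slip:
        action = random.choice(list(ACTIONS))
    return move(state, action)

def policy(state):
    """A deterministic policy: head right, then down, toward the goal."""
    return "right" if state[1] < 1 else "down"

# Roll out one episode from the top-left corner under the slippery dynamics.
state, total_reward = (0, 0), 0.0
while state != GOAL:
    state = slippery_move(state, policy(state))
    total_reward += 1.0 if state == GOAL else -0.1  # assumed reward scheme
print("return:", total_reward)
```

Swapping `slippery_move` for `move` makes the environment deterministic, and replacing `policy` with a sampled action would make the behavior stochastic on the policy side as well.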

Value Functions and Bellman Equations
How policies and transitions work together in RL is captured by the Bellman equations. Let's begin by defining the key components:
- Value Function Vπ(s): This function represents the expected return (the cumulative discounted reward) when starting from state s and following policy π.
- Action-Value Function qπ(s,a): This describes the expected return starting from state s, taking action a, and thereafter following policy π.
The relationship between these two is given by the following expectation:
Vπ(s) = ∑ₐ π(a∣s) qπ(s,a)
Here, qπ(s,a) can be further decomposed using the Bellman equation for action values (a short policy-evaluation sketch applying both equations follows the definitions below):
qπ(s,a) = Rₐ(s) + γ ∑ₛ′ P(s′∣s,a) Vπ(s′)
Where:
- Rₐ(s) is the (expected) immediate reward received after taking action a in state s.
- γ is the discount factor, a number between 0 and 1 that weights the importance of future rewards relative to immediate ones.
- P(s′∣s,a) is the probability of transitioning to state s′ from state s after taking action a.
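These equations translate almost line-for-line into iterative policy evaluation. The sketch below runs it on a tiny three-state chain; the states, rewards, transition probabilities, and the stochastic policy are all made-up numbers used only to show the update Vπ(s) = ∑ₐ π(a∣s) [Rₐ(s) + γ ∑ₛ′ P(s′∣s,a) Vπ(s′)] in code.

```python
# Iterative policy evaluation on a tiny three-state chain (all numbers are illustrative).
STATES = ["s0", "s1", "s2"]   # s2 is terminal
ACTIONS = ["stay", "go"]
GAMMA = 0.9

# π(a|s): a stochastic policy.
pi = {s: {"stay": 0.3, "go": 0.7} for s in STATES}

# R_a(s): immediate reward for taking action a in state s.
R = {(s, a): 0.0 for s in STATES for a in ACTIONS}
R[("s1", "go")] = 1.0   # moving toward the terminal state pays off

# P(s'|s, a): stochastic transition probabilities.
P = {
    ("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 0.9, "s0": 0.1},
    ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s2": 0.9, "s1": 0.1},
    ("s2", "stay"): {"s2": 1.0}, ("s2", "go"): {"s2": 1.0},
}

V = {s: 0.0 for s in STATES}
while True:
    delta = 0.0
    for s in STATES:
        if s == "s2":        # the terminal state keeps a value of 0
            continue
        # V_π(s) = Σ_a π(a|s) [ R_a(s) + γ Σ_s' P(s'|s,a) V_π(s') ]
        new_v = sum(
            pi[s][a] * (R[(s, a)] + GAMMA * sum(p * V[nxt] for nxt, p in P[(s, a)].items()))
            for a in ACTIONS
        )
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < 1e-8:
        break

print(V)   # converged state values under π
```

Because the Bellman update is a contraction for γ < 1, the loop is guaranteed to converge to Vπ.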
Conclusion
By understanding the intricate relationship between policies, transition functions, and value functions as described by the Bellman equations, one can grasp the foundational principles governing decision-making in reinforcement learning. This framework not only helps in formulating strategies but also in predicting future states and rewards, ultimately guiding the agent towards optimal behavior in complex environments.
Feel free to delve deeper into specific algorithms and examples in subsequent posts to see these principles in action. Happy learning!