When the transition and reward functions of a Markov Decision Process are unknown, traditional methods for finding a policy become infeasible. Reinforcement learning instead attempts to learn or evaluate a policy by exploring the MDP across training episodes.
Main Idea. An agent takes actions in an environment defined by the underlying MDP and learns from the rewards it receives in observed samples of outcomes.
A sample is a tuple $(s, a, s', r)$. An agent collects many samples over a single episode, which is the sequence of samples from the initial state to a terminal state.
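As a concrete illustration, the sketch below collects one episode's worth of samples under a fixed policy. The environment interface (`env.reset`, `env.step`, `env.is_terminal`) is a hypothetical stand-in for whatever simulator is available, not something defined in these notes.

```python
def collect_episode(env, policy):
    """Roll out one episode under `policy`, returning a list of samples."""
    episode = []
    s = env.reset()                        # initial state
    while not env.is_terminal(s):
        a = policy(s)                      # act according to the policy
        s_next, r = env.step(s, a)         # observe outcome and reward
        episode.append((s, a, s_next, r))  # one (s, a, s', r) sample
        s = s_next
    return episode
```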
In model-based learning, the agent learns an approximate model of the transition and reward functions from its experiences, then derives a policy using traditional MDP algorithms.
$$ \hat{T}(s,a,s') = \frac{\text{Count of $(s,a,s')$ Observed}}{\text{Count of $(s,a)$ Observed}} $$
$$ \hat{R}(s,a,s') = \text{Reward of $(s, a, s')$ Observed} $$
Formally, we say that $\hat{T}$ is generated by normalizing the counts of observed transitions $(s, a, s')$ by the counts of $(s, a)$.
The law of large numbers dictates that $\hat{T} \rightarrow T$ as more samples are collected.
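Assuming episodes are stored as lists of $(s, a, s', r)$ tuples, a minimal sketch of these counting-based estimates might look like the following; the function name and dictionary representation are illustrative choices, not part of the notes.

```python
from collections import Counter

def estimate_model(episodes):
    """Estimate T_hat and R_hat from observed (s, a, s', r) samples.

    T_hat[(s, a, s')] = count(s, a, s') / count(s, a)
    R_hat[(s, a, s')] = observed reward for (s, a, s')
    Rewards are assumed deterministic, matching the formula above.
    """
    sas_counts = Counter()  # counts of (s, a, s') triples
    sa_counts = Counter()   # counts of (s, a) pairs
    R_hat = {}              # observed rewards

    for episode in episodes:
        for (s, a, s_next, r) in episode:
            sas_counts[(s, a, s_next)] += 1
            sa_counts[(s, a)] += 1
            R_hat[(s, a, s_next)] = r

    T_hat = {
        (s, a, s_next): count / sa_counts[(s, a)]
        for (s, a, s_next), count in sas_counts.items()
    }
    return T_hat, R_hat
```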
Instead of building a model of the MDP, model-free learning learns and evaluates policies directly from samples of the environment.
In passive reinforcement learning, the agent follows a given fixed policy $\pi$, and the task is to evaluate it by learning the values $V^\pi(s)$.
Act according to $\pi$ across many episodes and estimate each state's value as the average of the returns (total discounted rewards) observed from that state.
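A minimal sketch of this averaging scheme (commonly known as direct evaluation), assuming episodes are lists of $(s, a, s', r)$ tuples and a discount factor `gamma`:

```python
def direct_evaluation(episodes, gamma):
    """Estimate V^pi(s) as the average discounted return observed from s."""
    returns = {}  # state -> list of observed returns

    for episode in episodes:
        # Walk the episode backwards, accumulating the discounted return.
        G = 0.0
        for (s, a, s_next, r) in reversed(episode):
            G = r + gamma * G
            returns.setdefault(s, []).append(G)

    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```

By the same law-of-large-numbers argument as before, these averages converge to $V^\pi(s)$ as more episodes are collected.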