Markov Decision Processes

A Markov Decision Process (MDP) is a model of a decision-making problem in which outcomes are uncertain.

State Space $S$. Set of possible states/outcomes with a unique start state $s_0$ and possibly a subset of terminal states.

Action Set $A$. Set of possible actions.

Transition Function $T(s' ; \space a, s) = \Pr[s' \space | \space s, a]$. Probability that an outcome $s' \in S$ occurs after taking action $a \in A$ from state $s \in S$.

Reward Function $R(s' ; \space a, s)=R(s, a, s')$. The reward $R \in \mathbb{R}$ received for taking action $a \in A$ in state $s \in S$ and arriving in state $s' \in S$.
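
As a concrete illustration, here is a minimal sketch of these four components in Python; the state names, actions, probabilities, and rewards below are made up for the example, not taken from these notes.

```python
STATES = ["s0", "s1", "terminal"]   # state space S, with start state s0
ACTIONS = ["stay", "go"]            # action set A

# Transition function: T[(s, a)] lists (s', Pr[s' | s, a]) pairs.
T = {
    ("s0", "stay"): [("s0", 0.9), ("s1", 0.1)],
    ("s0", "go"):   [("s1", 0.8), ("s0", 0.2)],
    ("s1", "stay"): [("s1", 1.0)],
    ("s1", "go"):   [("terminal", 1.0)],
}

# Reward function R(s, a, s'): here, +1 for reaching the terminal state.
def R(s, a, s_next):
    return 1.0 if s_next == "terminal" else 0.0
```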

Such a model is said to be Markovian, meaning that the probability of an outcome depends only on the state and action immediately preceding it.

$$ \Pr[S_{t+1} = s \space | \space S_t = s_t, A_t =a_t, \dots , S_0=s_0] = \Pr[S_{t+1} = s \space | \space S_t = s_t, A_t = a_t] $$
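In code, the Markov property means a next-state sampler may be handed the whole history but only ever reads the most recent state and action. The sketch below (with an illustrative toy transition table) makes this explicit.

```python
import random

# The sampler receives the full history of (state, action) pairs but uses
# only the last one, so Pr[S_{t+1} | history] = Pr[S_{t+1} | S_t, A_t].
T = {
    ("s0", "go"): [("s1", 0.8), ("s0", 0.2)],
    ("s1", "go"): [("terminal", 1.0)],
}

def sample_next_state(history):
    s, a = history[-1]  # everything before the last step is ignored
    outcomes = T[(s, a)]
    return random.choices([s2 for s2, _ in outcomes],
                          weights=[p for _, p in outcomes])[0]

# Both calls draw from the same distribution: only (S_t, A_t) matters.
print(sample_next_state([("s0", "go")]))
print(sample_next_state([("s1", "go"), ("s0", "go")]))
```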

Furthermore, these models often include living rewards, which are given at every time step (e.g. a positive reward for an agent's survival, or a negative one to encourage finishing quickly), and large rewards, which are given for reaching a particular outcome.
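For instance, in the common gridworld setup a small negative living reward nudges the agent to finish quickly, while large rewards mark terminal outcomes; the specific values below are the usual textbook defaults, not values from these notes.

```python
LIVING_REWARD = -0.04  # paid at every step; negative here to discourage wandering

def R(s, a, s_next):
    if s_next == "goal":
        return +1.0           # large reward for reaching the desired outcome
    if s_next == "pit":
        return -1.0           # large penalty for a bad terminal outcome
    return LIVING_REWARD      # living reward on every other transition
```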

Policies

An optimal policy is a function $\pi^* : S \rightarrow A$ which maps each state to the action that maximizes expected utility.

$$ \pi ^* = \argmax _{\pi \space : \space S\rightarrow A} \mathbb{E}\big[ U[s_0, \pi (s_0), s_1, \pi (s_1), \dots] \big] $$

A policy is said to be stationary if it is not time-varying (i.e. it is not a function of the current time step) and non-stationary otherwise.
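One standard way to obtain an optimal stationary policy is value iteration; the sketch below assumes a discount factor $\gamma$, reuses the illustrative toy MDP from above, and is one method among several, not the one these notes prescribe.

```python
GAMMA = 0.9  # discount factor (an assumption; not introduced in the notes above)
STATES = ["s0", "s1", "terminal"]
ACTIONS = ["stay", "go"]
T = {
    ("s0", "stay"): [("s0", 0.9), ("s1", 0.1)],
    ("s0", "go"):   [("s1", 0.8), ("s0", 0.2)],
    ("s1", "stay"): [("s1", 1.0)],
    ("s1", "go"):   [("terminal", 1.0)],
}

def R(s, a, s_next):
    return 1.0 if s_next == "terminal" else 0.0

def q_value(s, a, V):
    # Expected utility of the Q-state (s, a) under current value estimates V.
    return sum(p * (R(s, a, s2) + GAMMA * V[s2]) for s2, p in T[(s, a)])

# Repeatedly apply the Bellman optimality update until values stabilize.
V = {s: 0.0 for s in STATES}
for _ in range(100):
    V = {s: 0.0 if s == "terminal"
         else max(q_value(s, a, V) for a in ACTIONS)
         for s in STATES}

# Extract pi*(s) = argmax_a Q(s, a): a stationary policy, since it depends
# only on the state, never on the time step.
pi_star = {s: max(ACTIONS, key=lambda a: q_value(s, a, V))
           for s in STATES if s != "terminal"}
print(pi_star)  # {'s0': 'go', 's1': 'go'}
```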

Q-States

A Q-state $(s, a)$ is an artificial node in the search tree used to compute the expected utility over all outcomes $s' \in S$ of taking action $a$ from state $s$.
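Concretely, the value of a Q-state is the probability-weighted average over outcomes of reward plus downstream value; the discount factor and the toy numbers below are illustrative assumptions.

```python
GAMMA = 0.9
outcomes = [("s1", 0.8), ("s0", 0.2)]  # Pr[s' | s=s0, a=go], made up for the example
V = {"s0": 0.5, "s1": 1.0}             # assumed values of the successor states

def R(s, a, s_next):
    return 0.0  # no immediate reward in this toy example

# Q(s, a) = sum over s' of Pr[s' | s, a] * (R(s, a, s') + gamma * V(s'))
Q = sum(p * (R("s0", "go", s2) + GAMMA * V[s2]) for s2, p in outcomes)
print(Q)  # 0.8 * 0.9 * 1.0 + 0.2 * 0.9 * 0.5 = 0.81
```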
