Many tasks in machine learning hinge on the problem of finding a hypothesis $h(x)$ within a hypothesis space $\mathcal{H}$ that most closely approximates a target function $g(x)$, given a dataset of input-output pairs $(x_i, g(x_i))$.
Typical supervised learning models are characterized by empirical risk minimization, meaning they are optimized to fit a known training dataset in the expectation that this performance will carry over to an unseen, held-out dataset from the same distribution, by the law of large numbers.
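A minimal sketch of this idea, assuming an illustrative affine hypothesis space and squared loss (neither is prescribed by the notes): fit by minimizing the average loss on a training set, then check that the held-out risk is close to the training risk.

```python
# Empirical risk minimization sketch: minimize average squared loss on a
# training set, then compare against the risk on held-out data.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: g(x) = 2x + 1 plus noise, split into train / held-out.
x = rng.uniform(-1, 1, size=200)
y = 2 * x + 1 + rng.normal(scale=0.1, size=200)
x_train, y_train = x[:100], y[:100]
x_test, y_test = x[100:], y[100:]

# Empirical risk minimizer over affine hypotheses h(x) = w*x + b,
# under squared loss (closed form via least squares).
A = np.column_stack([x_train, np.ones_like(x_train)])
w, b = np.linalg.lstsq(A, y_train, rcond=None)[0]

def risk(xs, ys):
    return np.mean((w * xs + b - ys) ** 2)

print("training risk:", risk(x_train, y_train))
print("held-out risk:", risk(x_test, y_test))  # similar, by the law of large numbers
```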
The fundamental tradeoff is that, with limited training data, the empirically optimal hypothesis $h^*$ may not truly approximate $g$ very well. More precisely, the tradeoff is between bias (the average difference between the predictions of $h^*$ and $g$) and variance (how much the predictions of $h^*$ vary across different training samples drawn from the same distribution).
To reduce variance, we can limit the hypothesis space by simplifying our model, accepting a poorer fit in exchange for better generalization. We can also reduce variance by regularizing our predictions through techniques like smoothing or pruning.
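A minimal sketch of the tradeoff, assuming a noisy sinusoidal target and polynomial hypothesis spaces of varying degree (both choices are illustrative): bias and variance are estimated by refitting on many resampled training sets and comparing predictions at fixed test points.

```python
# Empirical bias-variance estimate: simpler hypothesis spaces (low degree)
# tend to show higher bias and lower variance; richer ones the reverse.
import numpy as np

rng = np.random.default_rng(0)
g = np.sin                        # target function g(x)
x_test = np.linspace(0, 2 * np.pi, 50)

for degree in (1, 3, 7):
    preds = []
    for _ in range(200):          # independent training samples from the same distribution
        x = rng.uniform(0, 2 * np.pi, size=15)
        y = g(x) + rng.normal(scale=0.3, size=15)
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x_test))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - g(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree {degree}: bias^2 ~ {bias2:.3f}, variance ~ {variance:.3f}")
```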
A measure of the expected uncertainty of a random variable $X$, given by its distribution, in bits.
$$ H(\Pr[X])=H(\begin{bmatrix} p_1 & \dots & p_n \end{bmatrix})=-\sum _{i=1}^np_i\log_2 p_i $$
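A small helper implementing the formula above (the function name is illustrative):

```python
# Entropy of a discrete distribution, in bits.
import math

def entropy(probs):
    """H([p_1, ..., p_n]) = -sum_i p_i * log2(p_i), with 0*log(0) taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin is maximally uncertain
print(entropy([1.0, 0.0]))   # 0.0 bits: a certain outcome carries no uncertainty
print(entropy([0.9, 0.1]))   # ~0.469 bits
```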
Tree-like model which partitions the input space, representing an underlying function.
Each internal node tests a certain categorical feature in the dataset. In principle, decisions need not be deterministic at every node.
A tree that simply enumerates a truth table is a poor representation; we would hope for our decision trees to be much smaller than the truth table!
In perceptrons, each feature contributes independently of the others. The hierarchical structure of decision trees, by contrast, intrinsically encodes dependence between features.
Instead of searching the entire hypothesis space for a decision tree, we greedily search for the best tree of depth $1$ (a stump) at each node, limiting the hypothesis space at each step.
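A minimal sketch of this greedy procedure, assuming categorical features and an information-gain splitting criterion; the dataset and names below are illustrative, not part of the notes.

```python
# Greedy tree construction: at each node, consider only depth-1 trees
# (stumps), pick the feature with the highest information gain, and
# recurse on the resulting partitions.
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting on `feature`."""
    gain = entropy(labels)
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

def build_tree(rows, labels, features):
    # Base cases: pure node, or no features left -> return majority label.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: best depth-1 tree (stump) at this node.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = build_tree(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [f for f in features if f != best],
        )
    return tree

# Tiny hypothetical dataset with categorical features.
rows = [
    {"hungry": "yes", "raining": "no"},
    {"hungry": "yes", "raining": "yes"},
    {"hungry": "no",  "raining": "no"},
    {"hungry": "no",  "raining": "yes"},
]
labels = ["wait", "wait", "leave", "leave"]
print(build_tree(rows, labels, ["hungry", "raining"]))
# e.g. {'hungry': {'yes': 'wait', 'no': 'leave'}}
```

Because each step only compares stumps, the search is much cheaper than enumerating all trees, at the cost of possibly missing the globally smallest tree.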