Many tasks in machine learning hinge on the problem of finding a hypothesis $h(x)$ within a hypothesis space $\mathcal{H}$ that most closely approximates a target function $g(x)$, given a dataset of input-output pairs $(x_i, g(x_i))$.
Typical supervised learning models are characterized by empirical risk minimization, meaning they are optimized to fit a known training dataset in the expectation that this performance will carry over to an unseen, held-out dataset from the same distribution, by the law of large numbers.
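A minimal sketch of this idea, assuming an illustrative affine hypothesis space and squared loss (neither is prescribed by the notes): fit by minimizing the average loss on a training set, then check that the held-out risk is close to the training risk.

```python
# Empirical risk minimization sketch: minimize average squared loss on a
# training set, then compare against the risk on held-out data.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: g(x) = 2x + 1 plus noise, split into train / held-out.
x = rng.uniform(-1, 1, size=200)
y = 2 * x + 1 + rng.normal(scale=0.1, size=200)
x_train, y_train = x[:100], y[:100]
x_test, y_test = x[100:], y[100:]

# Empirical risk minimizer over affine hypotheses h(x) = w*x + b,
# under squared loss (closed form via least squares).
A = np.column_stack([x_train, np.ones_like(x_train)])
w, b = np.linalg.lstsq(A, y_train, rcond=None)[0]

def risk(xs, ys):
    return np.mean((w * xs + b - ys) ** 2)

print("training risk:", risk(x_train, y_train))
print("held-out risk:", risk(x_test, y_test))  # similar, by the law of large numbers
```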
The fundamental tradeoff is that, with limited training data, the empirically optimal hypothesis $h^*$ may not truly approximate $g$ very well. More precisely, the tradeoff is between bias (the average difference between the predictions of $h^*$ and $g$) and variance (how much the predictions of $h^*$ vary across different training samples drawn from the same distribution).
To reduce variance, we can limit the hypothesis space by simplifying our model, accepting a poorer fit in exchange for better generalization. We can also reduce variance by regularizing our predictions through techniques like smoothing or pruning.
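A minimal sketch of the tradeoff, assuming a noisy sinusoidal target and polynomial hypothesis spaces of varying degree (both choices are illustrative): bias and variance are estimated by refitting on many resampled training sets and comparing predictions at fixed test points.

```python
# Empirical bias-variance estimate: simpler hypothesis spaces (low degree)
# tend to show higher bias and lower variance; richer ones the reverse.
import numpy as np

rng = np.random.default_rng(0)
g = np.sin                        # target function g(x)
x_test = np.linspace(0, 2 * np.pi, 50)

for degree in (1, 3, 7):
    preds = []
    for _ in range(200):          # independent training samples from the same distribution
        x = rng.uniform(0, 2 * np.pi, size=15)
        y = g(x) + rng.normal(scale=0.3, size=15)
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x_test))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - g(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree {degree}: bias^2 ~ {bias2:.3f}, variance ~ {variance:.3f}")
```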
A measure of the expected uncertainty of a random variable $X$, given by its distribution, in bits.
$$ H(\Pr[X])=H(\begin{bmatrix} p_1 & \dots & p_n \end{bmatrix})=-\sum _{i=1}^np_i\log_2 p_i $$
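A small helper implementing the formula above (the function name is illustrative):

```python
# Entropy of a discrete distribution, in bits.
import math

def entropy(probs):
    """H([p_1, ..., p_n]) = -sum_i p_i * log2(p_i), with 0*log(0) taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin is maximally uncertain
print(entropy([1.0, 0.0]))   # 0.0 bits: a certain outcome carries no uncertainty
print(entropy([0.9, 0.1]))   # ~0.469 bits
```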
Tree-like model which partitions the input space, representing an underlying function.
Each internal node tests a certain categorical feature in the dataset. In principle, decisions need not be deterministic at every node.
A tree that simply enumerates a truth table is a poor representation; we would hope for our decision trees to be much smaller than the truth table!
In perceptrons, each feature contributes independently of the others. The hierarchical structure of decision trees, by contrast, intrinsically encodes dependence between features.
Instead of searching the entire hypothesis space for a decision tree, we greedily search for the best tree of depth $1$ (a stump) at each node, limiting the hypothesis space at each step.
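A minimal sketch of this greedy procedure, assuming categorical features and an information-gain splitting criterion; the dataset and names below are illustrative, not part of the notes.

```python
# Greedy tree construction: at each node, consider only depth-1 trees
# (stumps), pick the feature with the highest information gain, and
# recurse on the resulting partitions.
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting on `feature`."""
    gain = entropy(labels)
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

def build_tree(rows, labels, features):
    # Base cases: pure node, or no features left -> return majority label.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: best depth-1 tree (stump) at this node.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = build_tree(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [f for f in features if f != best],
        )
    return tree

# Tiny hypothetical dataset with categorical features.
rows = [
    {"hungry": "yes", "raining": "no"},
    {"hungry": "yes", "raining": "yes"},
    {"hungry": "no",  "raining": "no"},
    {"hungry": "no",  "raining": "yes"},
]
labels = ["wait", "wait", "leave", "leave"]
print(build_tree(rows, labels, ["hungry", "raining"]))
# e.g. {'hungry': {'yes': 'wait', 'no': 'leave'}}
```

Because each step only compares stumps, the search is much cheaper than enumerating all trees, at the cost of possibly missing the globally smallest tree.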