Naive Bayes

A classifier which predicts the probability of a label given a set of observed features, assuming each feature is an independent effect of the label (i.e., the features are conditionally independent given the label).

$$
\Pr[Y, F_1, \dots, F_n] = \Pr[Y] \prod_{i=1}^n \Pr[F_i \space | \space Y] \\
\hat{y} = \argmax_{y} \Pr[y \space | \space f_1, \dots, f_n] = \argmax_{y} \frac{\Pr[y, f_1, \dots, f_n]}{\sum_{y_i} \Pr[y_i, f_1, \dots, f_n]}
$$
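
To make the decision rule concrete, here is a minimal Python sketch of Naive Bayes inference. The label set, feature values, and probability tables (`prior`, `conditionals`) are invented purely for illustration and stand in for learned parameters.

```python
from math import prod

# Hypothetical learned parameters:
# prior[y] = Pr[Y = y]; conditionals[i][(f, y)] = Pr[F_i = f | Y = y].
prior = {"spam": 0.4, "ham": 0.6}
conditionals = [
    {("free", "spam"): 0.8, ("none", "spam"): 0.2,
     ("free", "ham"): 0.1,  ("none", "ham"): 0.9},    # Pr[F_1 | Y]
    {("!!!", "spam"): 0.7, ("none", "spam"): 0.3,
     ("!!!", "ham"): 0.05, ("none", "ham"): 0.95},    # Pr[F_2 | Y]
]

def predict(features):
    """Return the label maximizing Pr[y] * prod_i Pr[f_i | y]."""
    def joint(y):
        return prior[y] * prod(cond[(f, y)] for cond, f in zip(conditionals, features))
    # The normalizing sum over labels is the same for every y,
    # so it cancels in the argmax; compare unnormalized joints directly.
    return max(prior, key=joint)

print(predict(("free", "!!!")))  # -> "spam" with these made-up numbers
```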

Size of the Model. The joint probability distribution contains $|Y| \cdot |F|^n$ values. As for parameters, $|Y|$ are required for $\Pr[Y]$ and $n \cdot (|F| \cdot |Y|)$ are required for all $n$ conditional probability distributions $\Pr[F_i \space | \space Y]$.
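
For instance, with a binary label ($|Y| = 2$) and $n = 10$ binary features ($|F| = 2$), the full joint table holds $2 \cdot 2^{10} = 2048$ values, while the Naive Bayes model needs only $2 + 10 \cdot (2 \cdot 2) = 42$ parameters.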

The parameters $\theta$ of the model are the prior over the labels $\Pr[Y]$ and the observation model $\Pr[F_i \space | \space Y]$ for each feature, all of which are learned from counts in the training data.


Parameter Estimation

How can we estimate $\theta$ based on training data?

Maximum Likelihood Estimation

A method for estimating the parameters $\theta$ of the model (here, the joint distribution $\Pr_{\theta}[Y, F_1, \dots, F_n]$) from training data. The resulting estimate, known as the maximum likelihood estimate, is the choice of $\theta$ that maximizes the likelihood of the observed training data.

$$ \theta_{ML} = \argmax_{\theta} \prod_{i} \Pr_{\theta}[X_i] = \argmax_{\theta} \prod_{(y^i, f_1^i, \dots, f_n^i)} \Pr_{\theta}[y^i, f_1^i, \dots, f_n^i] $$

This is often written in terms of the likelihood $\mathcal{L}(\theta)$, which represents the probability of observing the training data under the model with parameters $\theta$.

$$ \mathcal{L}(\theta) = \Pr_{\theta}[\text{all } X_i \text{ occurring}] = \prod_{i} \Pr_{\theta}[X_i] \\ \theta_{ML} = \argmax_{\theta} \mathcal{L}(\theta) $$
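
Because the logarithm is monotonic, maximizing $\mathcal{L}(\theta)$ is equivalent to maximizing the log-likelihood, which turns the product over data points into a sum:

$$ \theta_{ML} = \argmax_{\theta} \log \mathcal{L}(\theta) = \argmax_{\theta} \sum_{i} \log \Pr_{\theta}[X_i] $$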

Solving this optimization problem, under the assumption that the training data is i.i.d., yields the following estimated distribution.

$$ \Pr_{ML}[y^i, f_1^i, \dots, f_n^i] = \frac{\text{count}(y^i, f_1^i, \dots, f_n^i)}{\text{total number of training points}} = \frac{c(y^i, f_1^i, \dots, f_n^i)}{N} \\ \Pr_{ML}[x] = \frac{c(x)}{N} $$
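
Applied to the Naive Bayes parameters themselves, the maximum likelihood estimates reduce to relative frequencies: $\Pr_{ML}[y] = c(y)/N$ and $\Pr_{ML}[f_i \space | \space y] = c(f_i, y)/c(y)$. The sketch below illustrates this with an invented toy training set; all names and numbers are hypothetical.

```python
from collections import Counter

# Hypothetical toy training set: each point is (label, (f_1, ..., f_n)).
data = [
    ("spam", ("free", "!!!")),
    ("spam", ("free", "none")),
    ("ham",  ("none", "none")),
    ("ham",  ("none", "!!!")),
]
N = len(data)
n_features = len(data[0][1])

# Prior: Pr_ML[y] = c(y) / N
label_counts = Counter(y for y, _ in data)
prior = {y: c / N for y, c in label_counts.items()}

# Conditionals: Pr_ML[f_i | y] = c(f_i, y) / c(y), one table per feature.
conditionals = []
for i in range(n_features):
    pair_counts = Counter((feats[i], y) for y, feats in data)
    conditionals.append({(f, y): c / label_counts[y] for (f, y), c in pair_counts.items()})

print(prior)            # {'spam': 0.5, 'ham': 0.5}
print(conditionals[0])  # {('free', 'spam'): 1.0, ('none', 'ham'): 1.0}
```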

Laplace Smoothing

Feature–label combinations never seen during training receive zero probability under maximum likelihood, which zeroes out the entire product and makes the model generalize poorly. Laplace smoothing amends this by “smoothing” the learned distribution.
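
One common form of the smoothed estimate (assuming a smoothing strength $k$, with $|X|$ the number of possible values of $x$) pretends every outcome was observed $k$ extra times:

$$ \Pr_{LAP, k}[x] = \frac{c(x) + k}{N + k|X|} $$

Setting $k = 0$ recovers the maximum likelihood estimate, while larger $k$ pushes the estimate toward the uniform distribution.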