Classifier which predicts the probability of a label given its features, assuming each feature is an independent effect of the label.
$$ \Pr[Y, F_1, \dots, F_n] = \Pr[Y] \prod_{i=1}^n\Pr[F_i \space | \space Y] \\
\hat{y}=\argmax_{y} \Pr[y \space | \space f_1, \dots, f_n] = \argmax _{y}\frac{\Pr[y, f_1, \dots, f_n] }{\sum _{y_i} \Pr[y_i, f_1, \dots, f_n]}
$$
Size of the Model. The joint probability distribution contains $|Y| \cdot |F|^n$ values. As for parameters, $|Y|$ are required for $\Pr[Y]$ and $n \cdot (|F| \cdot |Y|)$ are required for all $n$ conditional probability distributions $\Pr[F_i \space | \space Y]$.
The parameters $\theta$ of the model refer to the prior over the labels $\Pr[Y]$ and the observation model $\Pr[F_i \space | \space Y]$ across each feature, which are learned from training data counts.
How can we estimate $\theta$ based on training data?
Method for estimating the parameters $\theta$ of the unknown distribution $\Pr[ Y \space | \space F_1, \dots, F_n]$ given training data. This estimate, known as the maximum likelihood estimate, arises from maximizing the likelihood of the observed training data.
$$ \theta_{ML} = \argmax_{\theta} \prod_{i} \Pr_{\theta}[X_i] = \argmax_{\theta} \prod_{(y^i, f_1^i, \dots, f^i_n)}\Pr_\theta [y^i, f_1^i, \dots, f^i_n] $$
This is often condensed by the notion of likelihood $\mathcal{L}(\theta)$ which represents the likelihood of observing the training data given our current model.
$$ \mathcal{L}(\theta) = \Pr[\text{All $X_i$ Occurring]=}\prod _{i} \Pr[X_i] \\ \theta _{ML} = \argmax _{\theta} \mathcal{L}(\theta) $$
Solving this optimization problem, under the assumption that training data is i.i.d, produces the following estimate distribution.
$$ \Pr_{{ML}}[y^i, f_1^i, \dots, f_n^i]=\frac{\text{count$(y^i, f_1^i, \dots, f_n^i)$}}{\text{total number of training points}}=\frac{c(y^i, f_1^i, \dots, f_n^i)}{N} \\ \Pr{ML} [x] = \frac{c(x)}{N} $$
Unseen examples during training lead to our model generalizing poorly. Laplace smoothing amends this by “smoothing” the learnt distribution.