Classification models which use a set of parameters and features of the input to learn decision boundaries: hyperplanes within the latent space that divide the classes of the dataset. A dataset $\mathcal{D}$ is said to be linearly separable if there exist such hyperplanes that do so perfectly.
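A quick sanity check of the definition, as a minimal NumPy sketch. The dataset, labels, and hyperplane $(\mathbf{w}, b)$ below are made up purely for illustration.

```python
import numpy as np

# Made-up 2D dataset with labels in {-1, +1}, purely for illustration.
X = np.array([[2.0, 1.0], [3.0, 2.5], [-1.0, -2.0], [-2.5, -0.5]])
y = np.array([+1, +1, -1, -1])

# A candidate separating hyperplane w^T x + b = 0.
w, b = np.array([1.0, 1.0]), 0.0

# (w, b) separates the dataset perfectly if every point lies strictly on the
# side of the hyperplane matching its label.
print(np.all(y * (X @ w + b) > 0))  # True: this dataset is linearly separable
```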
Classification model which uses a weight vector $\mathbf{w}$ to model a decision boundary.
$$ \hat{y} =H(\mathbf{w}^T \mathbf{x}) $$
Typically, we think of each datapoint $\mathbf{x}$ as a vector of features. The notation $\mathbf{f}(x)$ makes this explicit, distinguishing the raw input $x$ from its feature vector. Additionally, it is conventional to assign an activation of exactly $0$ to the positive class, i.e., $H(0) = 1$.
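A minimal sketch of the resulting classifier, assuming NumPy and the $H(0) = 1$ convention above. The weights and features are made-up values for illustration.

```python
import numpy as np

def predict(w, x):
    """Binary linear classifier y_hat = H(w^T x); by the convention above,
    an activation of exactly 0 is assigned to the positive class."""
    return 1 if w @ x >= 0 else 0

# Made-up weights and feature vector.
w = np.array([2.0, -1.0])
x = np.array([0.5, 1.0])
print(predict(w, x))  # activation is exactly 0, so we output the positive class: 1
```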
However, our decision boundary always passes through the origin! The addition of a bias term $b$ lets us shift the boundary away from the origin, making the model more general.
$$ \hat{y} = H(\mathbf{w}^T \mathbf{x}+b) \hspace{15pt} \text{ or } \hspace{15pt} \hat{y} = H(\mathbf{w}^T \mathbf{x}) \text{ with }\mathbf{w} \gets \begin{pmatrix} \mathbf{w} \\ b\end{pmatrix} \text{ and } \mathbf{x} \gets \begin{pmatrix} \mathbf{x} \\ 1\end{pmatrix} $$
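A quick check that the two formulations agree, using made-up values for $\mathbf{w}$, $b$, and $\mathbf{x}$.

```python
import numpy as np

# Made-up parameters, just to check the two formulations agree.
w = np.array([1.5, -2.0])
b = 0.75
x = np.array([0.2, 0.4])

# Explicit bias: activation = w^T x + b
act_bias = w @ x + b

# Stacking trick: append b to w and 1 to x, then use the bias-free form.
w_aug = np.append(w, b)    # (w, b)
x_aug = np.append(x, 1.0)  # (x, 1)
act_aug = w_aug @ x_aug

print(np.isclose(act_bias, act_aug))  # True: identical activations, identical predictions
```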
Classification model which uses a weight vector $\mathbf{w}_y$ for each class $y$ to model decision boundaries that partition the latent space into regions, forming a kind of Voronoi diagram.
$$ \hat{y} = \argmax_{y} \mathbf{w}_y ^T \mathbf{x} $$
As before, we add a bias term $b_y$ to each activation for generality, and we can perform the same stacking trick to simplify the expression (if we want to).
$$ \hat{y} = \argmax_y \left( \mathbf{w}_y^T \mathbf{x} + b_y \right) $$
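A minimal sketch of the multiclass rule, with hypothetical weights $W$ (one row per class) and biases $b$ chosen only for illustration.

```python
import numpy as np

def predict_multiclass(W, b, x):
    """Multiclass linear classifier y_hat = argmax_y (w_y^T x + b_y).
    W holds one weight row per class, b one bias per class."""
    activations = W @ x + b   # one activation per class
    return int(np.argmax(activations))

# Made-up parameters: 3 classes over 2 features.
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
b = np.array([0.0, 0.1, 0.0])
x = np.array([0.3, 0.9])
print(predict_multiclass(W, b, x))  # class 1 wins with activation 1.0
```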