Broadest Overview
In its broadest description, Principal Component Analysis (PCA) is a technique we use to reduce the dimensionality of data stored in a data matrix without losing too much information. Consider the following data,
First, we’ll demean this data. I’ll explain why we must do this later, but all this step does is ensure that the mean, the geometric “center” of the dataset, sits at the origin.
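A minimal sketch of this step, assuming the data lives in a NumPy array `X` with one datapoint per column (the layout described next):

```python
import numpy as np

# Hypothetical 2 x n data matrix: each column is one 2-dimensional datapoint.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 50)
X = np.vstack([t, 2 * t + rng.normal(0, 1, 50)])

# Demean (center): subtract the mean of each row so the dataset's center sits at the origin.
X_centered = X - X.mean(axis=1, keepdims=True)

print(X_centered.mean(axis=1))  # approximately [0, 0]
```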
We can think of each point in this translated, but equivalent, dataset of $n$ datapoints as a 2-dimensional column vector, all of them stacked together into a $2 \times n$ matrix. What PCA allows us to do is find the subspace (in this case, a line) onto which we can project each column while best capturing the information in the dataset.
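In symbols (writing $\mathbf{x}_k$ for the $k$-th demeaned datapoint, the notation used for the rest of this post):

$$
X = \begin{bmatrix} \vert & \vert & & \vert \\ \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_n \\ \vert & \vert & & \vert \end{bmatrix} \in \R^{2 \times n}.
$$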
If we project our data matrix onto this subspace, we can see that all of the linear structure of our dataset is captured and, in essence, all of the extraneous information (“noise”) is lost. This very useful application of PCA is called noise reduction.
However, our datapoints are still 2-dimensional! If we change to a basis for this subspace, i.e. the basis formed by our “principal component”, the 2-dimensional projection of our data is now captured by an equivalent 1-dimensional representation.
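To make this change of basis concrete, here is a small sketch (reusing the hypothetical `X_centered` from above, and assuming for the moment that we are handed a unit vector `w` pointing along the best-fit line; how to actually find it is the subject of the next section):

```python
import numpy as np

# Hypothetical centered 2 x n data matrix, as in the earlier sketch.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 50)
X = np.vstack([t, 2 * t + rng.normal(0, 1, 50)])
X_centered = X - X.mean(axis=1, keepdims=True)

# Assume, for illustration, that w is a unit vector spanning the 1-dimensional subspace.
w = np.array([1.0, 2.0])
w = w / np.linalg.norm(w)

coords_1d = w @ X_centered               # 1-dimensional coordinates in the basis {w}, shape (n,)
projection_2d = np.outer(w, coords_1d)   # the same projected points back in 2-dimensional coordinates
```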
SVD, Formal Attributes

On a more formal level, what PCA is really doing is solving the following minimization problem:
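One standard way to write this objective, using the notation of the next paragraph ($\mathbf{x}_k$ for the demeaned datapoints and $W$ for a matrix with $\ell$ orthonormal columns), is:

$$
\min_{\substack{W \in \R^{m \times \ell} \\ W^T W = I}} \; \sum_{k=1}^{n} \left\lVert \mathbf{x}_k - W W^T \mathbf{x}_k \right\rVert^2 .
$$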
In this formulation, we are saying that the subspace we are looking for is the one for which the error between each datapoint $\mathbf{x}_k$ in a dataset $X$ and its projection $WW^T \mathbf{x}_k$ is minimized. More specifically, to make the projection easy to compute, we are looking for a matrix $W$ whose columns form an orthonormal basis for a lower-dimensional subspace.
It turns out we can break this formulation up into two parts:
1. We want to minimize the error between the projection and the original dataset.
2. We want to maximize the variance of the dataset upon projection. Variance is a measure of how far, on average, each datapoint is from the mean. But remember, we demeaned our dataset, so the mean is now the origin! In this context, then, variance and distance from the origin are the same thing, and both are captured by the norms of the projected datapoints (the identity below makes this precise).
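To see why these are two sides of the same problem, note that $WW^T$ is an orthogonal projection, so each datapoint splits into two perpendicular pieces:

$$
\lVert \mathbf{x}_k \rVert^2 = \lVert \mathbf{x}_k - W W^T \mathbf{x}_k \rVert^2 + \lVert W W^T \mathbf{x}_k \rVert^2 .
$$

Summing over $k$, the left-hand side is fixed by the data, so minimizing the total error (the first term on the right) is exactly the same as maximizing the total squared norm of the projections (the second term), which is, up to a factor of $n$, the variance of the projected dataset.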
Recall that any matrix $X \in \R^{m \times n}$ has a (compact) singular value decomposition $X = U_r \Sigma_r V_r^T$. It turns out that when the data in our data matrix is stored as columns, the first $\ell$ columns of $U_r$ span this optimal subspace, with the first column being the most important of them all (a corollary of the SVD). If our data were stored as rows, we could take the transpose and show, equivalently, that the columns of $V_r$ span this optimal subspace. In either case, the vectors which span this optimal subspace are called the principal components of $X$.
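As a sketch of how this looks in code (assuming, as above, a demeaned matrix whose columns are the datapoints, and an illustrative target dimension `ell`):

```python
import numpy as np

# Hypothetical demeaned data matrix with one datapoint per column.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))
X_centered = X - X.mean(axis=1, keepdims=True)

# Compact SVD: X_centered = U_r @ np.diag(s) @ Vt_r.
U_r, s, Vt_r = np.linalg.svd(X_centered, full_matrices=False)

ell = 2                      # illustrative target dimension
W = U_r[:, :ell]             # first ell principal components (orthonormal columns)

coords = W.T @ X_centered    # ell-dimensional coordinates of each datapoint
approx = W @ coords          # the projected data, back in the original space
```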
Intuition

But why believe me? You could certainly sit through the logical deduction in a formal proof that shows this, but I think a more intuitive way of thinking about why $U_r$ contains the principal components is through the outer product form of the SVD. Recall, we can represent $X$ in the following way:
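The outer product form referred to here is the standard expansion of the compact SVD into rank-one pieces, writing $\mathbf{u}_i$, $\mathbf{v}_i$ for the columns of $U_r$, $V_r$ and $\sigma_1 \ge \dots \ge \sigma_r > 0$ for the singular values:

$$
X = U_r \Sigma_r V_r^T = \sum_{i=1}^{r} \sigma_i \, \mathbf{u}_i \mathbf{v}_i^T .
$$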
In this form, we can see that each column of $X$ can be represented as a linear combination of the columns of $U_r$. This implies that the fundamental structure encoding the information in $X$ lives entirely in $C(U_r)$, and, because the singular values are sorted in decreasing order, truncating $U_r$ to its first $\ell$ columns captures the $\ell$ most important pieces of this structure. Like before, we can show this for $V_r$ by looking at $X^T$.
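Concretely, reading off the $k$-th column of the outer product form above (with $(\mathbf{v}_i)_k$ denoting the $k$-th entry of $\mathbf{v}_i$):

$$
\mathbf{x}_k = \sum_{i=1}^{r} \sigma_i (\mathbf{v}_i)_k \, \mathbf{u}_i ,
$$

so every datapoint is a linear combination of the columns of $U_r$, with the $i$-th coefficient carrying a factor of $\sigma_i$.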
Application

My favorite thing about PCA is that it allows us to visualize clustering in high-dimensional data. If we think about a dataset of audio recordings, each recording can be represented as a vector whose entries are the amplitudes of the signal at each sampled time step. As you might imagine, these recordings can become very high dimensional, and, unlike with 2- or 3-dimensional vectors, it becomes far less clear which recordings are similar to one another and which aren’t.
If we stack these recordings so that each row is a 1-second-long recording, we construct the following matrix:
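The shape being described (assuming $n$ recordings, each sampled 44,000 times over its one second) is:

$$
S = \begin{bmatrix} \mathbf{s}_1^T \\ \mathbf{s}_2^T \\ \vdots \\ \mathbf{s}_n^T \end{bmatrix} \in \R^{n \times 44000},
$$

where $\mathbf{s}_k$ is the vector of sampled amplitudes of the $k$-th recording.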
After demeaning this matrix, we can use PCA to take each of these 44,000-dimensional vectors and project them onto the 2-dimensional subspace spanned by the first two principal components of $S$ (i.e. the first two columns of $V_r$). When we convert to the basis for this subspace, i.e. the basis formed by these principal components, we see that audio recordings with similar “profiles” (or waveforms) are clustered together.
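A sketch of this pipeline (with `S` standing in for a hypothetical recordings matrix, one recording per row, and matplotlib used only for the final scatter plot):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical recordings matrix: n recordings, 44,000 samples each (1 second of audio per row).
rng = np.random.default_rng(0)
S = rng.normal(size=(100, 44_000))

# Demean: subtract the mean recording so the dataset is centered at the origin.
S_centered = S - S.mean(axis=0, keepdims=True)

# With datapoints stored as rows, the principal components are the first columns of V_r,
# i.e. the first rows of Vt.
_, _, Vt = np.linalg.svd(S_centered, full_matrices=False)
pcs = Vt[:2]                    # first two principal components, shape (2, 44_000)

# 2-dimensional coordinates of each recording in the basis of these principal components.
coords = S_centered @ pcs.T     # shape (n, 2)

plt.scatter(coords[:, 0], coords[:, 1])
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.show()
```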
Further intuition. The reason this works so well in high dimensions is that every dimension in which distinct, featural information is stored provides a degree of freedom. We can visualize these degrees of freedom as slots in an ice tray: the higher the dimension, the more slots, and the lower the probability that two unrelated ice trays contain exactly the same amount of water in each slot. Conversely, however, this is also why we run into what is known as the curse of dimensionality, where higher-dimensional data intrinsically requires more samples for a classifier to be trained adequately.
PCA or Least Squares?

The geometric difference is that PCA minimizes the orthogonal (perpendicular) distance between each datapoint and its projection onto the principal components, whereas least squares minimizes the vertical distance between each datapoint and the fitted line.
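In symbols, for 2-dimensional datapoints $\mathbf{x}_k = (x_k, y_k)$ (and assuming, as above, that the data has been demeaned in the PCA case), the two objectives look like:

$$
\text{PCA:} \;\; \min_{\lVert \mathbf{w} \rVert = 1} \sum_{k} \lVert \mathbf{x}_k - (\mathbf{w}^T \mathbf{x}_k)\,\mathbf{w} \rVert^2
\qquad
\text{Least Squares:} \;\; \min_{a,\, b} \sum_{k} \bigl( y_k - (a x_k + b) \bigr)^2 .
$$

The first penalizes the perpendicular distance to the line spanned by $\mathbf{w}$; the second penalizes only the error in the $y$ coordinate.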