I was familiar with most of this content before taking the course, so, combined with the time constraints of a summer course, the coverage here is very limited.


Inferencing

The input query is tokenized, then each token is converted to an integer and used to look up an embedding within the learnt latent space. This representation is then used to predict the embedding of the next token, which is subsequently decoded into human-readable text.

Not a horrible simplification.
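A quick toy sketch of that pipeline, with a made-up vocabulary, random weights, and hypothetical helper names standing in for the real tokenizer and model:

```python
# Toy sketch of the inference steps described above. Vocabulary, weights,
# and helper names are all made up purely for illustration.
import numpy as np

rng = np.random.default_rng(0)

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}   # toy vocabulary
inv_vocab = {i: w for w, i in vocab.items()}

d_model = 8
embedding_table = rng.normal(size=(len(vocab), d_model))    # "learnt" latent space (random here)
output_proj = rng.normal(size=(d_model, len(vocab)))        # maps hidden state back to vocab logits

def tokenize(text):
    # 1. split the query into tokens and map each token to an integer id
    return [vocab[w] for w in text.split()]

def predict_next_token(token_ids):
    # 2. look up an embedding for each token id
    embeddings = embedding_table[token_ids]                  # (seq_len, d_model)
    # 3. use the representation to predict the next token
    #    (a real model applies many transformer layers; here we just
    #     take the last position's embedding)
    hidden = embeddings[-1]
    logits = hidden @ output_proj                            # (vocab_size,)
    return int(np.argmax(logits))

ids = tokenize("the cat sat on")
next_id = predict_next_token(ids)
# 4. decode the predicted token id back into human-readable text
print(inv_vocab[next_id])
```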



Attention

Instead of conditioning on all tokens equally, condition more on relevant tokens!

Autoregressive generation of next token $x_6$.
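As a toy numeric illustration of the idea (all scores and vectors invented for the example), softmax-normalised relevance scores decide how much each previous token contributes to the representation used to predict the next token:

```python
# Toy illustration: weighting previous tokens x_1..x_5 by relevance when
# predicting x_6. Scores and token vectors are made up for the example.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

token_vectors = np.array([          # one vector per previous token x_1..x_5
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.5, 0.5],
    [0.0, 2.0],
])
relevance = np.array([0.1, 2.0, 0.3, 0.1, 1.5])   # how relevant each token is to the prediction

weights = softmax(relevance)                      # sums to 1; relevant tokens dominate
context = weights @ token_vectors                 # weighted combination used to predict x_6

print(weights.round(3), context.round(3))
```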


Single-Headed Attention

Interesting way to view it.
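A minimal sketch of scaled dot-product self-attention with a single head and a causal mask, using random projection matrices; the function names and shapes are illustrative, not any particular library's API:

```python
# Minimal single-headed (scaled dot-product) self-attention with a causal
# mask, so each position only attends to earlier tokens.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_q/w_k/w_v project into query/key/value spaces
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                      # (seq_len, seq_len) relevance scores
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores = np.where(mask, -np.inf, scores)             # causal mask: no attending to future tokens
    attn = softmax(scores, axis=-1)                      # each row sums to 1
    return attn @ v                                      # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(single_head_attention(x, w_q, w_k, w_v).shape)     # (5, 4)
```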


Multi-Headed Attention

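A minimal sketch of the multi-headed version along the same lines: each head runs scaled dot-product attention on its own query/key/value projections (causal mask omitted for brevity), and the head outputs are concatenated and projected back to the model dimension. Again, all names and weights are illustrative:

```python
# Minimal multi-headed attention: several independent heads in parallel on
# lower-dimensional projections, then concatenate and project back.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def multi_head_attention(x, heads, w_o):
    # heads: list of (w_q, w_k, w_v) projection triples, one per head
    outputs = [attention(x @ w_q, x @ w_k, x @ w_v) for (w_q, w_k, w_v) in heads]
    concat = np.concatenate(outputs, axis=-1)     # (seq_len, n_heads * d_head)
    return concat @ w_o                           # project back to d_model

rng = np.random.default_rng(1)
seq_len, d_model, n_heads = 5, 8, 2
d_head = d_model // n_heads
x = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(n_heads)]
w_o = rng.normal(size=(n_heads * d_head, d_model))
print(multi_head_attention(x, heads, w_o).shape)  # (5, 8)
```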

Transformer Architecture

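A minimal sketch of one decoder-style transformer block, assuming the common pre-norm GPT-style layout (masked self-attention and a position-wise MLP, each wrapped in a residual connection with layer normalisation); it illustrates the structure rather than the exact architecture from the figure:

```python
# One pre-norm, decoder-style transformer block with random weights.
import numpy as np

rng = np.random.default_rng(2)
seq_len, d_model, d_ff = 5, 8, 32

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    return softmax(np.where(mask, -np.inf, scores), axis=-1) @ v

def transformer_block(x, p):
    # residual + masked self-attention sub-layer
    x = x + self_attention(layer_norm(x), p["w_q"], p["w_k"], p["w_v"]) @ p["w_o"]
    # residual + position-wise feed-forward sub-layer
    h = np.maximum(0, layer_norm(x) @ p["w_1"])   # ReLU MLP
    return x + h @ p["w_2"]

params = {
    "w_q": rng.normal(size=(d_model, d_model)), "w_k": rng.normal(size=(d_model, d_model)),
    "w_v": rng.normal(size=(d_model, d_model)), "w_o": rng.normal(size=(d_model, d_model)),
    "w_1": rng.normal(size=(d_model, d_ff)),    "w_2": rng.normal(size=(d_ff, d_model)),
}
x = rng.normal(size=(seq_len, d_model))
print(transformer_block(x, params).shape)   # (5, 8)
```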


Reinforcement Learning from Human Feedback

A technique that models sentence generation as an MDP, where rewards are derived from human feedback (in practice, from a reward model trained on human preference comparisons).
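A minimal sketch of that framing: the state is the tokens generated so far, the action is the next token, and a terminal reward stands in for the human-feedback signal (here just a stub function; real pipelines use a learned reward model and typically optimise with PPO rather than the plain REINFORCE update below):

```python
# Toy MDP view of sentence generation: state = tokens so far, action = next
# token, terminal reward = stand-in for a human-feedback reward model.
import numpy as np

rng = np.random.default_rng(3)
vocab_size, d_state, max_len = 10, 16, 6
policy_w = rng.normal(size=(d_state, vocab_size)) * 0.01   # toy policy parameters

def encode_state(tokens):
    # stand-in for encoding the partial sentence; here just a bag of token ids
    state = np.zeros(d_state)
    for t in tokens:
        state[t % d_state] += 1.0
    return state

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def human_feedback_reward(tokens):
    # stub: in real RLHF this comes from a reward model fit to human preferences
    return float(len(set(tokens))) / max_len

def sample_episode():
    tokens, grads = [], []
    for _ in range(max_len):                       # one MDP step per generated token
        s = encode_state(tokens)
        probs = softmax(s @ policy_w)
        a = rng.choice(vocab_size, p=probs)        # action = next token
        onehot = np.eye(vocab_size)[a]
        grads.append(np.outer(s, onehot - probs))  # grad of log pi(a | s)
        tokens.append(int(a))
    return tokens, grads

lr = 0.1
for _ in range(100):
    tokens, grads = sample_episode()
    reward = human_feedback_reward(tokens)         # terminal reward for the whole sentence
    for g in grads:
        policy_w += lr * reward * g                # REINFORCE: reinforce rewarded actions
```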