Attention, please!

Demystifying the math behind the attention mechanism and the transformer model


[Figure: Full transformer architecture]
PyTorch implementation here (or, with comments)





Abstract

This article provides an intuitive introduction to some of the fundamental components of modern machine learning: embeddings, attention mechanism, and the transformer model.

Embeddings map tokens (generally, objects) into continuous vector spaces where geometric relations encode semantic and syntactic structure. Since meaning often depends on context, representations must adapt to surrounding tokens. The attention mechanism enables this by weighting relevant information across the entire sequence. Queries, keys, and values specify how information is accessed and combined, and multiple attention heads allow the model to capture diverse patterns of interaction.

The transformer model builds on this mechanism to map an input sequence to an output sequence. Its encoder constructs a contextual representation of the input, while its decoder generates the corresponding output step by step. Positional encodings inject the notion of order, ensuring that sequences are not treated as unordered sets of tokens. Together, these components form a flexible and powerful architecture for modeling complex dependencies in sequence data.



Embeddings

Computers understand the meaning of words and the relationships between them through embeddings and embedding spaces. Embeddings are a way of representing objects as points in a continuous vector space, where the locations of the points are semantically meaningful.

In the context of language, the objects being represented are pieces of text, most often words. To efficiently embed a word means to map it into a space where words with similar meanings are represented by vectors that lie close together, while unrelated words are positioned far apart according to a chosen distance.

Let us consider the classic example of embedding the words king, queen, man and woman. A good $5$-dimensional embedding of the words could be the following:

\[\begin{aligned} &{\vec{v}_{king}} \hspace{-19pt}&= [\,0.9,\; 0.8,\; 0.7,\; 0.1,\; 0.2\,] \\ &{\vec{v}_{queen}} \hspace{-19pt}&= [\,0.9,\; 0.8,\; 0.7,\; 0.1,\; 0.7\,] \\ &{\vec{v}_{man}} \hspace{-19pt}&= [\,0.8,\; 0.6,\; 0.6,\; 0.2,\; 0.1\,] \\ &{\vec{v}_{woman}} \hspace{-19pt}&= [\,0.8,\; 0.6,\; 0.6,\; 0.2,\; 0.6\,] \end{aligned}\]

Let us consider the cosine similarity to measure how close two vectors are.

\[\text{cosine similarity}(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{\|\vec{v}\| \, \|\vec{w}\|} = \frac{\sum_{i=1}^n v_i\,w_i}{\sqrt{\sum_{i=1}^n v_i^2} \sqrt{\sum_{i=1}^n w_i^2}}\] \[\begin{align} &\text{cosine similarity}(\vec{v}_{king}, \vec{v}_{queen}) \hspace{-19pt}&\approx 0.95 \\ &\text{cosine similarity}(\vec{v}_{man}, \vec{v}_{woman}) \hspace{-19pt}&\approx 0.93 \end{align}\] \[\begin{align} &\vec{v}_{queen} \hspace{-20pt}&-\hspace{1pt} \vec{v}_{king} &= [\,0,\; 0,\; 0,\; 0,\; 0.5\,] \\ &\vec{v}_{woman} \hspace{-20pt}&- \hspace{1pt} \vec{v}_{man} &= [\,0,\; 0,\; 0,\; 0,\; 0.5\,] \end{align}\] \[\vec{v}_{king} - \vec{v}_{man} + \vec{v}_{woman} = \vec{v}_{queen}\]
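To make these numbers concrete, here is a minimal sketch in PyTorch (the framework used for the full implementation linked above). The five-dimensional vectors are the illustrative toy values from the example, not the output of a real embedding model.

```python
import torch
import torch.nn.functional as F

# Toy 5-dimensional embeddings from the example above (illustrative values only).
king  = torch.tensor([0.9, 0.8, 0.7, 0.1, 0.2])
queen = torch.tensor([0.9, 0.8, 0.7, 0.1, 0.7])
man   = torch.tensor([0.8, 0.6, 0.6, 0.2, 0.1])
woman = torch.tensor([0.8, 0.6, 0.6, 0.2, 0.6])

# Cosine similarity between related words is high.
print(F.cosine_similarity(king, queen, dim=0))    # ~0.95
print(F.cosine_similarity(man, woman, dim=0))     # ~0.93

# The classic analogy: king - man + woman lands on queen.
print(torch.allclose(king - man + woman, queen))  # True
```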

In practice, embeddings are generated by passing text through a model trained to represent linguistic meaning in a continuous vector space. The text is first preprocessed, $i.e.$ cleaned of extraneous symbols and split into tokens (words or subwords). Then, each token is converted by the model into a numerical vector.


Attention Mechanism

As mentioned, to embed a text we divide it into tokens (for simplicity, words), and we get the embedding vector of each token. A text can therefore be seen as a sequence of vectors.

Given a fixed embedding model, the process of embedding a specific word always produces the same vector, independently of the sentence’s context. This gets particularly tricky with polysemous words (words with multiple meanings). The word light can refer to illumination, but also to something that is not heavy. The word park can refer to the outdoor area, but also to the act of leaving a vehicle. The word right can refer to something that is correct or true, but can also be a direction, or even a legal entitlement.

If a model has to understand a word in a sentence, then the mathematical representation of the word (embedding vector) has to adapt to the context of the sentence.

The attention mechanism is a technique to add context to a given sequence representation. It is a way of creating contextualized embeddings. This is achieved by allowing each token in the sequence to look at every other token, and adjust its representation based on the relevant information it gathers.


A practical example.

Consider the sentence Bank of the river and a generic embedding model that outputs vectors in $\mathbb{R}^n$. Let each word be a token, so that the embedded sentence is represented by the sequence of vectors $[\vec{v}_1, \vec{v}_2, \vec{v}_3, \vec{v}_4]$, where $\vec{v}_1$ represents Bank, $\vec{v}_2$ represents of, etc.

Since the words have been embedded independently into a generic embedding space, it is likely that $\vec{v}_1$ lies close to vectors that represent money-themed words. However, in this sentence the word Bank refers to the river, so it would be more contextually meaningful if $\vec{v}_1$ were closer to the group of vectors associated with water.

The goal is to improve the sentence representation by creating a new sequence of vectors \([\vec{y}_1, \hspace{2pt}\vec{y}_2, \hspace{2pt}\vec{y}_3, \hspace{2pt}\vec{y}_4]\) where each vector \(\vec{y}_i\) is a better representation of its corresponding \(\vec{v}_i\) vector.

The vector \(\vec{y}_1\) is better than \(\vec{v}_1\) at representing the word Bank for this specific sentence if it is able to incorporate the context of water in itself. To do so, \(\vec{y}_1\) needs to allow every token of the sentence to have a say in its final representation. Mathematically speaking, this means that $\vec{y}_1$ is a weighted sum of $\vec{v}_1, \vec{v}_2, \vec{v}_3, \vec{v}_4$:

\[\vec{y}_1 = w_{11} \vec{v}_1 + w_{12} \vec{v}_2 + w_{13} \vec{v}_3 + w_{14} \vec{v}_4\]

The weight $w_{14}$ says how much the vector $\vec{v}_4$ should influence the original vector $\vec{v}_1$ when forming the new contextualized representation. How do we establish the value of these weights? The simplest way is to set them as the dot product between the two vectors involved.

\[w_{1j} = \vec{v}_1 \cdot \vec{v}_j = \sum_{k=1}^n v_{1_k} \, v_{j_k} = \|\vec{v}_1\| \, \|\vec{v}_j\| \, \cos(\theta_{1j}) \qquad j=1, .., 4\]

The dot product between two vectors measures how aligned they are, $i.e.$ how much one vector points in the same direction as the other in the embedding space. Here an important question arises. The vector representing bank currently means money, and we want it to mean water. Therefore, we want $w_{14}$ to be high. However, if the vectors bank and river are not aligned, or worse, if they are perpendicular, then $\cos(\theta_{14}) = 0$ and $w_{14}=0$. How can we be sure that river is able to influence bank? We will come back to this matter later.

For the moment, let us proceed with the simple dot product between the embedding vectors. A common practice is to normalize the set of weights to have $\small\sum_{j}\normalsize w_{1j} = 1$. In this way, the weights can be interpreted as a distribution over the elements of the sequence. The normalization is done by the softmax operation and, if we denote with $L$ the length of the sequence (in this case $L=4$), we can then write:

\[w_{1j} = \frac{\exp(\vec{v}_1 \cdot \vec{v}_j)}{\sum_{\ell=1}^L \exp(\vec{v}_1 \cdot \vec{v}_{\ell})} \qquad j = 1, .., L\]
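As a minimal sketch of this weighting step (PyTorch, with random stand-in vectors, since the actual embeddings of Bank of the river depend on the chosen embedding model):

```python
import torch

L, n = 4, 8                    # sequence length and embedding dimension (illustrative)
V = torch.randn(L, n)          # stand-in embeddings for "Bank", "of", "the", "river"

scores  = V[0] @ V.T                      # dot products v_1 . v_j, one per token
weights = torch.softmax(scores, dim=0)    # normalized weights w_1j, summing to 1
y1      = weights @ V                     # contextualized representation of "Bank"

print(weights.sum())   # ~1.0
print(y1.shape)        # torch.Size([8])
```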

For simplicity, in what follows we will write the weights in their non-normalized form:

\[w_{1j} = \vec{v}_1 \cdot \vec{v}_j \qquad j = 1, .., L\]

Let us now focus on the formulation of $\vec{y}_1$ and introduce some notation. The formulation:

\[\vec{y}_1 = w_{11} \vec{v}_1 + w_{12} \vec{v}_2 + w_{13} \vec{v}_3 + w_{14} \vec{v}_4\]

is the compressed version of:

\[\vec{y}_1 = (\vec{v}_1 \cdot \vec{v}_1) \hspace{2pt}\vec{v}_1 + (\vec{v}_1 \cdot \vec{v}_2) \hspace{2pt}\vec{v}_2 + (\vec{v}_1 \cdot \vec{v}_3) \hspace{2pt}\vec{v}_3 + (\vec{v}_1 \cdot \vec{v}_4) \hspace{2pt}\vec{v}_4\]

The vector $\vec{v}_1$ plays three different roles:

\[\vec{y}_1 = (\textcolor{Red}{\vec{v}_1} \cdot \textcolor{ForestGreen}{\vec{v}_1}) \hspace{2pt}\textcolor{RoyalBlue}{\vec{v}_1} + (\textcolor{Red}{\vec{v}_1} \cdot \vec{v}_2) \hspace{2pt}\vec{v}_2 + (\textcolor{Red}{\vec{v}_1} \cdot \vec{v}_3) \hspace{2pt}\vec{v}_3 + (\textcolor{Red}{\vec{v}_1} \cdot \vec{v}_4) \hspace{2pt}\vec{v}_4\]

The first occurrence (in red) is called the query, the second (in green) the key, and the third (in blue) the value. The query is the vector we are updating, the values provide the information used to update the query, and the keys determine the relevance of the values' information to the query.

We want to create a better representation for each $\vec{v}_i$, not just $\vec{v}_1$:

\[\displaylines{\vec{y}_1 = (\color{Red}{\vec{v}_1}\color{black} \cdot \color{ForestGreen}{\vec{v}_1}\color{black}) \hspace{2pt}\color{RoyalBlue}{\vec{v}_1}\color{black} + (\color{Red}{\vec{v}_1}\color{black} \cdot \color{ForestGreen}{\vec{v}_2}\color{black}) \hspace{2pt}\color{RoyalBlue}{\vec{v}_2}\color{black} + (\color{Red}{\vec{v}_1}\color{black} \cdot \color{ForestGreen}{\vec{v}_3}\color{black}) \hspace{2pt}\color{RoyalBlue}{\vec{v}_3}\color{black} + (\color{Red}{\vec{v}_1}\color{black} \cdot \color{ForestGreen}{\vec{v}_4}\color{black}) \hspace{2pt}\color{RoyalBlue}{\vec{v}_4}\color{black}\\ \vec{y}_2 = (\color{Red}{\vec{v}_2}\color{black} \cdot \color{ForestGreen}{\vec{v}_1}\color{black}) \hspace{2pt}\color{RoyalBlue}{\vec{v}_1}\color{black} + (\color{Red}{\vec{v}_2}\color{black} \cdot \color{ForestGreen}{\vec{v}_2}\color{black}) \hspace{2pt}\color{RoyalBlue}{\vec{v}_2}\color{black} + (\color{Red}{\vec{v}_2}\color{black} \cdot \color{ForestGreen}{\vec{v}_3}\color{black}) \hspace{2pt}\color{RoyalBlue}{\vec{v}_3}\color{black} + (\color{Red}{\vec{v}_2}\color{black} \cdot \color{ForestGreen}{\vec{v}_4}\color{black}) \hspace{2pt}\color{RoyalBlue}{\vec{v}_4}\color{black}\\ \vec{y}_3 =(\color{Red}{\vec{v}_3}\color{black} \cdot \color{ForestGreen}{\vec{v}_1}\color{black}) \hspace{2pt}\color{RoyalBlue}{\vec{v}_1}\color{black} + (\color{Red}{\vec{v}_3}\color{black} \cdot \color{ForestGreen}{\vec{v}_2}\color{black}) \hspace{2pt}\color{RoyalBlue}{\vec{v}_2}\color{black} + (\color{Red}{\vec{v}_3}\color{black} \cdot \color{ForestGreen}{\vec{v}_3}\color{black}) \hspace{2pt}\color{RoyalBlue}{\vec{v}_3}\color{black} + (\color{Red}{\vec{v}_3}\color{black} \cdot \color{ForestGreen}{\vec{v}_4}\color{black}) \hspace{2pt}\color{RoyalBlue}{\vec{v}_4}\color{black} \\ \vec{y}_4 = (\color{Red}{\vec{v}_4}\color{black} \cdot \color{ForestGreen}{\vec{v}_1}\color{black}) \hspace{2pt}\color{RoyalBlue}{\vec{v}_1}\color{black} + (\color{Red}{\vec{v}_4}\color{black} \cdot \color{ForestGreen}{\vec{v}_2}\color{black}) \hspace{2pt}\color{RoyalBlue}{\vec{v}_2}\color{black} + (\color{Red}{\vec{v}_4}\color{black} \cdot \color{ForestGreen}{\vec{v}_3}\color{black}) \hspace{2pt}\color{RoyalBlue}{\vec{v}_3}\color{black} + (\color{Red}{\vec{v}_4}\color{black} \cdot \color{ForestGreen}{\vec{v}_4}\color{black}) \hspace{2pt}\color{RoyalBlue}{\vec{v}_4}\color{black}}\\\]

We can group the equations and rewrite everything as:

\[\begin{bmatrix}\hspace{2pt}\vec{y}_1 \\ \hspace{2pt}\vec{y}_2 \\ \hspace{2pt}\vec{y}_3 \\ \hspace{2pt}\vec{y}_4 \end{bmatrix} =\begin{bmatrix}(\color{Red}{\vec{v}_1}\color{black} \cdot \color{ForestGreen}{\vec{v}_1}\color{black}) & (\color{Red}{\vec{v}_1}\color{black} \cdot \color{ForestGreen}{\vec{v}_2}\color{black}) & (\color{Red}{\vec{v}_1}\color{black} \cdot \color{ForestGreen}{\vec{v}_3}\color{black}) & (\color{Red}{\vec{v}_1}\color{black} \cdot \color{ForestGreen}{\vec{v}_4}\color{black})\\(\color{Red}{\vec{v}_2}\color{black} \cdot \color{ForestGreen}{\vec{v}_1}\color{black}) & (\color{Red}{\vec{v}_2}\color{black} \cdot \color{ForestGreen}{\vec{v}_2}\color{black}) & (\color{Red}{\vec{v}_2}\color{black} \cdot \color{ForestGreen}{\vec{v}_3}\color{black}) & (\color{Red}{\vec{v}_2}\color{black} \cdot \color{ForestGreen}{\vec{v}_4}\color{black})\\(\color{Red}{\vec{v}_3}\color{black} \cdot \color{ForestGreen}{\vec{v}_1}\color{black}) & (\color{Red}{\vec{v}_3}\color{black} \cdot \color{ForestGreen}{\vec{v}_2}\color{black}) & (\color{Red}{\vec{v}_3}\color{black} \cdot \color{ForestGreen}{\vec{v}_3}\color{black}) & (\color{Red}{\vec{v}_3}\color{black} \cdot \color{ForestGreen}{\vec{v}_4}\color{black})\\(\color{Red}{\vec{v}_4}\color{black} \cdot \color{ForestGreen}{\vec{v}_1}\color{black}) & (\color{Red}{\vec{v}_4}\color{black} \cdot \color{ForestGreen}{\vec{v}_2}\color{black}) & (\color{Red}{\vec{v}_4}\color{black} \cdot \color{ForestGreen}{\vec{v}_3}\color{black}) & (\color{Red}{\vec{v}_4}\color{black} \cdot \color{ForestGreen}{\vec{v}_4}\color{black})\end{bmatrix}\begin{bmatrix}\hspace{2pt}\color{RoyalBlue}{\vec{v}_1}\color{black} \\ \hspace{2pt}\color{RoyalBlue}{\vec{v}_2}\color{black} \\ \hspace{2pt}\color{RoyalBlue}{\vec{v}_3}\color{black} \\ \hspace{2pt}\color{RoyalBlue}{\vec{v}_4}\color{black}\end{bmatrix}\]

Then, rewrite it again as:

\[\begin{bmatrix} \hspace{2pt}\vec{y}_1 \\ \hspace{2pt}\vec{y}_2 \\ \hspace{2pt}\vec{y}_3 \\ \hspace{2pt}\vec{y}_4 \end{bmatrix} = \begin{bmatrix} \hspace{2pt}\color{Red}{\vec{v}_1}\color{black}\\ \hspace{2pt}\color{Red}{\vec{v}_2}\color{black}\\ \hspace{2pt}\color{Red}{\vec{v}_3}\color{black}\\ \hspace{2pt}\color{Red}{\vec{v}_4}\color{black} \end{bmatrix} \begin{bmatrix} \hspace{2pt}\color{ForestGreen}{\vec{v}_1}\color{black}\;\; \hspace{2pt}\color{ForestGreen}{\vec{v}_2}\color{black}\;\; \hspace{2pt}\color{ForestGreen}{\vec{v}_3}\color{black}\;\; \hspace{2pt}\color{ForestGreen}{\vec{v}_4}\color{black} \end{bmatrix} \begin{bmatrix} \hspace{2pt}\color{RoyalBlue}{\vec{v}_1}\color{black}\\ \hspace{2pt}\color{RoyalBlue}{\vec{v}_2}\color{black}\\ \hspace{2pt}\color{RoyalBlue}{\vec{v}_3}\color{black}\\ \hspace{2pt}\color{RoyalBlue}{\vec{v}_4}\color{black} \end{bmatrix}\]

If we now expand all vectors we get:

\[{\begin{bmatrix}\hspace{2pt}{y}_{1_1} \;\; {y}_{1_2} \;\; .. \;\; y_{1_n} \\ \hspace{2pt}{y}_{2_1} \;\; y_{2_2}\;\; .. \;\; y_{2_n} \\ \hspace{2pt}{y}_{3_1} \;\; y_{3_2} \;\; .. \;\; y_{3_n} \\ \hspace{2pt}{y}_{4_1} \;\; y_{4_2} \;\; .. \;\; y_{4_n} \end{bmatrix} =\begin{bmatrix}\hspace{2pt}\color{Red}{v}_{1_1} \;\; v_{1_2} \;\; .. \;\; \color{Red}v_{1_n} \\\hspace{2pt}\color{Red}{v}_{2_1} \;\; v_{2_2} \;\; \color{Red}.. \;\; \color{Red}v_{2_n} \\\hspace{2pt}\color{Red}{v}_{3_1} \;\; v_{3_2} \;\;.. \;\; v_{3_n} \\\hspace{2pt}\color{Red}{v}_{4_1} \;\; \color{Red} v_{4_2} \;\; .. \;\; \color{Red}v_{4_n} \end{bmatrix}\begin{bmatrix}\hspace{2pt}\color{ForestGreen}{v}_{1_1} \;\; \color{ForestGreen}v_{2_1} \;\; \color{ForestGreen}v_{3_1} \;\; \color{ForestGreen}v_{4_1} \\\hspace{2pt}\color{ForestGreen}{v}_{1_2} \;\; \color{ForestGreen}v_{2_2} \;\; \color{ForestGreen}v_{3_2} \;\; \color{ForestGreen}v_{4_2} \\ \;\; \color{ForestGreen}.. \hspace{9pt} \color{ForestGreen}.. \hspace{9pt} \color{ForestGreen}.. \hspace{9pt} \color{ForestGreen}.. \;\; \\\hspace{2pt}\color{ForestGreen}{v}_{1_n} \;\; \color{ForestGreen}v_{2_n} \;\; \color{ForestGreen}v_{3_n} \;\; \color{ForestGreen}v_{4_n} \end{bmatrix}\begin{bmatrix}\hspace{2pt}\color{RoyalBlue}{v}_{1_1} \;\; v_{1_2} \;\; \color{RoyalBlue}.. \;\; \color{RoyalBlue}v_{1_n} \\\hspace{2pt}\color{RoyalBlue}{v}_{2_1} \;\; v_{2_2} \;\; \color{RoyalBlue}.. \;\; \color{RoyalBlue}v_{2_n} \\\hspace{2pt}\color{RoyalBlue}{v}_{3_1} \;\; v_{3_2} \;\; \color{RoyalBlue}.. \;\; \color{RoyalBlue}v_{3_n} \\\hspace{2pt}\color{RoyalBlue}{v}_{4_1} \;\; v_{4_2} \;\; \color{RoyalBlue}.. \;\; \color{RoyalBlue}v_{4_n} \end{bmatrix}}\]

Equivalent to:

\[{\underbrace{ \begin{bmatrix} \hspace{2pt}{y}_{1_1} \;\; y_{1_2} \;\; .. \;\; y_{1_n} \\ \hspace{2pt}{y}_{2_1} \;\; y_{2_2} \;\; .. \;\; y_{2_n} \\ \hspace{2pt}{y}_{3_1} \;\; y_{3_2} \;\; .. \;\; y_{3_n} \\ \hspace{2pt}{y}_{4_1} \;\; y_{4_2} \;\; .. \;\; y_{4_n} \end{bmatrix}}_{\color{black}\text{Attention}}=\underbrace{\begin{bmatrix} \hspace{2pt}\color{Red}{v}_{1_1} \;\; \color{Red}v_{1_2} \;\; \color{Red}.. \;\; \color{Red}v_{1_n} \\ \hspace{2pt}\color{Red}{v}_{2_1} \;\; \color{Red}v_{2_2} \;\; \color{Red}.. \;\; \color{Red}v_{2_n} \\ \hspace{2pt}\color{Red}{v}_{3_1} \;\; \color{Red}v_{3_2} \;\; \color{Red}.. \;\; \color{Red}v_{3_n} \\ \hspace{2pt}\color{Red}{v}_{4_1} \;\; \color{Red}v_{4_2} \;\; \color{Red}.. \;\; \color{Red}v_{4_n} \end{bmatrix}}_{\color{Red}\text{Queries matrix (Q)}} \;\underbrace{ \begin{bmatrix} \hspace{2pt}\color{ForestGreen}{v}_{1_1} \;\; \color{ForestGreen}v_{1_2} \;\; \color{ForestGreen}.. \;\; \color{ForestGreen}v_{1_n} \\ \hspace{2pt}\color{ForestGreen}{v}_{2_1} \;\; \color{ForestGreen}v_{2_2} \;\; \color{ForestGreen}.. \;\; \color{ForestGreen}v_{2_n} \\ \hspace{2pt}\color{ForestGreen}{v}_{3_1} \;\; \color{ForestGreen}v_{3_2} \;\; \color{ForestGreen}.. \;\; \color{ForestGreen}v_{3_n} \\ \hspace{2pt}\color{ForestGreen}{v}_{4_1} \;\; \color{ForestGreen}v_{4_2} \;\; \color{ForestGreen}.. \;\; \color{ForestGreen}v_{4_n} \end{bmatrix}^T}_{\color{ForestGreen}\text{Keys matrix (K}\color{black}^T\color{ForestGreen}\text{)}} \;\underbrace{ \begin{bmatrix} \hspace{2pt}\color{RoyalBlue}{v}_{1_1} \;\; \color{RoyalBlue}v_{1_2} \; \color{RoyalBlue}.. \;\; \color{RoyalBlue}v_{1_n} \\ \hspace{2pt}\color{RoyalBlue}{v}_{2_1} \;\; \color{RoyalBlue}v_{2_2} \;\; \color{RoyalBlue}.. \;\; \color{RoyalBlue}v_{2_n} \\ \hspace{2pt}\color{RoyalBlue}{v}_{3_1} \;\; \color{RoyalBlue}v_{3_2} \;\; \color{RoyalBlue}.. \;\; \color{RoyalBlue}v_{3_n} \\ \hspace{2pt}\color{RoyalBlue}{v}_{4_1} \;\; \color{RoyalBlue}v_{4_2} \;\; \color{RoyalBlue}.. \;\; \color{RoyalBlue}v_{4_n} \end{bmatrix}}_{\color{RoyalBlue}\text{Values matrix (V)}} \;}\]

This translates into the following attention formulation:

\[\text{Attention}(\textcolor{Red}{Q}, \textcolor{ForestGreen}{K}, \textcolor{RoyalBlue}{V}) = \textcolor{Red}{Q} \textcolor{ForestGreen}{K}^T \textcolor{RoyalBlue}{V}\]

The Attention matrix is composed of a set of vectors \(\{\vec{y}_i\}_i\) such that every $\vec{y}_i$ is a better (more contextualized) representation of the corresponding vector $\vec{v}_i$.

At this point, recall that we temporarily dropped the normalization in the computation of the weights. Restoring it, the attention formulation becomes:

\[\text{Attention}(\textcolor{Red}{Q}, \textcolor{ForestGreen}{K}, \textcolor{RoyalBlue}{V}) = \text{softmax}(\textcolor{Red}{Q} \textcolor{ForestGreen}{K}^T) \hspace{2pt}\textcolor{RoyalBlue}{V}\]

The original paper introduces an additional scaling factor of $1/{\sqrt{d_k}}$ for numerical stability, where $d_k$ is the dimension of the key vectors.

Attention mechanism (scaled dot-product attention) $$ \text{Attention}( \textcolor{Red}{Q}, \textcolor{ForestGreen}{K}, \textcolor{RoyalBlue}{V} ) = \text{softmax}\!\left( \frac{\textcolor{Red}{Q}\,\textcolor{ForestGreen}{K}^{T}}{\sqrt{d_k}} \right) \textcolor{RoyalBlue}{V} $$
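In matrix form, the boxed formula is only a few lines of code. A minimal sketch in PyTorch, where for now $Q$, $K$, $V$ are simply the stacked embedding vectors themselves (the learned projections are introduced in the next section):

```python
import torch

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5   # (L, L) matrix of dot products q_i . k_j
    weights = torch.softmax(scores, dim=-1)       # each row is a distribution over the sequence
    return weights @ V                            # row i is the contextualized vector y_i

L, d_model = 4, 8
X = torch.randn(L, d_model)          # embedded sentence, e.g. "Bank of the river"
Y = attention(X, X, X)               # self-attention: Q = K = V = X
print(Y.shape)                       # torch.Size([4, 8])
```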

We can rename the query vectors as \(\{\textcolor{Red}{\vec{q}_i}\}_{\color{Red}i}\), the keys as \(\{\textcolor{ForestGreen}{\vec{k}_i}\}_{\color{ForestGreen}i}\), and leave the values as \(\{\textcolor{RoyalBlue}{\vec{v}_i}\}_{\color{RoyalBlue}i}\). With this nomenclature, the equation for each $\vec{y}_i$ becomes:

\[{\vec{y}_i = \sum_{j=1}^{L} \left({\frac{\exp\large\left(\frac{\textcolor{Red}{\vec{q\vphantom{k}_i}}\, \cdot\, \textcolor{ForestGreen}{\vec{k}_{j}}}{\sqrt{d_k}}\right)}{\normalsize\sum_{\ell=1}^L \exp\large\left(\frac{\textcolor{Red}{\vec{q\vphantom{k}_i}} \,\cdot\, \textcolor{ForestGreen}{\vec{k}_{\ell}}}{\sqrt{d_k}}\right)}}\right)\hspace{2pt} \normalsize \textcolor{RoyalBlue}{\vec{v}_{j}}}\]


Multi-Head Attention

By defining the weights \(\{w_{ij}\}_{i,j}\) as a dot product (normalized or not) of the plain original embedding vectors we hit a strong limitation. How can we be sure that the right vectors will influence the new representation? In the previous example, how can we be sure that river will have a strong influence on the new representation of bank? If the vectors $\vec{v}_1$ and $\vec{v}_4$ are almost perpendicular, then river will have little to no influence in the making of $\vec{y}_1$.

To address this, the vectors \(\{\vec{v}_i\}_i\) are not used directly in the attention mechanism (arranged into $Q$, $K$, $V$); they are first transformed into different representations through learned linear projections.

The original embedding vectors are arranged into $Q$, $K$, $V \in \mathbb{R}^{L\times d_{\text{model}}}$, $L$ being the length of the sequence (number of words) and $d_{\text{model}}$ the dimension of the original embedding space. We then introduce $W_Q$, $W_K$, $W_V \in \mathbb{R}^{d_{\text{model}}\times d_k}$, where $d_k$ is the dimension of the key vectors. These are learned linear transformations: during training, the goal is to adjust these matrices so that the transformed queries, keys, and values produce meaningful attention outputs. The optimization process strengthens the alignment between queries and keys that are contextually relevant, ensuring that, for example, the query for bank aligns strongly with the key for river. The attention mechanism is then computed on the projected versions of the query, key, and value matrices.

\[\text{Attention}(QW_Q, KW_K, VW_V)\]

where $QW_Q$ is the linear projection of $Q$ according to the mapping represented by $W_Q$, etc.

A single set of projections, followed by the attention mechanism, might end up capturing only a single type of relationship between words. In our case, a single set of projections might fix the representation of bank, but what if there are more relationships to exploit? Multi-head attention solves this by introducing multiple sets of projections in parallel. Instead of a single Attention matrix, called a head, multi-head attention computes $H$ of them:

\[\text{head}_h = \text{Attention}(QW_Q^h,\,KW_K^h, VW_V^h) \qquad h=1,.., H\]

where each head $h$ has its own projection matrices $W_Q^h$, $W_K^h$, $W_V^h$. Then, the outputs of all heads are concatenated and linearly transformed (projected) one last time:

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, .., \text{head}_H)\,W_O\]

where $W_O \in \mathbb{R}^{H \cdot d_k \times d_{\text{model}}}$ is a learned matrix that brings the concatenated outputs back to the original embedding dimension $d_{\text{model}}$. This final projection allows the model to integrate the information captured by each head into a single representation for each token.

Multi-head attention $$ \begin{aligned} \text{MultiHead}(Q, K, V) &= \text{Concat}(\text{head}_1, .., \text{head}_H)\, W_O \\[2pt] \text{head}_h &= \text{Attention}(Q W_Q^h, K W_K^h, V W_V^h) \quad h = 1, .., H \end{aligned} $$
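A minimal sketch of multi-head attention following the formulation above (PyTorch; one projection triple $W_Q^h$, $W_K^h$, $W_V^h$ per head and a final $W_O$, with illustrative dimensions):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, d_k: int, n_heads: int):
        super().__init__()
        # One set of learned projections per head: W_Q^h, W_K^h, W_V^h in R^(d_model x d_k).
        self.W_Q = nn.ModuleList([nn.Linear(d_model, d_k, bias=False) for _ in range(n_heads)])
        self.W_K = nn.ModuleList([nn.Linear(d_model, d_k, bias=False) for _ in range(n_heads)])
        self.W_V = nn.ModuleList([nn.Linear(d_model, d_k, bias=False) for _ in range(n_heads)])
        # Final projection W_O in R^((H * d_k) x d_model).
        self.W_O = nn.Linear(n_heads * d_k, d_model, bias=False)
        self.d_k = d_k

    def attention(self, Q, K, V):
        weights = torch.softmax(Q @ K.transpose(-2, -1) / self.d_k**0.5, dim=-1)
        return weights @ V

    def forward(self, Q, K, V):
        # head_h = Attention(Q W_Q^h, K W_K^h, V W_V^h)
        heads = [self.attention(wq(Q), wk(K), wv(V))
                 for wq, wk, wv in zip(self.W_Q, self.W_K, self.W_V)]
        return self.W_O(torch.cat(heads, dim=-1))   # Concat(head_1, .., head_H) W_O

X = torch.randn(4, 8)                               # L = 4 tokens, d_model = 8
mha = MultiHeadAttention(d_model=8, d_k=4, n_heads=2)
print(mha(X, X, X).shape)                           # torch.Size([4, 8])
```

In practice the per-head projections are usually fused into single larger weight matrices for efficiency; the per-head version above simply mirrors the notation of the formulas.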

Different projections of $Q$, $K$, and $V$ will produce different attention outputs. Intuitively, each attention head can focus on a different type of relationship or pattern in the sequence. For example, one head might learn to capture syntactic dependencies ($e.g.$ subject-verb agreement), another semantic associations ($e.g.$ bank and river), and another could focus on positional relationships or long-range dependencies. The more heads there are, the higher the chance of capturing all the meaningful hidden relationships within the sequence.


Positional Encoding

It is now clear how to construct a contextualized representation of a sentence. However, we have neglected valuable information: the position of each token (word) within the sentence. When creating a better representation of the $i$-th word, we allow every other word to provide context, but we treat them collectively as a set of words rather than an ordered sequence.

Let us consider the sentences The man hit the car and The car hit the man. Although the words are exactly the same, the two sentences have different meanings. If we want a model to understand the two sentences differently, we need to embed them differently. With multi-head attention we can hope that one of the heads will learn to encode the positional relationship, but rather than relying on hope, we can incorporate this information explicitly.

Let us go back to the case of single-head attention with plain un-normalized dot products as weights. Consider a tokenizer that creates the tokens the man, hit, the car. For both sentences we first embed each token, obtaining \(\vec{v}_{man}\), \(\vec{v}_{hit}\), \(\vec{v}_{car}\). Then, to get the contextualized version of ($e.g.$) $\vec{v}_{man}$ we compute:

\[\begin{align} \vec{y}_{man}^{\hspace{4pt}\text{sentence } 1} & = w_{man,\, man}\, \vec{v}_{man} & \hspace{-20pt} + w_{man,\, hit}\, \vec{v}_{hit} & + w_{man,\, car}\, \vec{v}_{car} \\[0pt] \vec{y}_{man}^{\hspace{4pt}\text{sentence } 2} & = w_{man,\, car}\, \vec{v}_{car} & \hspace{-20pt} + w_{man,\, hit}\, \vec{v}_{hit} & + w_{man,\, man}\, \vec{v}_{man} \end{align}\]

The two formulations are identical, $i.e.$ the final contextualized embedded sentences (in this simplified setup) will be the same. This implies that the representation of the sentence does not provide the information on whether the man hit the car, or the other way around.

To incorporate the positional information of the token in the representation, we generate unique position vectors that match the dimensionality of the token embeddings, and we add them to the un-contextualized token embeddings.

\[\vec{z}_i = \vec{v}_i + \vec{p}_i \qquad i=1, .., L\]

Why addition instead of, $e.g.$, concatenation or multiplication? This is not only for convenience, since addition keeps the representation dimension $d_{\text{model}}$ unchanged, but also because addition encourages the model to treat position as a feature of the token rather than as a separate entity.

There are different types of position vectors (or positional encodings): they can be learned or fixed. The original paper introduces sinusoidal positional encodings, a set of pre-defined encodings based on sine and cosine functions.

Sinusoidal positional encodings $$ \begin{aligned} \text{PE}(pos, \,2i) &= \sin(pos /10000^{2i/d_{\text{model}}} )\\[2pt] \text{PE}(pos, \,2i+1) &= \cos(pos /10000^{2i/d_{\text{model}}} ) \end{aligned} $$

where $pos$ is the position of the token in the sequence, $i$ is the dimension index (within the vector), and $d_{model}$ is the dimension of the original embedding space.

Going back to our example, suppose that $d_{\text{model}} = 4$ and that the original embeddings are:

\[\begin{aligned} &{\vec{v}_{man}} \hspace{-19pt}&= [\,1,\; 0,\; 0,\; 0\,] \\ &{\vec{v}_{hit}} \hspace{-19pt}&= [\,0,\; 1,\; 0,\; 0\,] \\ &{\vec{v}_{car}} \hspace{-19pt}&= [\,0,\; 0,\; 1,\; 0\,] \end{aligned}\]

Since the dimension of the embedding space is $d_{\text{model}} = 4$, the available values for $i$ are $0$ and $1$: $i=0$ produces the positions $2i = 0$, $2i+1 = 1$, while $i=1$ produces the positions $2i =2$, and $2i+1 = 3$. Therefore:

\[\begin{aligned} i=0:\qquad &\text{PE}(pos, 0) = \sin(pos /10000^{0/d_{\text{model}}} ) = \sin(pos)\\ &\text{PE}(pos, 1) = \cos(pos /10000^{0/d_{\text{model}}} ) = \cos(pos)\\ i=1:\qquad&\text{PE}(pos, 2) = \sin(pos /10000^{2/d_{\text{model}}} ) = \sin(pos/{100})\\ &\text{PE}(pos, 3) = \cos(pos /10000^{2/d_{\text{model}}} ) = \cos({pos}/{100}) \end{aligned}\]

Both sentences have three tokens. The positional encodings for the tokens are independent of the content of the tokens, and are computed as follows:

\[\begin{aligned} pos=1:\qquad \vec{p}_{1} &= [\,\text{PE}(1,0),\; \text{PE}(1,1),\; \text{PE}(1,2),\; \text{PE}(1,3)\,] \\ &= [\,\sin(1),\; \cos(1),\; \sin(1/100),\; \cos(1/100)\,] \\ &= [\,0.8415,\; 0.5403,\; 0.0100,\; 0.9999\,]\\ pos=2:\qquad \vec{p}_2 &= [\,0.9093,\; -0.4161,\; 0.0200,\; 0.9998\,]\\ pos=3:\qquad \vec{p}_3 &= [\,0.1411,\; -0.9899,\; 0.0300,\; 0.9996\,] \end{aligned}\]
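The following sketch (PyTorch) reproduces these vectors for $d_{\text{model}} = 4$ and positions $1, 2, 3$:

```python
import torch

def sinusoidal_positional_encoding(positions, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pe = torch.zeros(len(positions), d_model)
    pos = torch.tensor(positions, dtype=torch.float32).unsqueeze(1)   # column of positions, (L, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)              # even dimension indices 0, 2, ..
    angle = pos / 10000 ** (i / d_model)                              # (L, d_model/2)
    pe[:, 0::2] = torch.sin(angle)   # even dimensions get sines
    pe[:, 1::2] = torch.cos(angle)   # odd dimensions get cosines
    return pe

P = sinusoidal_positional_encoding([1, 2, 3], d_model=4)
print(P)   # rows match p_1, p_2, p_3 above, up to rounding in the printout
```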

With these positional encodings we can now compute token embeddings that take into account the position of each token within the sentence.

\[\begin{aligned} \text{sentence }1:\qquad \vec{z}_{man} &= \vec{v}_{man} \hspace{-20pt} &+ \vec{p}_1 &= [\,1.8415,\; 0.5403,\; 0.0100,\; 0.9999\,]\\ \vec{z}_{hit} &= \vec{v}_{hit} \hspace{-20pt} &+ \vec{p}_2 &= [\,0.9093,\; 0.5839,\; 0.0200,\; 0.9998\,]\\ \vec{z}_{car} &= \vec{v}_{car} \hspace{-20pt} &+ \vec{p}_3 &= [\,0.1411,\; -0.9899,\; 1.0300,\; 0.9996\,]\\[0pt] \text{sentence }2:\qquad \hspace{4pt}\vec{z}_{car} &= \vec{v}_{car} \hspace{-20pt} &+ \vec{p}_1 &= [\,0.8415,\; 0.5403,\; 1.0100,\; 0.9999\,]\\ \vec{z}_{\text{hit}} &= \vec{v}_{hit} \hspace{-20pt} &+ \vec{p}_2 &= [\,0.9093,\; 0.5839,\; 0.0200,\; 0.9998\,]\\ \vec{z}_{man} &= \vec{v}_{man} \hspace{-20pt} &+ \vec{p}_3 &= [\,1.1411,\; -0.9899,\; 0.0300,\; 0.9996\,]\\[10pt] \end{aligned}\]

Now, even if \(\vec{v}_{man}\) is the same for both sentences, the vector \(\vec{z}_{man}\) is different.


Transformer

We have now introduced all the building blocks needed to understand the transformer.

The model was originally proposed as a sequence transduction model, $i.e.$ a model that maps an input sequence to an output sequence. It is a neural network that relies entirely on the attention mechanism to model relationships between tokens in a sequence. Its key idea is to transform a sequence of input embeddings into a sequence of meaningful representations by repeatedly applying attention and feed-forward transformations.

At a higher level, the transformer is composed of two main components: the encoder, which processes the input sequence and builds a contextualized representation of it, and the decoder, which takes that representation and generates the corresponding output sequence. Both components are structured as stacks of identical blocks, each containing several layers.

Encoder

The encoder receives as input a sequence of tokens and produces a sequence of context-rich vectors, where essential information and contextual relationships are captured.

Let $L$ be the length of the input sequence (number of tokens) and $d_{\text{model}}$ the dimension of the original embedding space. Before entering the encoder, each token $i$ of the sequence is converted into its corresponding embedding vector \(\vec{v}_i\in\mathbb{R}^{d_{\text{model}}}\). Then the positional encoding \(\vec{p}_i\in\mathbb{R}^{d_{\text{model}}}\) is added to each embedding vector to inject information about the position of the token in the sequence. The result is the position-enhanced input.

\[Z = X + P = \begin{bmatrix} \vec{v}_1 \\ \vec{v}_2 \\ .. \\ \vec{v}_L \end{bmatrix} + \begin{bmatrix} \vec{p}_1 \\ \vec{p}_2 \\ .. \\ \vec{p}_L \end{bmatrix} \in \mathbb{R}^{L \times d_{\text{model}}}.\]

The encoder consists of multiple identical blocks. Each block first applies multi-head self-attention over the whole sequence, then a position-wise feed-forward network (FFN), with a residual connection and layer normalization around each of the two sub-layers.

The attention transformation mixes information across tokens (global context), while the FFN acts within each token's feature space (local transformation).
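A minimal sketch of one encoder block under these assumptions (PyTorch, using nn.MultiheadAttention for the self-attention sub-layer; the post-norm arrangement follows the original paper, and the dimensions are illustrative):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, z):                        # z: (batch, L, d_model), the position-enhanced input
        attn_out, _ = self.self_attn(z, z, z)    # self-attention: global mixing across tokens
        z = self.norm1(z + attn_out)             # residual connection + layer normalization
        z = self.norm2(z + self.ffn(z))          # per-token FFN + residual + layer normalization
        return z

block = EncoderBlock(d_model=8, n_heads=2, d_ff=32)
Z = torch.randn(1, 4, 8)                         # batch of one sentence with L = 4 tokens
print(block(Z).shape)                            # torch.Size([1, 4, 8])
```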

Decoder

The decoder generates the output sequence one token at a time, attending both to the tokens already generated and to the final input representation produced by the encoder.

Let $T$ be the length of the output sequence produced so far and $d_{\text{model}}$ the dimension of the embedding space. Before entering the decoder, each token $j$ of the output sequence is converted into its corresponding embedding vector \(\vec{u}_j\in\mathbb{R}^{d_{\text{model}}}\). Then the positional encoding \(\vec{p}_j\in\mathbb{R}^{d_{\text{model}}}\) is added to each $\vec{u}_j$.

\[Z' = Y + P = \begin{bmatrix} \vec{u}_1 \\ \vec{u}_2 \\ .. \\ \vec{u}_T \end{bmatrix} + \begin{bmatrix} \vec{p}_1 \\ \vec{p}_2 \\ .. \\ \vec{p}_T \end{bmatrix} \in \mathbb{R}^{T \times d_{\text{model}}}.\]

The decoder consists of multiple identical blocks. Unlike the encoder, where attention is always computed over the same input sequence, the decoder uses two types of attention: masked self-attention, computed over the previously generated outputs, and encoder-decoder (cross) attention, where the previously generated outputs act as queries while the encoder representations provide the keys and values. In detail, each block applies masked self-attention, then cross-attention, then a position-wise feed-forward network, with a residual connection and layer normalization around each sub-layer.
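A minimal sketch of one decoder block under the same assumptions (PyTorch; the boolean causal mask blocks attention to future positions, and memory stands for the encoder output):

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.self_attn  = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, z, memory):                # z: (batch, T, d_model), memory: (batch, L, d_model)
        T = z.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)   # True = position masked out
        attn_out, _ = self.self_attn(z, z, z, attn_mask=causal)               # masked self-attention
        z = self.norm1(z + attn_out)
        attn_out, _ = self.cross_attn(z, memory, memory)   # queries from the decoder, keys/values from the encoder
        z = self.norm2(z + attn_out)
        return self.norm3(z + self.ffn(z))                 # position-wise FFN + residual + layer normalization

block = DecoderBlock(d_model=8, n_heads=2, d_ff=32)
out = block(torch.randn(1, 3, 8), memory=torch.randn(1, 4, 8))
print(out.shape)                                 # torch.Size([1, 3, 8])
```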

Prediction

At the end of the decoder stack, the sequence of output vectors \(\{\text{Output}_3^{(t)}\}_{t=1}^T\) (the subscript refers to the output of the block's third and last sub-layer), called hidden vectors, encodes both the information from the previously generated tokens and the contextual information from the encoder. Among these $T$ vectors, only the hidden vector at the latest position is used to predict the next token: \(\text{Output}_3^{(T)} \in \mathbb{R}^{d_{\text{model}}}\).

Let \(d_{\text{voc}}\) be the size of the target vocabulary. To obtain the next-token distribution over the vocabulary, the latest hidden vector is projected through a learned linear transformation \(W_{\text{voc}} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{voc}}}\), and a softmax is applied to the result:

\[\mathbb{P}_{T+1} = \text{softmax}\!\left(\text{Output}_3^{(T)}\,W_{\text{voc}} + \vec{b}_{\text{voc}}\right).\]

The model then selects either the most likely token, or one sampled from this distribution. This token becomes the next element of the sequence, is appended to the list of generated tokens, and is fed back into the decoder at the following step. The process continues until the model outputs an end-of-sequence (EOS) token or reaches a maximum length $T_{\text{max}}$.
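A minimal sketch of this generation loop (PyTorch, greedy decoding). Here decoder, embed, W_voc, bos_id, and eos_id are hypothetical placeholders standing in for the trained decoder stack, the embedding plus positional-encoding step, the output projection, and the special token ids:

```python
import torch

def greedy_decode(decoder, embed, W_voc, memory, bos_id, eos_id, t_max):
    """Generate one token at a time until EOS or t_max (all components are hypothetical placeholders)."""
    tokens = [bos_id]
    for _ in range(t_max):
        z = embed(torch.tensor(tokens).unsqueeze(0))   # embeddings + positional encodings, (1, T, d_model)
        hidden = decoder(z, memory)                    # decoder stack output, (1, T, d_model)
        logits = W_voc(hidden[:, -1, :])               # project only the latest hidden vector, (1, d_voc)
        probs = torch.softmax(logits, dim=-1)          # next-token distribution P_{T+1}
        next_id = int(probs.argmax())                  # greedy choice (sampling is the alternative)
        tokens.append(next_id)
        if next_id == eos_id:                          # stop at the end-of-sequence token
            break
    return tokens
```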




Resources