
Foundations of Transformer Language Modeling


LLM

2025-11-21

0. Introduction: the mathematical definition of language modeling

Natural language can be viewed as a discrete sequence generated from a finite vocabulary \mathcal{V}:

x_1, x_2, \dots, x_T, \quad x_t \in \mathcal{V}.

The goal of a language model is to define the probability distribution

p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t}),

where x_{<t} = x_1, \dots, x_{t-1}.
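The chain-rule factorization can be made concrete with a toy model. The sketch below uses a hypothetical hand-written bigram table (the context x_{<t} truncated to one previous token); the vocabulary and probabilities are invented purely for illustration.

```python
# Hypothetical toy vocabulary and bigram table p(x_t | x_{t-1}),
# illustrating p(x_1..x_T) = prod_t p(x_t | x_{<t}).
bigram = {
    ("<s>", "the"): 0.9,
    ("the", "cat"): 0.5,
    ("cat", "sat"): 0.4,
}

def sequence_prob(tokens):
    """Multiply the conditional probabilities along the sequence.
    Here the context is truncated to one token (a bigram approximation)."""
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigram.get((prev, cur), 0.0)
    return p

print(sequence_prob(["<s>", "the", "cat", "sat"]))  # 0.9 * 0.5 * 0.4
```

A real language model replaces the lookup table with a neural network that outputs a distribution over \mathcal{V} given the full prefix.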

1. The decoder-only Transformer

Modern language models such as GPT and LLaMA typically use the decoder-only Transformer, because it naturally matches autoregressive prediction.

Architecture of a decoder-only transformer language model. Source of figure: Language Models from Scratch, Stanford CS336 course notes

1.1 Input representation: token embedding + position embedding

Each token id is first mapped into a vector:

e_t = E[x_t] \in \mathbb{R}^d,

where E \in \mathbb{R}^{|\mathcal{V}| \times d} is the embedding matrix.

Because attention by itself does not encode position, we add a positional vector p_t:

h_t^{(0)} = e_t + p_t.
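The input layer is just a table lookup plus an addition. A minimal NumPy sketch, assuming a vocabulary of size 10, model dimension d = 4, and random stand-ins for the learned tables:

```python
import numpy as np

# Minimal sketch of the input representation: h_t^(0) = e_t + p_t.
# Sizes and random weights are illustrative, not from any real model.
rng = np.random.default_rng(0)
V, T, d = 10, 5, 4

E = rng.standard_normal((V, d))   # token embedding matrix, |V| x d
P = rng.standard_normal((T, d))   # learned position embeddings, one per position

token_ids = np.array([3, 1, 4, 1, 5])   # x_1..x_T as integer ids
h0 = E[token_ids] + P[np.arange(T)]     # row lookup + positional offset
print(h0.shape)                          # (T, d)
```

Note that the same token id (here, 1 at positions 2 and 4) gets different input vectors because the positional term differs.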

2. The Transformer block

Each Transformer layer contains two main parts:

  1. masked self-attention
  2. feed-forward network

with residual connections and layer normalization.

2.1 Linear projections to Q, K, and V

Let the input to a layer be

H = (h_1, \dots, h_T) \in \mathbb{R}^{T \times d}.

We compute

Q = H W_Q, \quad K = H W_K, \quad V = H W_V.

Intuitively:

  • Q (query) asks what information the current token is looking for;
  • K (key) describes what kind of information each token offers;
  • V (value) carries the actual content that will be aggregated.

2.2 Scaled dot-product attention

The raw similarity matrix is

S = \frac{QK^\top}{\sqrt{d_k}},

where d_k is the dimensionality of the keys. The \sqrt{d_k} scaling keeps the dot products from growing with dimension, which would push the softmax into saturation.

2.3 Causal masking

For language modeling, token t must not see the future. So we apply a causal mask:

M_{ij} = \begin{cases} 0, & j \le i, \\ -\infty, & j > i. \end{cases}

2.4 Softmax attention

The normalized attention matrix is

A = \text{softmax}(S + M).

2.5 Information aggregation

The output at position i is

\text{output}_i = \sum_{j=1}^{T} a_{i,j} V_j,

where a_{i,j} are the entries of A and V_j is the j-th row of V.
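Sections 2.1 through 2.5 can be put together in a few lines of NumPy. The sketch below assumes a single head with d_k = d and random stand-ins for the learned projections:

```python
import numpy as np

# Masked scaled dot-product attention, single head, d_k = d.
# Weights are random placeholders; shapes are what matters here.
rng = np.random.default_rng(0)
T, d = 4, 8

H = rng.standard_normal((T, d))                    # layer input (h_1..h_T)
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))

Q, K, V = H @ W_Q, H @ W_K, H @ W_V                # 2.1 linear projections
S = Q @ K.T / np.sqrt(d)                           # 2.2 scaled similarities
M = np.where(np.triu(np.ones((T, T)), k=1) == 1,   # 2.3 causal mask:
             -np.inf, 0.0)                         #     -inf above the diagonal
A = np.exp(S + M)
A = A / A.sum(axis=-1, keepdims=True)              # 2.4 row-wise softmax
output = A @ V                                     # 2.5 weighted aggregation

print(np.allclose(A.sum(axis=-1), 1.0))   # each row is a distribution
print(np.allclose(np.triu(A, k=1), 0.0))  # no attention to future tokens
```

The two checks at the end verify the defining properties: each row of A sums to 1, and every entry above the diagonal is exactly 0 because exp(-inf) = 0.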

3. Multi-head attention and FFN

Instead of using a single attention mechanism, the model splits the representation into several heads. Each head can focus on a different type of dependency.
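The split-and-merge bookkeeping is just a reshape. A sketch, assuming d = 8 split into 2 heads of size 4 (real models also apply per-head projections, usually folded into one large weight matrix, which are omitted here):

```python
import numpy as np

# Head splitting: (T, d) -> (n_heads, T, d_head), then the inverse merge.
# Sizes are illustrative assumptions.
T, d, n_heads = 4, 8, 2
d_head = d // n_heads

X = np.arange(T * d, dtype=float).reshape(T, d)
# Each head attends over its own slice of the channels.
heads = X.reshape(T, n_heads, d_head).transpose(1, 0, 2)   # (n_heads, T, d_head)
merged = heads.transpose(1, 0, 2).reshape(T, d)            # concatenate heads back
print(np.array_equal(merged, X))  # True: split and merge are exact inverses
```

Each of the `n_heads` slices then runs the attention of section 2 independently before the results are concatenated and projected back to dimension d.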

After attention, each token representation passes through a position-wise feed-forward network:

\text{FFN}(x) = W_2\, \sigma(W_1 x + b_1) + b_2.
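A sketch of the FFN, assuming d = 8, the common 4d hidden width, and ReLU as σ (the formula above uses column vectors; the code uses the equivalent row-vector convention):

```python
import numpy as np

# Position-wise feed-forward network: expand to 4d, nonlinearity, project back.
# Hidden width 4d and ReLU are common choices, assumed here for illustration.
rng = np.random.default_rng(0)
T, d = 4, 8

W1, b1 = rng.standard_normal((d, 4 * d)), np.zeros(4 * d)
W2, b2 = rng.standard_normal((4 * d, d)), np.zeros(d)

def ffn(x):
    # Applied independently at every position: each row of x is
    # transformed with the same weights, with no mixing across positions.
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

X = rng.standard_normal((T, d))
print(ffn(X).shape)  # (T, d): same shape in and out
```

"Position-wise" means the FFN never mixes information across tokens; all cross-token interaction happens in the attention sublayer.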

4. Final remarks

If I had to summarize the whole picture in one sentence, it would be this:

A decoder-only Transformer is a machine that repeatedly decides what each token should attend to, then updates the representation accordingly, while never looking into the future.

Contributor: Junyuan He