Positional Encodings
Why Positional Information?
Attention is permutation-equivariant: swapping input positions swaps outputs identically. But language has order! "Dog bites man" ≠ "Man bites dog".
We need to inject positional information.
Absolute Positional Encodings
Sinusoidal (Original Transformer)
Vaswani et al. (2017) used fixed sinusoidal functions:
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$ $$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
Properties:
- Each dimension oscillates at different frequency
- Can extrapolate to longer sequences (in theory)
- $PE_{pos+k}$ can be represented as linear function of $PE_{pos}$
Learned Positional Embeddings
Simply learn a lookup table:
$$X^{ia}_{\text{input}} = X^{ia}_{\text{token}} + P^{ia}$$
where $P \in \mathbb{R}^{L_{max} \times d}$ is learned.
Tradeoff:
- More flexible than sinusoidal
- Cannot extrapolate beyond training length
Relative Positional Encodings
The Problem with Absolute
Absolute encodings conflate content with position. The model must learn that "word at position 5" attending to "word at position 3" is similar to "position 10 attending to position 8".
Relative encodings directly encode the offset.
T5-Style Relative Bias
Add a learned bias based on relative position:
$$S^{ij} = \frac{1}{\sqrt{d_k}} Q^{ia} K^{ja} + b_{i-j}$$
where $b_k$ is a learned bias for relative position $k$.
Bucketing: T5 uses logarithmic bucketing for large offsets:
- Exact positions for $|k| \leq 8$
- Bucketed for larger offsets
Transformer-XL Style
Decompose attention into content and position terms:
$$S^{ij} = \underbrace{Q^{ia} K^{ja}}_{\text{content-content}} + \underbrace{Q^{ia} R^{(i-j)a}}_{\text{content-position}} + \underbrace{u^a K^{ja}}_{\text{global content}} + \underbrace{v^a R^{(i-j)a}}_{\text{global position}}$$
where:
- $R^{ka}$: Relative position embeddings
- $u^a, v^a$: Learned global query vectors
ALiBi (Attention with Linear Biases)
Press et al. (2022) proposed a simple approach:
$$S^{ij} = \frac{1}{\sqrt{d_k}} Q^{ia} K^{ja} - m \cdot |i - j|$$
where $m$ is a head-specific slope.
Key insight: No learned positional parameters! Just a linear penalty for distance.
Slopes: Different heads use different slopes: $m_h = 2^{-8h/H}$
| Head | Slope | Effect |
|---|---|---|
| 1 | Large | Very local attention |
| H | Small | Global attention |
Extrapolation: ALiBi extrapolates well to longer sequences than trained on.
Rotary Position Embeddings (RoPE)
Core Idea
Su et al. (2021): Encode position by rotating the query/key vectors.
For position $m$, rotate by angle $m\theta$:
$$f(x, m) = R_m x$$
where $R_m$ is a rotation matrix.
Complex Number Formulation
For 2D, think of $(q_1, q_2)$ as complex number $q_1 + iq_2$:
$$f(q, m) = q \cdot e^{im\theta} = (q_1 + iq_2)(\cos m\theta + i\sin m\theta)$$
Expanding: $$\text{Re}[f(q,m)] = q_1 \cos m\theta - q_2 \sin m\theta$$ $$\text{Im}[f(q,m)] = q_1 \sin m\theta + q_2 \cos m\theta$$
Block-Diagonal Rotation
For $d$-dimensional vectors, apply 2D rotations to pairs:
$$R_m = \begin{pmatrix} \cos m\theta_1 & -\sin m\theta_1 & & & \\ \sin m\theta_1 & \cos m\theta_1 & & & \\ & & \cos m\theta_2 & -\sin m\theta_2 & \\ & & \sin m\theta_2 & \cos m\theta_2 & \\ & & & & \ddots \end{pmatrix}$$
Different frequencies: $\theta_i = 10000^{-2i/d}$
Key Property
The dot product depends only on relative position:
$$(R_m q)^T (R_n k) = q^T R_m^T R_n k = q^T R_{n-m} k$$
This is because rotations compose: $R_m^T R_n = R_{n-m}$.
In Index Notation
For query $Q^{ia}$ at position $i$:
$$\tilde{Q}^{ia} = R^{ab}(i) Q^{ib}$$
Score computation:
$$S^{ij} = \frac{1}{\sqrt{d_k}} \tilde{Q}^{ia} \tilde{K}^{ja} = \frac{1}{\sqrt{d_k}} R^{ab}(i) Q^{ib} R^{ac}(j) K^{jc}$$
Using $R^T(i) R(j) = R(j-i)$:
$$S^{ij} = \frac{1}{\sqrt{d_k}} Q^{ia} R^{ab}(j-i) K^{jb}$$
Efficient Implementation
No explicit matrix multiplication needed! Use:
$$\tilde{q} = q \odot \cos(m\theta) + \text{rotate_half}(q) \odot \sin(m\theta)$$
where rotate_half swaps and negates pairs:
rotate_half([q1, q2, q3, q4, ...]) = [-q2, q1, -q4, q3, ...]
Gradients for RoPE
Since RoPE is just multiplication by rotation matrices:
$$\frac{\partial L}{\partial Q^{ia}} = R^{ab}(i) \frac{\partial L}{\partial \tilde{Q}^{ib}}$$
The rotation is its own transpose (orthogonal), so gradients just rotate back.
Comparison Table
| Method | Learnable | Extrapolation | Relative | Memory |
|---|---|---|---|---|
| Sinusoidal | No | Moderate | No | O(L·d) |
| Learned | Yes | Poor | No | O(L·d) |
| T5 Bias | Yes | Moderate | Yes | O(L²) |
| ALiBi | No | Excellent | Yes | O(1) |
| RoPE | No | Good | Yes | O(L·d) |
Code Example
from attn_tensors.positional import (
sinusoidal_encoding,
rotary_embedding,
apply_rope,
alibi_bias,
)
seq_len, d_model = 100, 64
# Sinusoidal
pos_enc = sinusoidal_encoding(seq_len, d_model)
X = X + pos_enc
# RoPE
Q_rotated = apply_rope(Q, positions)
K_rotated = apply_rope(K, positions)
scores = Q_rotated @ K_rotated.T / jnp.sqrt(d_k)
# ALiBi
scores = Q @ K.T / jnp.sqrt(d_k)
scores = scores + alibi_bias(seq_len, num_heads)
RoPE: Worked Example
Setup: $d = 4$, position $m = 2$, $\theta = [1.0, 0.1]$
Query: $q = [1, 0, 1, 0]$
Rotation angles: $m\theta = [2.0, 0.2]$
Apply rotation to pairs:
Pair 1: $(q_1, q_2) = (1, 0)$ $$\tilde{q}_1 = 1 \cdot \cos(2) - 0 \cdot \sin(2) = \cos(2) \approx -0.42$$ $$\tilde{q}_2 = 1 \cdot \sin(2) + 0 \cdot \cos(2) = \sin(2) \approx 0.91$$
Pair 2: $(q_3, q_4) = (1, 0)$ $$\tilde{q}_3 = 1 \cdot \cos(0.2) - 0 \cdot \sin(0.2) \approx 0.98$$ $$\tilde{q}_4 = 1 \cdot \sin(0.2) + 0 \cdot \cos(0.2) \approx 0.20$$
Result: $\tilde{q} \approx [-0.42, 0.91, 0.98, 0.20]$