Statistical Mechanics View

Softmax as Gibbs Distribution

The attention weights form a Gibbs distribution (Boltzmann distribution):

$$A^{ij} = \frac{e^{\beta S^{ij}}}{Z^i}$$

where:

$\beta$ is the inverse temperature
$Z^i = \sum_j e^{\beta S^{ij}}$ is the partition function

In standard attention, $\beta = 1$, with the usual $1/\sqrt{d_k}$ scaling already absorbed into the scores $S^{ij}$.
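A minimal JAX sketch of the Gibbs form, computing the weights through the log partition function. The helper name `gibbs_attention` and the example scores are illustrative assumptions, not part of any library API.

```python
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def gibbs_attention(scores, beta=1.0):
    """Return (A, log_Z) for scores of shape (n_q, n_k). Hypothetical helper."""
    # log Z^i = log sum_j exp(beta * S^{ij}), computed in a numerically stable way
    log_Z = logsumexp(beta * scores, axis=-1, keepdims=True)
    # A^{ij} = exp(beta * S^{ij}) / Z^i
    A = jnp.exp(beta * scores - log_Z)
    return A, log_Z.squeeze(-1)

# Illustrative scores: 2 queries, 3 keys
scores = jnp.array([[2.0, 1.0, 0.0],
                    [0.5, 0.5, 0.5]])
A, log_Z = gibbs_attention(scores, beta=1.0)
print(A)      # each row sums to 1
print(log_Z)  # one log partition function per query
```

Working with $\log Z^i$ rather than $Z^i$ avoids overflow when $\beta S^{ij}$ is large.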

Temperature Effects

The temperature $T = 1/\beta$ controls attention sharpness:

| Temperature | $\beta$ | Behavior |
|---|---|---|
| High ($T \to \infty$) | $\beta \to 0$ | Uniform attention |
| Normal ($T = 1$) | $\beta = 1$ | Standard attention |
| Low ($T \to 0$) | $\beta \to \infty$ | Hard attention (argmax) |

High Temperature Limit

As $\beta \to 0$:

$$A^{ij} \to \frac{1}{n_k}$$

All keys receive equal attention.

Low Temperature Limit

As $\beta \to \infty$:

$$A^{ij} \to \begin{cases} 1 & \text{if } j = \arg\max_{j'} S^{ij'} \\ 0 & \text{otherwise} \end{cases}$$

Only the highest-scoring key receives attention. If several keys tie for the maximum, they share the weight equally.
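Both limits can be checked numerically. This is a small sketch with made-up scores; the values `1e-4` and `1e3` simply stand in for $\beta \to 0$ and $\beta \to \infty$.

```python
import jax
import jax.numpy as jnp

scores = jnp.array([[1.0, 3.0, 2.0, 0.0]])  # one query, n_k = 4 keys (illustrative)
n_k = scores.shape[-1]

A_hot = jax.nn.softmax(1e-4 * scores, axis=-1)   # beta close to 0: high temperature
A_cold = jax.nn.softmax(1e3 * scores, axis=-1)   # beta very large: low temperature

uniform = jnp.full_like(scores, 1.0 / n_k)
one_hot = jax.nn.one_hot(jnp.argmax(scores, axis=-1), n_k)

print(jnp.allclose(A_hot, uniform, atol=1e-3))   # close to 1 / n_k everywhere
print(jnp.allclose(A_cold, one_hot, atol=1e-6))  # all weight on the argmax key
```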

Entropy of Attention

The entropy of row $i$ measures how spread out the attention of query $i$ is across the keys:

$$H^i = -\sum_j A^{ij} \log A^{ij}$$

Properties:

Maximum entropy ($H = \log n_k$): Uniform attention
Minimum entropy ($H = 0$): Hard attention on single key
Lower entropy = more focused attention
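The two extremes can be reproduced with a few lines of JAX. The function `entropy` below is a hypothetical stand-in written for this sketch (it is not the library's `attention_entropy`), and it uses the convention $0 \log 0 = 0$.

```python
import jax.numpy as jnp
from jax.scipy.special import xlogy

def entropy(A):
    """Shannon entropy in nats for each row of A, with 0 * log 0 treated as 0."""
    return -jnp.sum(xlogy(A, A), axis=-1)

A = jnp.array([
    [0.25, 0.25, 0.25, 0.25],  # uniform attention over n_k = 4 keys
    [1.0, 0.0, 0.0, 0.0],      # hard attention on a single key
    [0.7, 0.1, 0.1, 0.1],      # somewhere in between
])
H = entropy(A)
print(H)             # first entry equals log(4), second is 0
print(jnp.log(4.0))
```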

Free Energy

The free energy connects entropy and energy:

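$$F^i = -\frac{1}{\beta} \log Z^i = -\langle S^{ij} \rangle - \frac{1}{\beta} H^i$$

where $\langle S^{ij} \rangle = \sum_j A^{ij} S^{ij}$ is the expected score. The identity follows from $\log A^{ij} = \beta S^{ij} - \log Z^i$, which gives $H^i = \log Z^i - \beta \langle S^{ij} \rangle$.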

Connection to Hopfield Networks

Ramsauer et al. (2020) showed attention implements a Hopfield network update:

$$\xi^{\text{new}} = V^T \text{softmax}(\beta K \xi)$$

The stored patterns are the rows of $K$. With $V = K$ this is the Hopfield update itself, which pulls the query toward the stored pattern most similar to it.

Storage Capacity

Classical Hopfield networks store $\sim 0.14 n$ patterns.

Modern (attention-based) Hopfield networks can store exponentially many patterns: $\sim e^{d/2}$ for $d$-dimensional patterns.

Code Example

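import jax
import jax.numpy as jnp

from attn_tensors.softmax import (
    softmax_temperature,
    attention_entropy,
    log_partition_function,
)

# Random scores for 10 queries and 20 keys
scores = jax.random.normal(jax.random.PRNGKey(0), (10, 20))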

# Standard softmax (beta = 1)
A_normal = softmax_temperature(scores, beta=1.0)

# Sharp attention (low temperature)
A_sharp = softmax_temperature(scores, beta=10.0)

# Soft attention (high temperature)
A_soft = softmax_temperature(scores, beta=0.1)

# Compute entropy
H = attention_entropy(A_normal)  # shape: (10,)

# Log partition function
log_Z = log_partition_function(scores, beta=1.0)

Physical Interpretation

Think of attention as a physical system:

Keys = possible states
Scores = negative energies (higher score = lower energy = more probable)
Temperature = randomness in state selection
Partition function = normalization over states

At low temperature, the system "freezes" into the lowest energy state (highest score). At high temperature, all states are equally likely.

Energy-Based View

Attention as Energy Minimization

Define an energy function:

$$E(i, j) = -S^{ij} = -\frac{1}{\sqrt{d_k}} Q^{ia} K^{ja}$$

The attention weight is the Boltzmann probability:

$$A^{ij} = \frac{e^{-\beta E(i,j)}}{Z^i} = \frac{e^{\beta S^{ij}}}{Z^i}$$

Free Energy

The free energy for query $i$ is:

$$F^i = -\frac{1}{\beta} \log Z^i$$

This has the familiar form:

$$F^i = \langle E \rangle - \frac{1}{\beta} H^i$$

where $\langle E \rangle = -\sum_j A^{ij} S^{ij}$ is the expected energy and $H^i$ is the entropy.
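The decomposition $F^i = \langle E \rangle - H^i / \beta$ can be verified numerically. This is a sketch on random scores; `beta = 2.0` and the array shape are arbitrary choices.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

beta = 2.0
S = jax.random.normal(jax.random.PRNGKey(0), (4, 6))  # arbitrary scores: 4 queries, 6 keys
E = -S                                                 # energies E(i, j) = -S^{ij}

A = jax.nn.softmax(beta * S, axis=-1)
F = -logsumexp(beta * S, axis=-1) / beta               # F^i = -(1/beta) log Z^i

mean_E = jnp.sum(A * E, axis=-1)                       # expected energy per query
H = -jnp.sum(A * jnp.log(A), axis=-1)                  # entropy per query
F_decomposed = mean_E - H / beta

print(jnp.allclose(F, F_decomposed, atol=1e-5))
```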

Variational Principle

The attention weights minimize:

$$A^* = \arg\min_A \left[ \langle E \rangle - \frac{1}{\beta} H(A) \right]$$

subject to $\sum_j A^{ij} = 1$ and $A^{ij} \geq 0$.

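The minimizer is exactly the softmax $A^{ij} = e^{\beta S^{ij}} / Z^i$, and the minimum value of the objective is the free energy $F^i$.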
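A short derivation sketch for a fixed query $i$, using a Lagrange multiplier $\lambda$ (introduced here for illustration) to enforce the normalization constraint:

$$\mathcal{L} = \sum_j A^{ij} E(i,j) + \frac{1}{\beta} \sum_j A^{ij} \log A^{ij} + \lambda \Big( \sum_j A^{ij} - 1 \Big)$$

$$\frac{\partial \mathcal{L}}{\partial A^{ij}} = E(i,j) + \frac{1}{\beta} \left( \log A^{ij} + 1 \right) + \lambda = 0 \quad \Rightarrow \quad A^{ij} = e^{-\beta E(i,j)} \, e^{-1 - \beta \lambda}$$

Normalizing fixes the constant factor to $1/Z^i$, so $A^{ij} = e^{\beta S^{ij}} / Z^i$. The objective is convex in $A$ (a linear term plus negative entropy), so this stationary point is the minimum, and the constraint $A^{ij} \geq 0$ holds automatically.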

Deep Dive: Hopfield Networks

Classical Hopfield (1982)

Energy function:

$$E = -\frac{1}{2} \sum_{i,j} W_{ij} s_i s_j$$

where $s_i \in \{-1, +1\}$ and $W_{ij}$ are synaptic weights.

Update rule:

$$s_i \leftarrow \text{sign}\left(\sum_j W_{ij} s_j\right)$$

Storage capacity: $\sim 0.14 N$ patterns for $N$ neurons.
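A small sketch of classical retrieval. The text does not say how $W_{ij}$ is built, so this assumes the standard Hebbian rule $W = \frac{1}{N} \sum_\mu p^\mu (p^\mu)^T$ with zero diagonal, and it applies the update to all neurons at once rather than one at a time. The patterns are made up for illustration.

```python
import jax.numpy as jnp

# Two orthogonal +/-1 patterns over N = 8 neurons (illustrative data)
patterns = jnp.array([
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
], dtype=jnp.float32)
N = patterns.shape[1]

# Assumed Hebbian weights, with self-connections removed
W = patterns.T @ patterns / N
W = W - jnp.diag(jnp.diag(W))

def energy(s):
    """E = -1/2 * sum_ij W_ij s_i s_j."""
    return -0.5 * s @ W @ s

def update(s):
    """Synchronous s_i <- sign(sum_j W_ij s_j), with ties broken toward +1."""
    return jnp.where(W @ s >= 0, 1.0, -1.0)

noisy = patterns[0].at[0].set(-1.0)   # flip one bit of the first pattern
recalled = update(noisy)

print(recalled)                        # matches patterns[0]
print(energy(noisy), energy(recalled)) # energy decreases after the update
```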

Modern Hopfield (Ramsauer et al., 2020)

Energy function:

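$$E = -\frac{1}{\beta} \text{lse}(\beta K \xi) + \frac{1}{2}\xi^T \xi + \text{const}$$

where $\text{lse}(x) = \log \sum_i e^{x_i}$ is the log-sum-exp. The factor $1/\beta$ is what makes the stationary point of $E$ coincide with the update rule below.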

Update rule:

$$\xi^{new} = K^T \text{softmax}(\beta K \xi)$$

This is exactly attention with keys and values both set to the stored patterns. The query $\xi$ is updated to be a weighted combination of stored patterns (rows of $K$).
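The update can be iterated directly. Below is a minimal sketch with three orthogonal stored patterns and a noisy query; `hopfield_update`, the data, and `beta = 8.0` are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def hopfield_update(K, xi, beta):
    """One step xi <- K^T softmax(beta * K xi); rows of K are the stored patterns."""
    return K.T @ jax.nn.softmax(beta * (K @ xi))

K = jnp.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
xi = jnp.array([0.8, 0.3, 0.1])   # noisy version of the first pattern

for step in range(3):
    xi = hopfield_update(K, xi, beta=8.0)
    print(step, xi)               # moves toward the first row of K
```

With well-separated patterns and a reasonably large $\beta$, a single step is usually enough to land near the stored pattern.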

Why Exponential Capacity?

Classical Hopfield fails when patterns have overlap (correlation). The error probability grows with pattern density.

Modern Hopfield uses exponential separation:

$$\text{softmax}(\beta x)_i \approx \begin{cases} 1 & x_i = \max(x) \\ e^{-\beta \Delta} & x_i = \max(x) - \Delta \end{cases}$$

For large $\beta$, even small separation $\Delta$ gives clean retrieval.

Capacity: $\sim \exp(d/2)$ patterns in $d$ dimensions!
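The exponential suppression of non-maximal entries is easy to see numerically. In this sketch (scores chosen arbitrarily), each weight divided by the winner's weight equals $e^{-\beta \Delta}$.

```python
import jax
import jax.numpy as jnp

x = jnp.array([1.0, 0.8, 0.2])   # illustrative scores; the winner is x[0]
delta = x.max() - x              # gaps to the best score: 0, 0.2, 0.8

def relative_weights(beta):
    """Softmax weights divided by the winner's weight (hypothetical helper)."""
    a = jax.nn.softmax(beta * x)
    return a / a.max()

for beta in (1.0, 10.0, 50.0):
    ratio = relative_weights(beta)
    # ratio[1] is the runner-up's weight relative to the winner
    print(beta, float(ratio[1]), float(jnp.exp(-beta * delta[1])))
```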

Attention as Associative Memory

| Attention | Hopfield |
|---|---|
| Query $Q$ | Pattern to retrieve |
| Keys $K$ | Stored patterns |
| Values $V$ | Pattern outputs |
| Softmax | Update rule |
| Output $O$ | Retrieved pattern |

Worked Example: Pattern Retrieval

Setup: Store 3 patterns as keys, retrieve closest to query.

$$K = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0.7 & 0.7 \end{pmatrix}, \quad q = \begin{pmatrix} 0.9 \\ 0.1 \end{pmatrix}$$

Step 1: Compute scores

$$s = K q = \begin{pmatrix} 0.9 \\ 0.1 \\ 0.7 \end{pmatrix}$$

Step 2: Apply softmax (with $\beta = 1$)

$$a = \text{softmax}(s) \approx \begin{pmatrix} 0.44 \\ 0.20 \\ 0.36 \end{pmatrix}$$

Step 3: Retrieve pattern

$$\xi^{new} = K^T a \approx \begin{pmatrix} 0.44 + 0.25 \\ 0.20 + 0.25 \end{pmatrix} = \begin{pmatrix} 0.69 \\ 0.45 \end{pmatrix}$$

The retrieved vector is a blend of all three patterns, weighted most heavily toward pattern 1 (the highest-scoring key). At $\beta = 1$ the scores are too close together for a clean retrieval.

With low temperature ($\beta = 5$):

$$a = \text{softmax}(5s) \approx \begin{pmatrix} 0.72 \\ 0.01 \\ 0.27 \end{pmatrix}$$

Now retrieval is sharper: pattern 1 dominates, and $\xi^{new} \approx (0.91, 0.20)^T$ lands close to pattern 1.
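The three steps can be checked numerically with a few lines of JAX. The `results` dictionary is just a convenience for this sketch.

```python
import jax
import jax.numpy as jnp

K = jnp.array([[1.0, 0.0],
               [0.0, 1.0],
               [0.7, 0.7]])
q = jnp.array([0.9, 0.1])

s = K @ q                            # Step 1: scores
results = {}
for beta in (1.0, 5.0):
    a = jax.nn.softmax(beta * s)     # Step 2: attention weights
    xi_new = K.T @ a                 # Step 3: retrieved pattern
    results[beta] = (a, xi_new)
    print(f"beta={beta}: a={jnp.round(a, 2)}, xi_new={jnp.round(xi_new, 2)}")
```

Increasing `beta` further pushes the weights toward a one-hot vector on the first key.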

Thermodynamic Quantities

Heat Capacity

The heat capacity measures sensitivity to temperature:

$$C = \frac{\partial \langle E \rangle}{\partial T} = \beta^2 \text{Var}(E)$$

The heat capacity is largest at temperatures where attention is "deciding" between multiple keys, the analogue of a phase transition in this finite system.
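The fluctuation formula can be compared against a direct derivative. This sketch takes $k_B = 1$, uses made-up energies for a single query, and relies on `jax.grad` for $\partial \langle E \rangle / \partial T$.

```python
import jax
import jax.numpy as jnp

E = jnp.array([-0.9, -0.1, -0.7])    # energies E = -S for one query (illustrative)

def mean_energy(T):
    p = jax.nn.softmax(-E / T)       # Boltzmann weights at temperature T
    return jnp.sum(p * E)

def heat_capacity_fluct(T):
    p = jax.nn.softmax(-E / T)
    var_E = jnp.sum(p * E**2) - jnp.sum(p * E) ** 2
    return var_E / T**2              # beta^2 * Var(E)

T = 0.5
C_grad = jax.grad(mean_energy)(T)    # d<E>/dT by automatic differentiation
C_fluct = heat_capacity_fluct(T)
print(C_grad, C_fluct)               # the two values agree
```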

Susceptibility

Response to perturbation in scores:

$$\chi^{ij}_{kl} = \frac{\partial A^{ij}}{\partial S^{kl}}$$

This is exactly the softmax Jacobian we derived for gradients!
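For a single query the susceptibility reduces to a matrix, $\partial A_j / \partial S_l = \beta A_j (\delta_{jl} - A_l)$, and entries that mix different queries vanish. The sketch below compares that closed form with `jax.jacobian`; the scores and `beta` are illustrative.

```python
import jax
import jax.numpy as jnp

beta = 2.0
s = jnp.array([0.9, 0.1, 0.7])       # scores for a single query (illustrative)

def attn(scores):
    return jax.nn.softmax(beta * scores)

a = attn(s)
chi_autodiff = jax.jacobian(attn)(s)                  # chi[j, l] = dA_j / dS_l
chi_closed = beta * (jnp.diag(a) - jnp.outer(a, a))   # beta * A_j * (delta_jl - A_l)

print(jnp.allclose(chi_autodiff, chi_closed, atol=1e-6))
print(chi_autodiff.sum(axis=0))      # columns sum to 0 because the weights sum to 1
```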

Connection to Information Theory

Mutual Information

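Reading each row of $A$ as a conditional distribution over keys given the query, the attention weights give an estimate of the mutual information between queries and keys:

$$I(Q; K) \approx H(A) - H(A|Q)$$

where $H(A)$ is the entropy of the attention distribution averaged over queries, and $H(A|Q)$ is the average of the per-query entropies $H^i$.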

KL Divergence and Attention

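The softmax also solves a KL-regularized problem: it minimizes the expected energy plus a KL penalty toward a uniform prior:

$$A^* = \arg\min_A \left[ -\sum_j A^{ij} S^{ij} + \frac{1}{\beta} D_{\text{KL}}(A \,\|\, U) \right]$$

where $U$ is the uniform distribution. Since $D_{\text{KL}}(A \,\|\, U) = \log n_k - H(A)$, this objective differs from $\langle E \rangle - \frac{1}{\beta} H(A)$ only by a constant.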
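A numerical check of the KL form, as a sketch: candidate distributions are parameterized by logits so they stay on the simplex, the gradient of the objective vanishes at $A = \text{softmax}(\beta s)$, and random candidates do no better. All names and values here are illustrative.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

beta = 2.0
s = jnp.array([0.9, 0.1, 0.7])       # scores for a single query (illustrative)
n = s.shape[0]
U = jnp.full(n, 1.0 / n)             # uniform prior

def kl_objective(logits):
    A = jax.nn.softmax(logits)       # any point in the interior of the simplex
    kl = jnp.sum(A * (jnp.log(A) - jnp.log(U)))
    return -jnp.sum(A * s) + kl / beta

f_opt = kl_objective(beta * s)                   # objective at A = softmax(beta * s)
g_opt = jax.grad(kl_objective)(beta * s)         # gradient there is (numerically) zero

others = jax.random.normal(jax.random.PRNGKey(0), (100, n))
vals = jax.vmap(kl_objective)(others)            # objective at random candidates

print(g_opt, float(f_opt), float(vals.min()))
```

The optimal value equals the free energy shifted by a constant, $F + \frac{1}{\beta} \log n_k$.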