Note to reader: this article assumes knowledge of statistical mechanics. I highly recommend the thermal physics lecture notes of the Oxford Physics undergraduate programme, a beautiful exposition of the subject written by the brilliant Alexander Schekochihin. The notes logically construct statistical mechanics with great clarity, without sparing the reader the necessary complexities. I draw most of the mathematical material in this article from Chapter 5 of Convex Optimization by Boyd and Vandenberghe, which is an accurate but accessible introduction to the mathematics of optimisation. Both of these materials are freely available online as PDFs.

The heart of statistical mechanics is the maximum entropy optimisation problem

$$ \min_{p} \; f(p), $$

where the objective function is the negative entropy (we work in units where $k_B = 1$)

$$ f(p) = \sum_\alpha p_\alpha \ln p_\alpha $$

and is subject to the equality constraints

  • normalised probability: $\sum_\alpha p_\alpha = 1$
  • fixed average energy: $\sum_\alpha p_\alpha E_\alpha = U$
  • fixed average number of particles of each species $i$: $\sum_\alpha p_\alpha N_{i\alpha} = \bar{N}_i$.

This is the starting point for obtaining the probability $p_\alpha$ of finding a state $\alpha$ with energy $E_\alpha$ and number of particles $N_{i\alpha}$ of the $i$th species in a grand canonical ensemble. Systems in the grand canonical ensemble have a fixed average internal energy $U$ and numbers of particles $\bar{N}_i$.
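Before we develop the theory, it may help to see the problem as a concrete computation. Below is a minimal numerical sketch in Python: the six states, their energies and (single-species) particle numbers, and the target averages are all made-up toy values, and the choice of solver and starting point is mine rather than anything canonical.

```python
import numpy as np
from scipy.optimize import minimize

# Toy system: six states with made-up energies and (single-species)
# particle numbers; U and N_bar are likewise made-up feasible targets.
E = np.array([0.0, 1.0, 1.0, 2.0, 2.0, 3.0])   # E_alpha for each state
N = np.array([0.0, 1.0, 0.0, 1.0, 2.0, 2.0])   # N_alpha for each state
U, N_bar = 1.2, 0.8

def neg_entropy(p):
    # the objective f(p) = sum_a p_a ln p_a  (k_B = 1)
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},  # normalisation
    {"type": "eq", "fun": lambda p: p @ E - U},        # fixed average energy
    {"type": "eq", "fun": lambda p: p @ N - N_bar},    # fixed average N
]

p0 = np.full(len(E), 1.0 / len(E))  # start from the uniform distribution
res = minimize(neg_entropy, p0, method="SLSQP", constraints=constraints,
               bounds=[(1e-12, 1.0)] * len(E))
print(res.x)                  # the maximum entropy distribution p_alpha
print(res.x @ E, res.x @ N)   # constraint check: ~1.2 and ~0.8
```

As we will derive below, the resulting distribution has the Gibbs form $p_\alpha \propto e^{-\beta(E_\alpha - \mu N_\alpha)}$.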

Most undergraduate physics textbooks instruct students to find $p_\alpha$ using the method of Lagrange multipliers. Often the proof that the method works is briskly sketched out or motivated via geometrically intuitive arguments, and the student is hurried along to apply the method. But it is an intellectual loss to hurry along, because the mathematics behind the method of Lagrange multipliers - the theory of constrained optimisation - is beautiful and fascinating. As we will demonstrate, this formalism informs the physics too.

Theory of Constrained Optimisation

Constrained optimisation is a vast field; here we limit our focus to the mathematics specific to the problem of entropy maximisation. We focus on equality constraints and leave the discussion of inequality constraints to other texts, such as Boyd and Vandenberghe. Moreover we focus on the optimisation of convex functions, which are much simpler to deal with: a convex function has no local minima other than its global ones, and a strictly convex function such as negative entropy has a unique global minimiser. Many neat theoretical results hold in this setting, and we will exploit several of them below.

The general class of problems we discuss here are of the form

$$ \min_{x \in \mathbb{R}^n} f(x), $$

where $f : \mathbb{R}^n \to \mathbb{R}$ is convex and the allowed values of $x$ are subject to affine equality constraints of the form

$$ Ax = b, $$

where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. We say $x$ is feasible if it satisfies the constraints, and optimal if it minimises $f$ over the feasible subset of $\mathbb{R}^n$.
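For concreteness, here is the max entropy problem cast in this affine form with $x = p$; the stacking of the constraints into rows is my own bookkeeping, with one $N$-row (and one entry of $b$) per species $i$:

```latex
\begin{equation*}
\underbrace{\begin{pmatrix}
1 & 1 & \cdots & 1 \\
E_1 & E_2 & \cdots & E_n \\
N_{i1} & N_{i2} & \cdots & N_{in}
\end{pmatrix}}_{A}
\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_n \end{pmatrix}
=
\underbrace{\begin{pmatrix} 1 \\ U \\ \bar{N}_i \end{pmatrix}}_{b}.
\end{equation*}
```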

The Lagrangian

When we use the method of Lagrange multipliers, we write down a Lagrangian such as the following for the entropy maximisation problem:

$$ L(p, \lambda) = \sum_\alpha p_\alpha \ln p_\alpha + \lambda_1 \Big( \sum_\alpha p_\alpha - 1 \Big) + \lambda_2 \Big( \sum_\alpha p_\alpha E_\alpha - U \Big) + \sum_i \lambda_{2+i} \Big( \sum_\alpha p_\alpha N_{i\alpha} - \bar{N}_i \Big). $$

For notational purposes, we collect the Lagrange multipliers, or dual variables, $\lambda_1$, $\lambda_2$ and $\lambda_{2+i}$ into entries of the vector $\lambda$. Then

$$ L(x, \lambda) = f(x) + \lambda^T (Ax - b). $$

Observe that $L(x, \lambda)$ reduces to $f(x)$ if $x$ satisfies the primal constraints $Ax = b$. We enforce the constraints on $x$ in the primal problem by taking the supremum of the Lagrangian over the dual variables; if the constraints are not satisfied, $\sup_\lambda L(x, \lambda) = +\infty$. This is because when $Ax - b \neq 0$ we can make $\lambda^T (Ax - b)$ in the Lagrangian arbitrarily large when we take the supremum over $\lambda$. Minimising $\sup_\lambda L(x, \lambda)$ can therefore only find minimisers that satisfy the constraints; in other words, it is equivalent to finding the minimisers of the primal problem. Supposing feasible points that satisfy the constraints exist, a minimiser of $\sup_\lambda L(x, \lambda)$ is also a minimiser of the primal problem, and vice versa:

$$ \inf_x \sup_\lambda L(x, \lambda) = \inf_{x \,:\, Ax = b} f(x) \equiv p^*. $$

We call this the primal problem.
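To see how taking the supremum enforces the constraints, consider a toy example of my own: minimise $f(x) = x^2$ subject to the single affine constraint $x = 1$, with Lagrangian $L(x, \lambda) = x^2 + \lambda(x - 1)$.

```latex
\begin{equation*}
\sup_\lambda L(x, \lambda) =
\begin{cases}
x^2 & \text{if } x = 1, \\
+\infty & \text{otherwise},
\end{cases}
\qquad\text{so}\qquad
p^* = \inf_x \sup_\lambda L(x, \lambda) = 1.
\end{equation*}
```

The supremum acts as an infinite penalty off the constraint set, and the surviving values reproduce the constrained minimum.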

The Dual Problem and Duality

Instead of taking first the supremum and then the infimum of the Lagrangian, we can take the infimum first and then maximise the resulting function of the dual variables, $g(\lambda) \equiv \inf_x L(x, \lambda)$, known as the dual function. We call this problem the dual problem of our primal problem:

$$ d^* = \sup_\lambda g(\lambda) = \sup_\lambda \inf_x L(x, \lambda). $$

How do the primal and dual problems relate to each other? We make use of the Max-min inequality:

Theorem (Max-min inequality). For any function $f : X \times Y \to \mathbb{R}$,

$$ \sup_{y \in Y} \inf_{x \in X} f(x, y) \le \inf_{x \in X} \sup_{y \in Y} f(x, y). $$

We direct the reader to the Wikipedia page on the inequality for its very simple proof. Applying the inequality to the primal and dual problems, we have

$$ \sup_\lambda \inf_x L(x, \lambda) \le \inf_x \sup_\lambda L(x, \lambda). $$

In other words, the optimal value of the dual problem sets a lower bound for the optimal value $p^*$ of the primal problem:

$$ d^* \le p^*. $$

We call this feature of optimisation problems weak duality. We say that whenever

$$ d^* = p^* $$

the problem satisfies strong duality. Since weak duality always holds, strong duality means the dual lower bound is tight. Strong duality does not hold for every problem, but there is a simple sufficient condition for it:

Theorem (Slater’s theorem for Strong Duality). Suppose $f$ is convex and there exists $x \in \operatorname{relint} \mathcal{D}$ (the relative interior of the problem domain) satisfying the strict versions of any convex inequality constraints and satisfying all affine equality constraints. Then strong duality holds.

The reader can delve into the technical details of the theorem and its proof in pages 226 and 234-236 of Boyd and Vandenberghe.

Because negative entropy is convex and our constraints are affine equalities, Slater's condition reduces to the existence of a feasible point, and we in fact have strong duality in our max entropy problem. This encourages us to approach our optimisation problem from its dual.
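As a quick illustration, here is a hedged numerical sketch of weak and strong duality on the toy system from the first snippet. The closed form for the dual function below uses the minimisation of the Lagrangian over $p$, which we carry out analytically when we derive statistical mechanics; the rest is a spot check that every $g(\lambda)$ sits below $p^*$ and that the maximum of $g$ attains it.

```python
import numpy as np
from scipy.optimize import minimize

# Same toy system as in the first snippet (one species).
E = np.array([0.0, 1.0, 1.0, 2.0, 2.0, 3.0])
N = np.array([0.0, 1.0, 0.0, 1.0, 2.0, 2.0])
A = np.vstack([np.ones_like(E), E, N])   # constraint rows: 1, E, N
b = np.array([1.0, 1.2, 0.8])            # right-hand sides: 1, U, N_bar

def g(lam):
    # Dual function g(lam) = inf_p L(p, lam). The infimum is attained at
    # p_a = exp(-1 - (A^T lam)_a), which gives the closed form
    # g(lam) = -sum_a exp(-1 - (A^T lam)_a) - lam . b
    return -np.sum(np.exp(-1.0 - A.T @ lam)) - lam @ b

# Weak duality: every choice of lam yields a lower bound on p*.
rng = np.random.default_rng(0)
for lam in rng.normal(size=(5, 3)):
    print(g(lam))   # all of these lie below the primal optimum

# Strong duality: maximising g attains the value found by the
# constrained primal solve in the first snippet.
res = minimize(lambda lam: -g(lam), np.zeros(3))
print(-res.fun)     # d* = p*
```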

Before we proceed to solve our maximum entropy problem, we take a detour to look at saddle points and their relation to strong duality.

Saddle Points and Strong Duality

Definition: $(\tilde{x}, \tilde{\lambda})$ is a saddle point for $L$ iff

$$ L(\tilde{x}, \lambda) \le L(\tilde{x}, \tilde{\lambda}) \le L(x, \tilde{\lambda}) \quad \text{for all } x, \lambda. $$

Proposition: $(x^*, \lambda^*)$ is a saddle point of $L$ iff $x^*$ and $\lambda^*$ are primal and dual optimal respectively, and strong duality holds for the problem.

Proof. “⇐” Suppose $x^*$, $\lambda^*$ are primal and dual optimal and strong duality holds. Strong duality implies $f(x^*) = g(\lambda^*) = \inf_x L(x, \lambda^*)$. Since $x^*$ is primal feasible, we can put $x = x^*$ into the Lagrangian and get $L(x^*, \lambda^*) = f(x^*)$. Therefore

$$ L(x^*, \lambda^*) = \inf_x L(x, \lambda^*) \le L(x, \lambda^*) \quad \text{for all } x. \qquad \text{(i)} $$

However, since $x^*$ must be primal feasible, with only equality constraints we have $L(x^*, \lambda) = f(x^*)$ for all $\lambda$. In particular,

$$ L(x^*, \lambda) = L(x^*, \lambda^*) \quad \text{for all } \lambda. \qquad \text{(ii)} $$

Combining (i) and (ii), we obtain the saddle point condition

$$ L(x^*, \lambda) \le L(x^*, \lambda^*) \le L(x, \lambda^*) \quad \text{for all } x, \lambda. $$

“⇒” Suppose we have the saddle point condition for $(\tilde{x}, \tilde{\lambda})$. We first prove that $\tilde{x}$ is indeed primal feasible. Writing out the Lagrangian in the left-hand relation of the saddle point condition, $L(\tilde{x}, \lambda) \le L(\tilde{x}, \tilde{\lambda})$ implies

$$ f(\tilde{x}) + \lambda^T (A\tilde{x} - b) \le f(\tilde{x}) + \tilde{\lambda}^T (A\tilde{x} - b) \quad \text{for all } \lambda, $$

or $\lambda^T (A\tilde{x} - b) \le \tilde{\lambda}^T (A\tilde{x} - b)$ for all $\lambda$. This can only be true if $A\tilde{x} - b = 0$, i.e. $\tilde{x}$ is primal feasible, and consequently $L(\tilde{x}, \tilde{\lambda}) = f(\tilde{x})$.

We proceed to show that $\tilde{x}$ is indeed primal optimal. The right-hand side of the saddle point condition states that $L(\tilde{x}, \tilde{\lambda}) \le L(x, \tilde{\lambda})$ for all $x$. In particular, the inequality must also hold if we restrict $x$ to the subset of $\mathbb{R}^n$ which is primal feasible. But in the feasible subset, $L(x, \tilde{\lambda}) = f(x)$, and we have just shown $L(\tilde{x}, \tilde{\lambda}) = f(\tilde{x})$. Therefore $f(\tilde{x}) \le f(x)$ for all feasible $x$; in other words, $\tilde{x}$ is primal optimal.

To show that $\tilde{\lambda}$ is dual optimal, we observe that by definition $g(\tilde{\lambda}) = \inf_x L(x, \tilde{\lambda})$. But from the right-hand side of the saddle point condition, $\inf_x L(x, \tilde{\lambda}) = L(\tilde{x}, \tilde{\lambda})$, and hence $g(\tilde{\lambda}) = L(\tilde{x}, \tilde{\lambda}) = f(\tilde{x}) = p^*$. Since weak duality holds generally, $d^* \le p^*$. But $d^* = \sup_\lambda g(\lambda) \ge g(\tilde{\lambda}) = p^*$, therefore $d^* = p^*$. We have therefore shown that $\tilde{\lambda}$ is dual optimal and strong duality holds. ∎

Using Duality to Solve Optimisation Problems

We are now in a position to prove that the method of Lagrange multipliers indeed works!

Theorem: if strong duality holds for the optimisation problem, and $L(x, \lambda^*)$ has a unique minimiser $\tilde{x}$ over $x$ (where $\lambda^*$ is dual optimal), then $\tilde{x} = x^*$, the primal optimal point.

Proof. Since strong duality holds, there exists a feasible minimiser $x^*$ of the primal problem and a maximiser $\lambda^*$ of the dual function $g$. As a consequence of strong duality, the saddle point of the Lagrangian is primal and dual optimal, satisfying $L(x^*, \lambda) \le L(x^*, \lambda^*) \le L(x, \lambda^*)$ for all $x, \lambda$. Choosing $x = \tilde{x}$, we get $L(x^*, \lambda^*) \le L(\tilde{x}, \lambda^*)$. Since $\tilde{x}$ minimises $L(x, \lambda^*)$, we also have $L(\tilde{x}, \lambda^*) \le L(x^*, \lambda^*)$. To satisfy both inequalities, we conclude that $L(\tilde{x}, \lambda^*) = L(x^*, \lambda^*)$. Now $\tilde{x}$ is the unique minimiser of $L(x, \lambda^*)$ over all $x$, and $x^*$ attains the same minimal value, so $x^* = \tilde{x}$. Therefore $\tilde{x}$ must be feasible and therefore optimal. ∎

We will use this theorem to derive statistical mechanics.

Deriving Statistical Mechanics

We return to solving the max-entropy problem for the grand canonical ensemble. Since our objective function is convex and there are only affine equality constraints placed on $p$, strong duality holds. Therefore we only need to find the maximiser $\lambda^*$ of the dual function $g(\lambda) = \inf_p L(p, \lambda)$ to find the minimiser of $f$ in our primal problem. Differentiating $L$ w.r.t. $p_\alpha$ and setting it to 0, we find

$$ \frac{\partial L}{\partial p_\alpha} = \ln p_\alpha + 1 + \lambda_1 + \lambda_2 E_\alpha + \sum_i \lambda_{2+i} N_{i\alpha} = 0 \quad \Longrightarrow \quad p_\alpha = \exp\Big( -1 - \lambda_1 - \lambda_2 E_\alpha - \sum_i \lambda_{2+i} N_{i\alpha} \Big). $$

What remains is to work out the dual optimal point $\lambda^*$. Since the minimiser of $L(p, \lambda^*)$ must be primal feasible, it must satisfy the primal constraints:

$$ \sum_\alpha p_\alpha = 1, \qquad \sum_\alpha p_\alpha E_\alpha = U, \qquad \sum_\alpha p_\alpha N_{i\alpha} = \bar{N}_i. $$

The constraints implicitly fix the dual variables $\lambda^*$. Writing out the Lagrangian at the saddle point, i.e. the primal and dual optimum, and letting $\lambda_2^* \equiv \beta = 1/T$, $\lambda_{2+i}^* = -\mu_i / T$, and $Z \equiv e^{1 + \lambda_1^*}$, we have

$$ p_\alpha = \frac{1}{Z} e^{-\beta ( E_\alpha - \sum_i \mu_i N_{i\alpha} )}, \qquad L(p^*, \lambda^*) = -S = -\ln Z - \beta U + \beta \sum_i \mu_i \bar{N}_i. $$

Definition. The Grand Potential:

$$ \Phi \equiv -T \ln Z = U - TS - \sum_i \mu_i \bar{N}_i, $$

where the second equality is the saddle-point value above multiplied by $-T$.

If we write out the differential

$$ d\Phi = dU - T\, dS - S\, dT - \sum_i \big( \mu_i\, d\bar{N}_i + \bar{N}_i\, d\mu_i \big) $$

and substitute the definition of the Grand Potential $\Phi = -T \ln Z$ into the differential (also observing that $Z$ depends on $T$, the $\mu_i$, and the volume $V$ through the energies $E_\alpha$), a page of tedious algebraic simplifications leads us to

Theorem. The First Law of Thermodynamics:

$$ dU = T\, dS - P\, dV + \sum_i \mu_i\, d\bar{N}_i, $$

where we have assumed the generalised pressure $P$ is defined as

$$ P \equiv -\sum_\alpha p_\alpha \frac{\partial E_\alpha}{\partial V}. $$
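The page of algebra can be compressed considerably; what follows is my own condensed sketch of one route rather than a full proof. Differentiate $\Phi = -T \ln Z$ directly, with the volume dependence entering only through the energies $E_\alpha$:

```latex
% With Z = \sum_\alpha e^{-\beta(E_\alpha - \sum_i \mu_i N_{i\alpha})} and \beta = 1/T:
\begin{align*}
d\Phi &= -\ln Z \, dT - T \, d(\ln Z) \\
      &= -\ln Z \, dT - T \left[ \frac{\partial \ln Z}{\partial T}\, dT
         + \sum_i \frac{\partial \ln Z}{\partial \mu_i}\, d\mu_i
         + \frac{\partial \ln Z}{\partial V}\, dV \right] \\
      &= -S \, dT - \sum_i \bar{N}_i \, d\mu_i - P \, dV .
\end{align*}
% Equating this with the differential of \Phi = U - TS - \sum_i \mu_i \bar{N}_i,
%   d\Phi = dU - T dS - S dT - \sum_i (\mu_i d\bar{N}_i + \bar{N}_i d\mu_i),
% and cancelling terms yields dU = T dS - P dV + \sum_i \mu_i d\bar{N}_i.
```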

Naturally, $\Phi = \Phi(T, V, \mu_i)$: it is really only dependent on the constraints and on the particular physics of the states encoded in $E_\alpha$ and $N_{i\alpha}$.
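To watch the dual machinery produce thermodynamics on actual numbers, here is a sketch continuing the toy system from earlier (one species, made-up values): maximise the dual, read off $T$ and $\mu$ from the optimal multipliers, and check the grand potential identity $\Phi = -T \ln Z = U - TS - \mu \bar{N}$.

```python
import numpy as np
from scipy.optimize import minimize

E = np.array([0.0, 1.0, 1.0, 2.0, 2.0, 3.0])   # toy energies
N = np.array([0.0, 1.0, 0.0, 1.0, 2.0, 2.0])   # toy particle numbers
A = np.vstack([np.ones_like(E), E, N])
b = np.array([1.0, 1.2, 0.8])                   # (1, U, N_bar)
U, N_bar = b[1], b[2]

def g(lam):
    # closed-form dual function, as before
    return -np.sum(np.exp(-1.0 - A.T @ lam)) - lam @ b

lam = minimize(lambda l: -g(l), np.zeros(3)).x   # dual optimal multipliers
p = np.exp(-1.0 - A.T @ lam)                     # the Gibbs distribution

T = 1.0 / lam[1]                 # lambda_2 = beta = 1/T
mu = -lam[2] * T                 # lambda_3 = -mu/T
S = -np.sum(p * np.log(p))       # entropy at the optimum
Z = np.exp(1.0 + lam[0])         # grand partition function

print(A @ p - b)                               # ~0: p is primal feasible
print(-T * np.log(Z), U - T * S - mu * N_bar)  # both equal Phi
```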

Short cut via Dual Variables

The First Law can be used to derive the identities

$$ \frac{1}{T} = \left( \frac{\partial S}{\partial U} \right)_{V, \bar{N}_i}, \qquad \frac{\mu_i}{T} = -\left( \frac{\partial S}{\partial \bar{N}_i} \right)_{U, V, \bar{N}_{j \neq i}}. $$

As it turns out, these relations are naturally borne out of the properties of the dual variables - they are more fundamental than the First Law! Consider again

$$ \min_x f(x) \quad \text{subject to} \quad Ax = b. $$

Suppose we shift the constraints s.t. $Ax = b + \delta$ and find the new minimum of $f$; we call this the perturbed problem. Denote $p^*(0)$ the minimum of the original problem and $p^*(\delta)$ the minimum of the perturbed problem. If we have strong duality, our previous proposition implies

$$ p^*(0) = g(\lambda^*) = \inf_x \big[ f(x) + \lambda^{*T} (Ax - b) \big]. $$

Choose $x$ to be those which satisfy $Ax = b + \delta$; then

$$ p^*(0) \le f(x) + \lambda^{*T} (Ax - b) = f(x) + \lambda^{*T} \delta. $$

Furthermore, choose $x$ to be the minimiser of the perturbed problem, i.e. $f(x) = p^*(\delta)$:

$$ p^*(0) \le p^*(\delta) + \lambda^{*T} \delta. $$

Suppose $p^*(\delta)$ is differentiable at $\delta = 0$. For $\delta = t e_j$ with $t > 0$, where $e_j$ is the $j$th unit vector, we can rearrange the inequality to

$$ \frac{p^*(t e_j) - p^*(0)}{t} \ge -\lambda_j^* $$

and take $t \to 0^+$, yielding

$$ \frac{\partial p^*}{\partial \delta_j} \bigg|_{\delta = 0} \ge -\lambda_j^*. $$

We rearrange again for $t < 0$, which flips the inequality:

$$ \frac{p^*(t e_j) - p^*(0)}{t} \le -\lambda_j^*. $$

Taking $t \to 0^-$,

$$ \frac{\partial p^*}{\partial \delta_j} \bigg|_{\delta = 0} \le -\lambda_j^*. $$

Combining the two inequalities, we conclude that

$$ \frac{\partial p^*}{\partial \delta_j} \bigg|_{\delta = 0} = -\lambda_j^*. $$

The gradient of the objective function's minimum with respect to constraint shifts is given by (minus) the optimal dual variables. This lends a natural interpretation to the dual variables: they represent the local sensitivity of the optimum objective function with respect to changes in the constraints. In the context of the grand canonical ensemble, this is simply our definition of $T$ and $\mu_i$: shifting the energy constraint changes the minimum of $-S$ at the rate $-\lambda_2^* = -1/T$, recovering $\partial S / \partial U = 1/T$, and shifting $\bar{N}_i$ likewise recovers $\partial S / \partial \bar{N}_i = -\mu_i / T$. As a consequence, rather than slogging through tedious algebraic expressions of differentials, we can use this sensitivity result and the definition of the Grand Potential to derive the First Law in one line! A numerical check is sketched below.
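Here is the promised numerical check, again on the toy system and again a sketch of my own construction: shift each constraint by a small amount, re-solve via the dual, and compare the finite-difference slope of $p^*(\delta)$ with $-\lambda_j^*$.

```python
import numpy as np
from scipy.optimize import minimize

E = np.array([0.0, 1.0, 1.0, 2.0, 2.0, 3.0])
N = np.array([0.0, 1.0, 0.0, 1.0, 2.0, 2.0])
A = np.vstack([np.ones_like(E), E, N])
b = np.array([1.0, 1.2, 0.8])

def solve_dual(b_vec):
    # Maximise the dual of the (perturbed) problem with constraints A p = b_vec;
    # by strong duality the optimal value equals the primal optimum p*.
    g = lambda lam: -np.sum(np.exp(-1.0 - A.T @ lam)) - lam @ b_vec
    res = minimize(lambda lam: -g(lam), np.zeros(3), tol=1e-12)
    return -res.fun, res.x

p_opt, lam = solve_dual(b)   # unperturbed optimum and multipliers
h = 1e-4
for j in range(3):
    db = np.zeros(3)
    db[j] = h
    slope = (solve_dual(b + db)[0] - p_opt) / h  # finite difference of p*(delta)
    print(slope, -lam[j])                        # the two columns should agree
```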

So, what’s new?

So what have we gained by using this complex machinery of duality? Obviously we have learnt nothing particularly new per se; after all, statistical mechanics is a mature field of physics! What duality does offer is a new perspective and aesthetic. I put forward two points as to why what we have done here has aesthetic value.

Our Understanding of Temperature and Chemical Potentials

Asked why temperature is the measure of the change in entropy induced by a change in internal energy (see the identity $1/T = (\partial S / \partial U)_{V, \bar{N}_i}$ above), a traditional thermodynamicist might appeal to the First Law of Thermodynamics and write out a one-line derivation of the identity. Yet the First Law does not explain how temperature and chemical potentials came into being in the first place: they are already there in the Law! As a statement, the First Law only contains marginally more information than these identities; apart from the pressure term, they more or less restate the First Law.

Going deeper, we know full well that the First Law is not a fundamental law; rather, it is a consequence of a much deeper philosophy, the Maximum Entropy Principle. Though we could attempt to explain the genesis of these identities by tracing the origin of the First Law back to the Maximum Entropy Principle, this endeavour is not only more difficult but also too convoluted, when we could simply cut out the First Law middleman and define temperature and chemical potentials as local measures of the optimisation objective's sensitivity to changes in the physical constraints. With this definition, temperature and chemical potentials have a direct connection to the fundamental principle behind statistical mechanics.

The Grand Potential, begotten not made?

Most thermodynamics literature pulls the Grand Potential out of thin air and reveres it as a miraculous object; magically, all of the useful quantities in thermodynamics are obtained by taking partial derivatives of the Grand Potential. Textbooks just state its properties and tell students to take them away for good use. It is as if physicists stumbled upon the Grand Potential by chance, or carried it down from Mount Sinai on a stone tablet.

Yet armed with the mathematics, we now know better. The definition of the Grand Potential is in fact an application of the equivalence of the Lagrangian saddle point with the primal and dual optima, a most elegant theorem! The Grand Potential is the end result of the optimisation problem, not simply an ad hoc utility borne out of convenience. This revelation can only serve to elevate the status of this holy object in statistical physics.