Conjugate Gradients
A-conjugacy
We begin with a unit circle plot as shown below. We draw two perpendicular (orthogonal) vectors (orange) on the circumference. We can verify that these vectors are orthogonal by taking their dot product and obtaining $0$.
Next, let us stretch the space along the x-axis by a factor of $2$. This stretching operation can be thought of as warping our Cartesian plane. We see this ‘stretch’ if we keep using the same units on both axes, e.g. metres on both the x and y axis.
Now, how do we ‘stretch’ the x-axis by $2$? This can be done with a transformation matrix $\boldsymbol{T}$.
\[\begin{aligned} \boldsymbol{T} &= \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}\\ \boldsymbol{T}\boldsymbol{z} &= \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \end{aligned}\]However, if we think of the transformed plane as having the x-axis scaled by 2 while keeping the y unit the same, we will still have a circle on the transformed plane. This is a similar idea to log-linear graphs, or a graph of $cm$ vs $mm$.
The arrow vectors are $\boldsymbol{A}$-orthogonal in the scaled space, as there exists a matrix $\boldsymbol{A}$ (the inverse operation of $\boldsymbol{T}$) that reverses the transformation step (“unwarps our space”), and we know that in the unwarped space the arrow vectors are orthogonal.
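We can check this numerically. Below is a minimal sketch where the vectors and the choice $A = T^{-\top}T^{-1}$ are my own illustration: since $(Tu)^TA(Tv) = u^Tv$, stretched images of orthogonal vectors are $A$-orthogonal.

```julia
using LinearAlgebra

T = [2.0 0.0; 0.0 1.0]      # stretch the x-axis by a factor of 2
u = [1.0, 1.0] / sqrt(2)    # two orthogonal vectors on the unit circle
v = [-1.0, 1.0] / sqrt(2)
@show dot(u, v)             # 0.0: orthogonal in the original space

A = inv(T)' * inv(T)        # metric that "unwarps" the stretch
w1, w2 = T * u, T * v       # the arrow vectors in the scaled space
@show dot(w1, w2)           # nonzero: no longer orthogonal after stretching
@show w1' * A * w2          # ≈ 0: but they are A-orthogonal
```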
Interactive Demo
Consider the quadratic problem
$f(\boldsymbol{x})=\frac{1}{2}\boldsymbol{x}^T\boldsymbol{A}\boldsymbol{x}-\boldsymbol{b}^T\boldsymbol{x}+c$
where $\boldsymbol{A}$ is a symmetric positive definite matrix, $\boldsymbol{x}$ and $\boldsymbol{b}$ are vectors, and $c$ is an arbitrary scalar constant.
The gradient is
\(\begin{aligned}
\nabla f(\boldsymbol{x}) &= \frac{1}{2}\boldsymbol{A}^T\boldsymbol{x}+\frac{1}{2}\boldsymbol{A}\boldsymbol{x}-\boldsymbol{b}\\
\nabla f(\boldsymbol{x}) &= \boldsymbol{A}\boldsymbol{x}-\boldsymbol{b} && (\text{since } \boldsymbol{A} \text{ is symmetric})
\end{aligned}\)
Setting the gradient to zero,
\(\begin{aligned}
\boldsymbol{0} &= \boldsymbol{A}\boldsymbol{x}-\boldsymbol{b}\\
\boldsymbol{A}\boldsymbol{x}&=\boldsymbol{b}
\end{aligned}\)
The global minimizer of the quadratic problem can be found by solving $\boldsymbol{A}\boldsymbol{x}=\boldsymbol{b}$ if $\boldsymbol{A}$ is positive definite ($\boldsymbol{x}^T\boldsymbol{A}\boldsymbol{x}>0$ for all $\boldsymbol{x}\neq\boldsymbol{0}$).
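We can sanity-check this numerically. The sketch below uses an arbitrary SPD matrix built as $M^TM+3I$ (my own choice) and compares $f$ at the solution of $Ax=b$ against perturbed points.

```julia
using LinearAlgebra

M = randn(3, 3)
A = M' * M + 3I                    # symmetric positive definite by construction
b = randn(3); c = 1.0
f(x) = 0.5 * x' * A * x - b' * x + c

xstar = A \ b                      # solve Ax = b
@show f(xstar)
# every perturbed point has a strictly larger f
@show all(f(xstar + 0.1 * randn(3)) > f(xstar) for _ in 1:100)   # true
```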
Now for simplicity, I will abuse notation and treat unbolded symbols as matrices/vectors.
Proof:
Let $x^*$ be the minimum point, $A$ a positive definite matrix, and $\delta$ some perturbation.
\(\begin{aligned}
f(x^*+\delta) &= \frac{1}{2}(x^*+\delta)^TA(x^*+\delta)-b^T(x^*+\delta)+c\\
&= \frac{1}{2}(x^*)^TAx^*+\frac{1}{2}(x^*)^TA\delta+\frac{1}{2}\delta^TAx^*+\frac{1}{2}\delta^TA\delta-b^T(x^*+\delta)+c\\
&= \frac{1}{2}(x^*)^TAx^*-b^Tx^*+c+\delta^TAx^*-b^T\delta+\frac{1}{2}\delta^TA\delta && ((x^*)^TA\delta=\delta^TAx^* \text{ by symmetry})\\
&= f(x^*)+\delta^T(Ax^*-b)+\frac{1}{2}\delta^TA\delta\\
&= f(x^*)+\frac{1}{2}\delta^TA\delta && (Ax^*=b)
\end{aligned}\)
Since $\frac{1}{2}\delta^TA\delta>0$ for any $\delta\neq 0$ by positive definiteness, $x^*$ is the global minimum.
Given a set $P = \{d_1,d_2,\dots,d_n\}$ of mutually conjugate vectors, $P$ forms a basis of $\mathbb{R}^n$, where $A$ is a symmetric positive definite matrix. Positive definiteness implies strict convexity, i.e. there is a unique minimum and no saddle points.
If $d_i^TAd_j=0$ for $i\neq j$, then $d_i$ is $A$-conjugate to $d_j$.
Now since $P$ is a set of $n$ basis vectors, we can write \(\begin{aligned} x^* = \sum_{i=1}^n\alpha_id_i \end{aligned}\) To solve for $\alpha_i$ we apply the A-conjugate property: \(\begin{aligned} Ax^* &= \sum_{i=1}^n\alpha_iAd_i\\ d_j^TAx^* &= \sum_{i=1}^n\alpha_id_j^TAd_i\\ d_j^Tb &= \alpha_jd_j^TAd_j && (Ax^*=b; \text{ conjugacy kills every term but } i=j)\\ \alpha_j &= \frac{d_j^Tb}{d_j^TAd_j} \end{aligned}\)
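As a concrete illustration, the eigenvectors of a symmetric $A$ form one convenient mutually conjugate set (since $d_i^TAd_j=\lambda_jd_i^Td_j=0$ for $i\neq j$). The sketch below, with randomly generated data of my own, uses them to evaluate the expansion above.

```julia
using LinearAlgebra

M = randn(4, 4)
A = M' * M + 4I                   # symmetric positive definite
b = randn(4)

D = eigen(Symmetric(A)).vectors   # columns are mutually A-conjugate directions
α = [(D[:, j]' * b) / (D[:, j]' * A * D[:, j]) for j in 1:4]
xstar = sum(α[j] * D[:, j] for j in 1:4)

@show norm(A * xstar - b)         # ≈ 0: the expansion solves Ax = b
```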
Orthogonal Directions
Orthogonal directions is the idea that, with one direction $d_i$ per dimension of $\mathbb{R}^n$, the solution can be found in $n$ steps. This is done by having each iteration step exactly to the correct $x_i$ along the $i$th direction. E.g. if we imagine the direction vectors as the basis vectors, we step through each basis vector in turn, so it takes 2 steps to reach the solution on a circular contour plot: the 1st step lands on the correct x coordinate, the 2nd on the correct y coordinate, as in the sketch below.
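A tiny sketch of the circular case (with $A=I$ the minimizer is simply $x^*=b$; the numbers are my own):

```julia
let b = [3.0, -2.0], x = zeros(2)   # with A = I the minimizer is x* = b
    for i in 1:2
        d = [j == i ? 1.0 : 0.0 for j in 1:2]   # i-th coordinate direction
        x += (b - x)[i] * d                      # step lands on the i-th coordinate of x*
    end
    @show x                                      # == b after exactly 2 steps
end
```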
Conjugate Directions
Taking the idea of the circular contour plot from orthogonal directions, we can apply it to the $A$-conjugate case.
Let the residual be the negative gradient. \(\begin{aligned} x_{j+1} &= x_j+\alpha_jd_j\\ \alpha_j &= \frac{d^T_jr_j}{d_j^TAd_j}\\ \text{where } r_j &= b-Ax_j \text{ is the } j\text{th residual}\\ \end{aligned}\)
Proof: \(\begin{aligned} x^*-x_0 &= \sum_{i=1}^n\alpha_i d_i\\ x_j-x_0 &= \sum_{i=1}^{j-1}\alpha_i d_i\\ d_j^TA(x^*-x_0) &= \sum_{i=1}^n\alpha_i d_j^T A d_i\\ d_j^TA(x^*-x_0) &= \alpha_j d_j^T A d_j && (\text{conjugacy kills every term but } i=j)\\ \alpha_j &= \frac{d_j^TA(x^*-x_0) }{d_j^T A d_j}\\ \alpha_j &= \frac{d_j^TA(x^*-x_j+x_j-x_0) }{d_j^T A d_j}\\ \alpha_j &= \frac{d_j^TA(x^*-x_j) }{d_j^T A d_j} && (x_j-x_0\in\operatorname{span}\{d_1,\dots,d_{j-1}\}\text{, so }d_j^TA(x_j-x_0)=0)\\ \alpha_j &= \frac{d_j^T(b-Ax_j) }{d_j^T A d_j} && (Ax^*=b)\\ \alpha_j &= \frac{d_j^Tr_j }{d_j^T A d_j}\\ \end{aligned}\)
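Putting the step size and the update together, here is a sketch of the conjugate directions loop, again using eigenvectors as an example conjugate set with randomly generated data:

```julia
using LinearAlgebra

function conjugate_directions(A, b, D)
    x = zeros(length(b))
    for j in axes(D, 2)
        d = D[:, j]
        r = b - A * x                        # j-th residual
        x += (d' * r) / (d' * A * d) * d     # x_{j+1} = x_j + α_j d_j
    end
    return x
end

M = randn(4, 4); A = M' * M + 4I; b = randn(4)
D = eigen(Symmetric(A)).vectors              # a mutually A-conjugate set
@show norm(A * conjugate_directions(A, b, D) - b)   # ≈ 0 after n steps
```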
Gram-Schmidt
Let $\{u_0,u_1,\dots,u_{n-1}\}$ be a set of linearly independent vectors.
\[\begin{aligned} d_i &= u_i+\sum_{k=0}^{i-1}\beta_{ik} d_k\\ d_j^TAd_i &= d_j^TAu_i+\sum_{k=0}^{i-1}\beta_{ik} d_j^TAd_k && (j<i)\\ 0 &= d_j^TAu_i+\beta_{ij} d_j^TAd_j && (\text{conjugacy leaves only the } k=j \text{ term})\\ \beta_{ij} &= -\frac{d_j^TAu_i}{d_j^TAd_j}\\ \beta_{ij} &= -\frac{u_i^TAd_j}{d_j^TAd_j} && (A \text{ symmetric}) \end{aligned}\]
Conjugate Gradients
Conjugate gradients (CG) is essentially conjugate directions with the Gram–Schmidt vectors $u$ set to the residuals $r$. Each iteration performs three updates:
- Update x \(\begin{aligned} x_{j+1} &= x_j+\alpha_jd_j\\ \alpha_j &= \frac{d_j^Tr_j }{d_j^T A d_j}\\ \end{aligned}\)
- Update residual \(\begin{aligned} r_{j+1} &= -Ae_{j+1} && (e_j=x_j-x^*,\ r_j=-Ae_j)\\ r_{j+1} &= -A(e_j+\alpha_jd_j)\\ r_{j+1} &= r_j-\alpha_jAd_j\\ \end{aligned}\)
- Update the direction vector. Since the new residual $r_{j+1}$ is $A$-orthogonal to all earlier directions except the most recent one, the Gram–Schmidt sum collapses to a single term: $\beta_{ij} = -\frac{r_i^TAd_j}{d_j^TAd_j}$ becomes $\beta_{j+1} = -\frac{r_{j+1}^TAd_j}{d_j^TAd_j}$, giving $d_{j+1} = r_{j+1}+\beta_{j+1}d_j$.
Julia implementation
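One minimal way to implement these three updates, using the unsimplified Gram–Schmidt $\beta$ (a sketch; the test matrix is my own):

```julia
using LinearAlgebra

function cg(A, b; x = zeros(length(b)), tol = 1e-10)
    r = b - A * x
    d = copy(r)                        # d_0 = r_0
    for _ in 1:length(b)
        Ad = A * d
        α = (d' * r) / (d' * Ad)       # step size
        x += α * d                     # update x
        r -= α * Ad                    # update residual
        norm(r) < tol && break
        β = -(r' * Ad) / (d' * Ad)     # Gram–Schmidt coefficient (one surviving term)
        d = r + β * d                  # update direction
    end
    return x
end

M = randn(5, 5); A = M' * M + 5I; b = randn(5)
@show norm(A * cg(A, b) - b)           # ≈ 0
```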
$\beta$ simplification
$\beta$ can be further simplified by the following steps. \(\begin{equation} r_{j+1} = r_j - \alpha_jAd_j \tag{1}\label{eq:residual} \end{equation}\)
\[\begin{aligned} r_i^Tr_{j+1} &= r_i^Tr_j - \alpha_jr_i^TAd_j\\ r_i^TAd_j &= \frac{r_i^Tr_j - r_i^Tr_{j+1} }{\alpha_j} \\ &= \frac{- r_i^Tr_{j+1} }{\alpha_j} && (\text{for } i=j+1\text{, since the residuals are mutually orthogonal: } r_{j+1}^Tr_j=0)\\ \end{aligned}\]Using (\ref{eq:residual}) again, \(\begin{aligned} r_{j+1} &= r_j - \alpha_jAd_j \\ d_j^Tr_{j+1} &= d_j^Tr_j - \alpha_jd_j^TAd_j\\ 0 &= d_j^Tr_j - \alpha_jd_j^TAd_j && (r_{j+1} \text{ is orthogonal to } d_j)\\ \alpha_jd_j^TAd_j &= d_j^Tr_j\\ \text{using } d_i^T &= u_i^T+\sum_{k=0}^{i-1}\beta_{ik} d_k^T\\ d_i^Tr_j &= u_i^Tr_j+\sum_{k=0}^{i-1}\beta_{ik} d_k^Tr_j\\ d_i^Tr_j &= u_i^Tr_j && (r_j \text{ is orthogonal to earlier directions: } d_k^Tr_j=0 \text{ for } k<j)\\ \alpha_jd_j^TAd_j &= r_j^Tr_j && (u_j=r_j \text{ in CG})\\ \end{aligned}\) Hence, \(\begin{aligned} \beta_{ij} &= -\frac{r_i^TAd_j}{d_j^TAd_j}\\ \beta_{j+1} &= -\frac{r_{j+1}^TAd_j}{d_j^TAd_j} = \frac{r_{j+1}^Tr_{j+1}}{\alpha_jd_j^TAd_j} = \frac{r_{j+1}^Tr_{j+1}}{r_{j}^Tr_{j}} \end{aligned}\)
CG summary
To summarise the conjugate gradient algorithm, \(\begin{aligned} d_0 &= r_0 = b-Ax_0\\ \alpha_j &= \frac{r_j^Tr_j}{d_j^TAd_j}\\ x_{j+1} &= x_j+\alpha_jd_j\\ r_{j+1} &= r_j-\alpha_jAd_j\\ \beta_{j+1} &= \frac{r_{j+1}^Tr_{j+1}}{r_j^Tr_j}\\ d_{j+1} &= r_{j+1}+\beta_{j+1}d_j \end{aligned}\)
Julia implementation example
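A minimal sketch of the summarised algorithm with the simplified $\beta$ (test data of my own):

```julia
using LinearAlgebra

function conjugate_gradient(A, b; x = zeros(length(b)), tol = 1e-10)
    r = b - A * x
    d = copy(r)
    rs = r' * r
    for _ in 1:length(b)
        Ad = A * d
        α = rs / (d' * Ad)             # α_j = r_j'r_j / d_j'Ad_j
        x += α * d
        r -= α * Ad
        rs_new = r' * r
        rs_new < tol^2 && break
        d = r + (rs_new / rs) * d      # β_{j+1} = r_{j+1}'r_{j+1} / r_j'r_j
        rs = rs_new
    end
    return x
end

M = randn(6, 6); A = M' * M + 6I; b = randn(6)
@show norm(A * conjugate_gradient(A, b) - b)   # ≈ 0
```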
Finally, we plot the quadratic's contours and visualise the CG steps, as sketched below.
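One way to produce such a plot (a sketch assuming the Plots.jl package; the $2\times 2$ system is my own example):

```julia
using Plots, LinearAlgebra

A = [3.0 1.0; 1.0 2.0]; b = [1.0, 2.0]
f(x, y) = 0.5 * [x, y]' * A * [x, y] - b' * [x, y]

# record the CG iterates for this 2x2 system
xs = [[-2.0, 2.0]]
let x = copy(xs[1]), r = b - A * x, d = copy(r)
    for _ in 1:2
        α = (r' * r) / (d' * A * d)
        x += α * d
        push!(xs, copy(x))
        r_new = r - α * A * d
        d = r_new + ((r_new' * r_new) / (r' * r)) * d
        r = r_new
    end
end

contour(-3:0.05:3, -3:0.05:3, f)
plot!(first.(xs), last.(xs), marker = :circle, label = "CG steps")
```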