Conjugate Gradients
A-conjugacy
We begin with a unit circle plot as shown below. We draw pairs of perpendicular (orthogonal) vectors (orange) at points on the circumference. We can verify that the vectors in each pair are orthogonal by taking their dot product and obtaining 0.
using Plots
# Get circle points
x = collect(-1:0.01:1)
y = sqrt.(1 .- x.^2)
x = append!(x, -x)
y = append!(y, -y)
z = [x y]'
# Plot circle and perpendicular arrow vectors
plot(z[1,:], z[2,:], aspect_ratio=:equal)
v = [-1 -1 1 1 0 0 0 0;0 0 0 0 1 1 -1 -1]
Δv = [0.25 0.25 -0.25 -0.25 0.25 -0.25 0.25 -0.25;-0.25 0.25 0.25 -0.25 -0.25 -0.25 0.25 0.25]
quiver!(v[1,:], v[2,:], quiver=(Δv[1,:], Δv[2,:]))
Next, let us stretch the space along the x-axis by a factor of $2$. This stretching operation can be thought of as warping our Cartesian plane. We see this ‘stretch’ if we continue to use the same units on the x and y axes, e.g. metres on both.
Now, how do we ‘stretch’ the x-axis by a factor of 2? This can be done with a transformation matrix $\boldsymbol{T}$.
\[\begin{aligned} \boldsymbol{T} &= \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}\\ \boldsymbol{T}\boldsymbol{z} &= \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \end{aligned}\]
# Transformation matrix
T = [2 0;0 1]
tz = T*z
v_trans = T*v
Δv_trans = T*Δv
# Plot scaled circled and resulting arrow vectors
plot(tz[1,:], tz[2,:], aspect_ratio=:equal)
quiver!(v_trans[1,:], v_trans[2,:], quiver=(Δv_trans[1,:], Δv_trans[2,:]))
However, if we think of the transformed plane as having its x-axis scaled by 2 while the y-axis keeps the original unit, we still have a circle on the transformed plane. This is a similar idea to log-linear graphs, or a graph of $cm$ vs $mm$.
# scaled x-axis by 2
plot(tz[1,:], tz[2,:], aspect_ratio=2)
The arrow vectors are $\boldsymbol{A}$-orthogonal vectors in the scaled space, as there exists a matrix $\boldsymbol{A}$ (built from the inverse operation of $\boldsymbol{T}$) that reverses the transformation step (“unwarps our space”). And we know that in the unwarped space, the arrow vectors are orthogonal.
#=
Vectors are 'A' orthogonal, as shown by taking the dot product
of 'unwarped' vectors
=#
(inv(T)*Δv_trans)[:,1]'*(inv(T)*Δv_trans)[:,2]
# Result:
# 0.0
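To make the connection explicit, here is a small sketch (my own addition, reusing `T` and `Δv_trans` from above) that forms the matrix $\boldsymbol{A} = \boldsymbol{T}^{-T}\boldsymbol{T}^{-1}$ and checks the conjugacy condition directly:
# Build A explicitly from the inverse transformation, so that
# (T*d1)'*A*(T*d2) = d1'*d2 for any original vectors d1, d2
A = inv(T)' * inv(T)
d1 = Δv_trans[:, 1]
d2 = Δv_trans[:, 2]
d1' * A * d2    # -> 0.0, i.e. d1 and d2 are A-orthogonal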
Interactive Demo
Consider the quadratic problem
$f(\boldsymbol{x})=\frac{1}{2}\boldsymbol{x}^T\boldsymbol{A}\boldsymbol{x}-\boldsymbol{b}^T\boldsymbol{x}+c$
where $\boldsymbol{A}$ is a symmetric positive definite matrix, $\boldsymbol{x}$ and $\boldsymbol{b}$ are vectors, and $c$ is an arbitrary scalar constant.
The gradient is
\(\begin{aligned}
\nabla f(\boldsymbol{x}) &= \frac{1}{2}\boldsymbol{A}^T\boldsymbol{x}+\frac{1}{2}\boldsymbol{A}\boldsymbol{x}-\boldsymbol{b}\\
\nabla f(\boldsymbol{x}) &= \boldsymbol{A}\boldsymbol{x}-\boldsymbol{b} &(since \ \boldsymbol{A} \ is \ symmetric)\\
\end{aligned}\)
Setting the gradient to zero,
\(\begin{aligned}
\boldsymbol{0} &= \boldsymbol{A}\boldsymbol{x}-\boldsymbol{b}\\
\boldsymbol{A}\boldsymbol{x}&=\boldsymbol{b}\\
\end{aligned}\)
The global minimizer of the quadratic problem can be found by solving $\boldsymbol{A}\boldsymbol{x}=\boldsymbol{b}$ if $\boldsymbol{A}$ is positive definite ($\boldsymbol{x}^T\boldsymbol{A}\boldsymbol{x}>0$ for all $\boldsymbol{x}\neq\boldsymbol{0}$).
Now for simplicity, I will abuse notation and treat unbolded symbols as matrices/vectors.
Proof:
Let $x^*$ be the solution of $Ax^*=b$, $A$ a positive definite matrix, and $\delta$ some perturbation.
\(\begin{aligned}
f(x^*+\delta) &= \frac{1}{2}(x^*+\delta)^TA(x^*+\delta)-b^T(x^*+\delta)+c\\
&= \frac{1}{2}(x^*)^TAx^*+\frac{1}{2}(x^*)^TA\delta+\frac{1}{2}\delta^TAx^*+\frac{1}{2}\delta^TA\delta-b^T(x^*+\delta)+c\\
&= \frac{1}{2}(x^*)^TAx^*-b^Tx^*+c+\delta^TAx^*-b^T\delta+\frac{1}{2}\delta^TA\delta\\
&= f(x^*)+\delta^T(Ax^*-b)+\frac{1}{2}\delta^TA\delta\\
&= f(x^*)+\frac{1}{2}\delta^TA\delta &(since \ Ax^*=b)\\
\end{aligned}\)
Since $\frac{1}{2}\delta^TA\delta>0$ for any $\delta\neq 0$ due to positive definiteness, $x^*$ is the global minimum.
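As a quick numerical sanity check, here is a minimal sketch (using the same $A$ and $b$ that appear in the Julia examples further down) that solves $Ax=b$ and confirms the gradient vanishes there and that perturbations only increase $f$:
A = [4 1;1 3.]
b = [1, 2.]
f(x) = 0.5*x'*A*x - b'*x
xstar = A \ b                 # solve Ax = b
A*xstar - b                   # gradient at xstar, ≈ [0, 0]
δ = randn(2)                  # random perturbation
f(xstar + δ) - f(xstar)       # equals 0.5*δ'*A*δ ≥ 0, so xstar is the minimum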
Given a set $P = \{d_1,d_2,…,d_n\}$ of mutually conjugate vectors, $P$ forms a basis of $\mathbb{R}^n$, with $A$ a symmetric positive definite matrix. Positive definiteness implies strict convexity, i.e. there is a minimum solution and no saddle points.
If $d_i^TAd_j=0$ for $i\neq j$, then $d_i$ is $A$-conjugate to $d_j$.
Now since $P$ is a set of $n$ basis vectors, we can write \(\begin{aligned} x^* = \sum_{i=1}^n\alpha_id_i \end{aligned}\) To solve for $\alpha_j$ we apply the $A$-conjugate property (all cross terms $d_j^TAd_i$ with $i\neq j$ vanish) together with $Ax^*=b$ \(\begin{aligned} Ax^* &= \sum_{i=1}^n\alpha_iAd_i\\ d_j^TAx^* &= \sum_{i=1}^n\alpha_id_j^TAd_i\\ d_j^Tb &= \alpha_jd_j^TAd_j\\ \alpha_j &= \frac{d_j^Tb}{d_j^TAd_j} \end{aligned}\)
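To see this formula in action, here is a sketch (my own illustration, reusing the $A$ and $b$ from the examples below) where the eigenvectors of the symmetric matrix $A$ serve as the mutually conjugate directions, since $d_i^TAd_j = \lambda_j d_i^Td_j = 0$ for $i\neq j$:
using LinearAlgebra
A = [4 1;1 3.]
b = [1, 2.]
D = eigvecs(Symmetric(A))     # orthonormal eigenvectors, hence A-conjugate columns
D[:,1]' * A * D[:,2]          # ≈ 0, the conjugacy check
α = [D[:,j]'*b / (D[:,j]'*A*D[:,j]) for j in 1:2]
xstar = sum(α[j] * D[:,j] for j in 1:2)
xstar ≈ A \ b                 # true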
Orthogonal Directions
Orthogonal directions is the idea that, with one direction $d_i$ per dimension of $\mathbb{R}^n$, the solution can be found in $n$ steps by having each iteration step exactly to the correct coordinate $x_i$ along the $i$th direction. E.g. if we imagine the direction vectors as the basis vectors, we step through each basis vector in turn, so it takes 2 steps to reach the solution of a circular contour plot: the 1st step lands on the correct x coordinate, the 2nd on the correct y coordinate, as in the sketch below.
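A minimal sketch of that 2-step walk (assuming circular contours, i.e. $A$ is the identity, and using the coordinate axes as the orthogonal directions; note it uses the known solution to pick each step length, purely for illustration):
using LinearAlgebra
A = Matrix(1.0I, 2, 2)             # identity ⇒ circular contours
b = [1, 2.]
xstar = A \ b                      # known solution, used here only to set the step sizes
function step_along(dirs, x, xstar)
    for d in dirs
        α = d' * (xstar - x)       # step straight to the correct coordinate
        x = x + α * d
    end
    x
end
step_along(([1.0, 0.0], [0.0, 1.0]), [-3.0, 5.0], xstar) ≈ xstar   # true after 2 steps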
Conjugate Directions
Taking the idea of the circular contour plot from orthogonal directions, we can apply the same idea to the $A$-conjugate case.
Let the residual be the negative of the gradient. \(\begin{aligned} x_{j+1} &= x_j+\alpha_jd_j\\ \alpha_j &= \frac{d^T_jr_j}{d_j^TAd_j}\\ where \ r_j &= b-Ax_j \ is \ the \ jth \ residual\\ \end{aligned}\)
Proof: \(\begin{aligned} x^*-x_0 &= \sum_{i=1}^n\alpha_i d_i\\ x_j-x_0 &= \sum_{i=1}^{j-1}\alpha_i d_i\\ d_j^TA(x^*-x_0) &= d_j^TA(\sum_{i=1}^n\alpha_i d_i)\\ d_j^TA(x^*-x_0) &= \sum_{i=1}^n\alpha_i d_j^T A d_i\\ d_j^TA(x^*-x_0) &= \alpha_j d_j^T A d_j\\ \alpha_j &= \frac{d_j^TA(x^*-x_0) }{d_j^T A d_j}\\ \alpha_j &= \frac{d_j^TA(x^*-x_j+x_j-x_0) }{d_j^T A d_j}\\ \alpha_j &= \frac{d_j^TA(x^*-x_j) }{d_j^T A d_j} &(d_j^TA(x_j-x_0)=0 \ by \ conjugacy)\\ \alpha_j &= \frac{d_j^T(b-Ax_j) }{d_j^T A d_j}\\ \alpha_j &= \frac{d_j^Tr_j }{d_j^T A d_j}\\ \end{aligned}\)
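The sketch below runs this conjugate-directions update from a nonzero starting point, using a precomputed pair of $A$-conjugate directions (the eigenvectors of $A$ again, my own choice for illustration), and reaches the solution in $n=2$ steps:
using LinearAlgebra
A = [4 1;1 3.]
b = [1, 2.]
D = eigvecs(Symmetric(A))               # columns form a set of A-conjugate directions
function conjugate_directions(A, b, x, D)
    for j in 1:size(D, 2)
        d = D[:, j]
        r = b - A*x                      # jth residual
        α = d'*r / (d'*A*d)
        x = x + α*d
    end
    x
end
conjugate_directions(A, b, [-8, -8.], D) ≈ A \ b   # true after 2 steps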
Gram-Schmidt
Let $\{u_0,u_1,…,u_{n-1}\}$ be linearly independent vectors.
\[\begin{aligned} d_i &= u_i+\sum_{k=0}^{i-1}\beta_{ik} d_k\\ d_j^TAd_i &= d_j^TAu_i+d_j^TA\sum_{k=0}^{i-1}\beta_{ik} d_k \ where \ j<i\\ 0 &= d_j^TAu_i+d_j^TA\sum_{k=0}^{i-1}\beta_{ik} d_k\\ 0 &= d_j^TAu_i+\beta_{ij}d_j^TAd_j &(d_j^TAd_k=0 \ for \ k\neq j)\\ \beta_{ij} &= -\frac{d_j^TAu_i}{d_j^TAd_j}\\ \beta_{ij} &= -\frac{u_i^TAd_j}{d_j^TAd_j}\\ \end{aligned}\]
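Below is a small sketch of this conjugation step (the function name `conjugate_gram_schmidt` is my own; it conjugates the standard basis of $\mathbb{R}^2$ against the same $A$ used elsewhere in this post):
using LinearAlgebra
# Gram-Schmidt A-conjugation: build d_i from u_i by removing the parts
# that are not A-orthogonal to the directions constructed so far
function conjugate_gram_schmidt(A, U)
    D = similar(U)
    for i in 1:size(U, 2)
        d = U[:, i]
        for k in 1:i-1
            β = -(U[:, i]' * A * D[:, k]) / (D[:, k]' * A * D[:, k])
            d = d + β * D[:, k]
        end
        D[:, i] = d
    end
    D
end
A = [4 1;1 3.]
U = Matrix(1.0I, 2, 2)          # start from the standard basis u_0, u_1
D = conjugate_gram_schmidt(A, U)
D[:, 1]' * A * D[:, 2]          # ≈ 0, the new directions are A-conjugate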
Conjugate Gradients
Conjugate gradients (CG) is essentially the conjugate directions method with each $u$ set to the corresponding residual $r$.
- Update x \(\begin{aligned} x_{j+1} &= x_j+\alpha_jd_j\\ \alpha_j &= \frac{d_j^Tr_j }{d_j^T A d_j}\\ \end{aligned}\)
- Update the residual \(\begin{aligned} r_{j+1} &= -Ae_{j+1} \ (e_j=x_j-x^*)\\ r_{j+1} &= -A(e_j+\alpha_jd_j)\\ r_{j+1} &= r_j-\alpha_jAd_j\\ \end{aligned}\)
- Update the direction vector. Because the residuals are mutually orthogonal, all Gram-Schmidt terms except the most recent one vanish, so $\beta_{ij} = -\frac{r_i^TAd_j}{d_j^TAd_j}$ reduces to the single coefficient $\beta_{j+1} = -\frac{r_{j+1}^TAd_j}{d_j^TAd_j}$, giving $d_{j+1} = r_{j+1}+\beta_{j+1}d_j$.
Julia implementation
function conjugate_grad(A, x, b)
    r = -(A*x-b)                 # initial residual r = b - Ax
    d = r                        # first search direction is the residual
    while sum(abs2.(r))>0.1
        α = d'*r/(d'*A*d)        # step size along d
        println(x.+α*d)
        println(α)
        x = x .+ α*d             # update x
        r = r .- α*A*d           # update residual
        β = -r'*A*d/(d'*A*d)     # Gram-Schmidt coefficient
        d = r .+ β*d             # new A-conjugate direction
        println(β)
        println("r:",r)
    end
    x
end
conjugate_grad([4 1;1 3.], [45,1.],[1,2.])
Output:
[4.261940357227161, -9.41083746426417]
0.2263225535709602
0.020816994193039673
r:[-6.636923964644495, 25.97057203556534]
[0.09090909090908994, 0.6363636363636367]
0.4016793265837188
-4.387386080429377e-17
r:[-1.865174681370263e-14, -3.552713678800501e-15]
2-element Vector{Float64}:
0.09090909090908994
0.6363636363636367
$\beta$ simplification
$\beta$ can be further simplified by the following steps. \(\begin{equation} r_{j+1} = r_j - \alpha_jAd_j \tag{1}\label{eq:residual} \end{equation}\)
\[\begin{aligned} r_i^Tr_{j+1} &= r_i^Tr_j - \alpha_jr_i^TAd_j\\ r_i^TAd_j &= \frac{r_i^Tr_j - r_i^Tr_{j+1} }{\alpha_j} \\ &= \frac{- r_i^Tr_{j+1} }{\alpha_j} \ for \ i=j+1 &(residuals \ are \ mutually \ orthogonal)\\ \end{aligned}\]Using (\ref{eq:residual}) again \(\begin{aligned} r_{j+1} &= r_j - \alpha_jAd_j \\ d_j^Tr_{j+1} &= d_j^Tr_j - \alpha_jd_j^TAd_j\\ 0 &= d_j^Tr_j - \alpha_jd_j^TAd_j &(by \ choice \ of \ \alpha_j)\\ \alpha_jd_j^TAd_j &= d_j^Tr_j\\ using \ d_j^T &= u_j^T+\sum_{k=0}^{j-1}\beta_{jk} d_k^T\\ d_j^Tr_j &= u_j^Tr_j+\sum_{k=0}^{j-1}\beta_{jk} d_k^Tr_j\\ d_j^Tr_j &= u_j^Tr_j &(d_k^Tr_j=0 \ for \ k<j)\\ \alpha_jd_j^TAd_j &= r_j^Tr_j &(u_j=r_j \ in \ CG)\\ \end{aligned}\) Hence, \(\begin{aligned} \beta_{j+1} &= -\frac{r_{j+1}^TAd_j}{d_j^TAd_j} = \frac{r_{j+1}^Tr_{j+1}}{\alpha_jd_j^TAd_j} = \frac{r_{j+1}^Tr_{j+1}}{r_{j}^Tr_{j}} \end{aligned}\)
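As a quick numerical check of this identity (reusing the $A$, $b$ and starting point from the examples above), a single CG step gives the same $\beta$ from both expressions:
A = [4 1;1 3.]
b = [1, 2.]
x = [-8, -8.]
r = b - A*x                    # initial residual, also the first direction
d = r
α = d'*r / (d'*A*d)
r_new = r - α*A*d              # residual after one step
-(r_new'*A*d) / (d'*A*d)       # β from the Gram-Schmidt form
(r_new'*r_new) / (r'*r)        # β from the simplified form: same value (≈ 0.00448)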
CG summary
To summarise, starting from $d_0 = r_0 = b-Ax_0$, the conjugate gradient algorithm repeats the following steps until the residual is small enough:
- $\alpha_j = \frac{d_j^Tr_j}{d_j^TAd_j}$
- $x_{j+1} = x_j+\alpha_jd_j$
- $r_{j+1} = r_j-\alpha_jAd_j$
- $\beta_{j+1} = \frac{r_{j+1}^Tr_{j+1}}{r_j^Tr_j}$
- $d_{j+1} = r_{j+1}+\beta_{j+1}d_j$
Julia implementation example
function conjugate_grad2(A, x, b)
    r = -(A*x-b)                  # initial residual r = b - Ax
    d = r
    while sum(abs2.(r))>0.1
        α = d'*r/(d'*A*d)
        println(x.+α*d)
        println(α)
        x = x .+ α*d
        r_new = r .- α*A*d
        β = r_new'*r_new/(r'*r)   # simplified β: no extra multiplication by A
        d = r_new .+ β*d
        r = r_new
        println(β)
        println("r:",r)
    end
    x
end
conjugate_grad2([4 1;1 3.], [-8,-8.], [1,2.])
Output:
[0.961248073959938, -0.5687211093990756]
0.21856702619414484
0.004482189026141918
r:[-2.276271186440681, 2.744915254237288]
[0.09090909090908939, 0.6363636363636365]
0.4159323228762778
3.101851013516526e-31
r:[1.7763568394002505e-15, 8.881784197001252e-16]
2-element Vector{Float64}:
0.09090909090908939
0.6363636363636365
Finally, we plot the contours of the quadratic and visualise the CG steps.
function plot_quadf(A, b, c)
    x = -10:1:10
    y = -10:1:10
    # Evaluate f on the grid; Z[j, i] holds f at (x[i], y[j]), as contour expects
    Z = [0.5*[i; j]'*A*[i; j] - b'*[i; j] + c for j in y, i in x]
    contour(x, y, Z)
    # Overlay the CG iterates from the run above (starting at [-8, -8])
    plot!([-8, 0.961248073959938, 0.09090909090908283],
          [-8, -0.5687211093990756, 0.636], linewidth=2)
end
plot_quadf([4 1;1 3.], [1,2.], 0)