5 Power Method and Markov Matrices
5.1 Markov Matrices
We will soon discuss iterative methods for solving some problems in linear algebra. These roughly work by taking a problem and a guess as input, then carrying out some operations which transform the guess into a (hopefully) better guess.
To motivate considering this kind of method, we will take a brief detour into Markov matrices, because they furnish a natural example of iteration in linear algebra, and of how iterates can calculate certain quantities.
Definition 5.1 A matrix \(M\) is called a Markov matrix if all its entries are nonnegative real numbers, and the entries in every column sum to \(1\). That is, for all \(j\), \[(\textrm{column } j \textrm{ sum}) = \sum_{i=1}^n M_{ij} = 1.\]
In general, a linear transformation \(T:V\rightarrow V\) is said to be Markov with respect to an ordered basis \(\beta\) if the matrix \([T]_\beta\) is a Markov matrix.
The Markov property is sensitive to the choice of basis; it is almost never preserved by a change of basis.
Example 5.1 Consider a Markov matrix \(M\) and a change-of-basis matrix \(J\) as follows: \[M = \begin{bmatrix} 0.1 & 0.5\\ 0.9 & 0.5 \end{bmatrix}\ \ \ \ J = \begin{bmatrix} 2 & 1\\ 5 & 3 \end{bmatrix}\]
With respect to the new basis \(\beta\), the matrix becomes \[M_\beta = J M J^{-1} = \begin{bmatrix} -4.2 & 1.9\\ -10.4 & 4.8 \end{bmatrix},\] which is definitely not a Markov matrix!
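You can verify this conjugation quickly by machine; a minimal check with NumPy:

```python
import numpy as np

M = np.array([[0.1, 0.5],
              [0.9, 0.5]])
J = np.array([[2.0, 1.0],
              [5.0, 3.0]])

M_beta = J @ M @ np.linalg.inv(J)
print(M_beta)              # [[ -4.2  1.9], [-10.4  4.8]]
print(M_beta.sum(axis=0))  # columns no longer sum to 1
```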
Nevertheless, many important properties of \(M\) are basis-independent, and we’ve already seen that sometimes interesting properties of a matrix or linear map are easier to calculate from a better choice of basis – you just have to be careful.
The usual way to think about the entries of \(M\) is as probabilities of transitioning from one state to another: the entry \(M_{ij}\) is the probability of making the transition \(i\leftarrow j\), from state \(j\) to state \(i\). Note that the arrow goes left: with this convention, column \(j\) lists the probabilities of leaving state \(j\), which is why each column sums to \(1\).
In many cases, one can draw a nice “state diagram” to visualize this information.
Example 5.2 Consider a simple model of the weather. There are two states, sunny and rainy. If it is raining today, it is probably less likely to rain tomorrow (clouds run out of water, say), so from a rainy day there is a \(20\%\) chance of rain the next day and an \(80\%\) chance of sun. If it is currently sunny, then there might be a \(60\%\) chance of sun again, and a \(40\%\) chance of rain.
The state diagram consists of two nodes, sunny and rainy, with arrows between them (and a loop at each node) labeled by the four transition probabilities above.
Use the usual basis \(e_1 =\begin{bmatrix}1\\0\end{bmatrix}\) and \(e_2 = \begin{bmatrix}0\\1\end{bmatrix}\), where we let \(e_1\) correspond to sunny weather and \(e_2\) to rainy weather. Then, with respect to this basis, the Markov matrix is
\[M = \begin{bmatrix} 0.6 & 0.8\\ 0.4 & 0.2 \end{bmatrix}.\]
Continuing the weather example, applying \(M\) to the sun vector gives \[Me_1 = \begin{bmatrix} 0.6 \\ 0.4 \end{bmatrix} = 0.6 e_1 + 0.4 e_2,\] in other words, \(0.6\) sun and \(0.4\) rain, a prediction of the weather tomorrow, if it were sunny today.
If we repeat this process, \[M^2e_1 = \begin{bmatrix} 0.68 \\ 0.32 \end{bmatrix}, \ \ M^3e_1 = \begin{bmatrix} 0.664 \\ 0.336 \end{bmatrix},\ \ ...\] we obtain a prediction of the weather two days after a sunny day, three days after, and so on.
With a computer's help, we can compute many iterates quickly; here is a minimal sketch in NumPy:
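```python
import numpy as np

# The weather matrix from Example 5.2 (e1 = sunny, e2 = rainy).
M = np.array([[0.6, 0.8],
              [0.4, 0.2]])

x = np.array([1.0, 0.0])  # e1: today is sunny
for n in range(1, 11):
    x = M @ x
    print(f"M^{n} e1 = {x}")
```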
If we run the process longer and longer, we get a vector describing the expected distribution of sunny and rainy days in the long run. This is a limit! The numbers suggest \[\lim_{n\rightarrow \infty} M^n e_1 = \begin{bmatrix} 2/3\\1/3 \end{bmatrix},\]
so if today is a sunny day, we expect to have sun two thirds of the time and rain one third of the time.
Intuitively, the long-term distribution of the weather shouldn't depend on today's particular weather. And indeed, you can check quickly that \[\lim_{n\rightarrow \infty} M^n e_2 = \begin{bmatrix} 2/3\\1/3 \end{bmatrix}\] as well!
In fact, this is even an eigenvector for \(M\), with eigenvalue \(1\).
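Indeed, a direct check:
\[M\begin{bmatrix} 2/3\\1/3 \end{bmatrix} = \begin{bmatrix} 0.6\cdot \tfrac{2}{3} + 0.8\cdot \tfrac{1}{3}\\ 0.4\cdot \tfrac{2}{3} + 0.2\cdot \tfrac{1}{3} \end{bmatrix} = \begin{bmatrix} \tfrac{6}{15} + \tfrac{4}{15}\\ \tfrac{4}{15} + \tfrac{1}{15} \end{bmatrix} = \begin{bmatrix} 2/3\\1/3 \end{bmatrix}.\]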
So in this example, iteration calculates an eigenvector. The limit doesn't seem to depend on the starting point, and its entries are special: they're the long-term distribution of the states.
Indeed, if such a limit exists, then it must be an eigenvector:
Theorem 5.1 Let \(M\) be a matrix and \(v\) a vector. If \(\lim_{n\to\infty} M^n v\) exists, then it is either zero or an eigenvector with eigenvalue \(1\).
Proof. On one hand, the limit doesn’t change if we shift the index, so \[\lim_{n\rightarrow \infty} M^{n+1} v = \lim_{n\rightarrow \infty} M^n v.\]
But also, linear maps are continuous, so we can factor out an \(M\) on the left hand side, resulting in \[M\left(\lim_{n\rightarrow \infty} M^n v\right) = \lim_{n\rightarrow \infty} M^n v.\]
If the limit isn’t zero, then this means it is an eigenvector with eigenvalue \(1\).
So long-term behavior is an eigenvector, which motivates defining
Definition 5.2 A vector \(e\) is called an equilibrium for a linear map \(T\) if \(T(e) = e\). Equivalently, \(e\) is an eigenvector with eigenvalue \(1\).
Our probabilistic reasoning suggests that a Markov matrix \(M\) always has \(1\) as an eigenvalue. It’s possible to prove that the limit above almost always converges to such an eigenvector, but there’s also a simple (but sneaky) algebraic proof that \(1\) is an eigenvalue.
Remember that the entries in the columns of a Markov matrix sum to \(1\). In other words, if you add up all the rows (as vectors) you get the vector \((1,1,...,1)\). Row operations correspond to left multiplication, and the multiplication for “add up all the rows” is precisely multiplying by \((1,1,...,1)\). In other words: \[(1,1,...,1) M = (1,1,...,1),\] which looks just like an eigenvector with eigenvalue \(1\), but on the left of the matrix!
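For the weather matrix, for example: \[(1,1)\begin{bmatrix} 0.6 & 0.8\\ 0.4 & 0.2 \end{bmatrix} = (0.6 + 0.4,\ 0.8 + 0.2) = (1,1).\]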
Indeed, one can define left eigenvectors and eigenvalues and eigenspaces of matrices; this is the same as a right eigenvector/eigenvalue for \(M^T\), or equivalently eigenvectors/values of the dual transformation. The characteristic polynomial \(\det(xI-M)\) is unchanged by transposing, so it detects both left and right eigenvalues – every left eigenvalue is a right eigenvalue and vice versa.
Therefore, the fact that there's an easy-to-locate left eigenvector with eigenvalue \(1\) means that there must be a right eigenvector with eigenvalue \(1\). Even though the eigenvalues are the same on both sides, the eigenvectors have essentially nothing to do with each other – all Markov matrices have the same \((1,1,...,1)\) left eigenvector, but their right eigenvectors are essentially unconstrained (other than summing to \(1\)).
So there are two surprising features of this proof: first, that it depends quite strongly on the choice of basis that makes the matrix Markov, even though eigenvalues are a basis-independent quantity, and second, the translation from left eigenvalues to right eigenvalues.
5.2 Further Markov Facts
With some more advanced tools from linear algebra, it’s actually possible to prove more (on your homework, you verify the \(2\times 2\) versions of some of these by hand). I’ve included some names/references in case you’re curious. Maybe we will return to some of them later in the semester.
- all eigenvalues of a Markov matrix have norm at most \(1\) [Gershgorin Circle Theorem]
- an eigenvalue of a Markov matrix with norm \(1\) must be \(1\) if the matrix has no zeros on the diagonal [Gershgorin Circle Theorem]
- the eigenvalue \(1\) occurs with multiplicity one if \(M\) has no zero entries, or even if some power of \(M\) has no zero entries [Perron-Frobenius Theorem]
When the matrix is diagonalizable, these facts let you prove that the limit above really does converge to an eigenvector with eigenvalue \(1\) for almost all starting points (see homework). The only exceptions are starting points in the span of the other eigenvectors, in which case the limit is zero. In fact, even if the matrix is not diagonalizable, these facts can still be combined with the Jordan Normal Form to show that the limit exists.
The condition about diagonals is natural: it says that from any state you have a chance of staying put, which isn't too uncommon. When that's the case, the condition that some power of \(M\) has no zero entries is also fairly reasonable: it says that between any two states there is a path through the state diagram, following the directions of the arrows. If that fails, the diagram can often be divided up into smaller pieces that you can study individually.
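These facts are easy to test numerically. A sketch with NumPy (the size and random seed here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random 5x5 Markov matrix: nonnegative entries, columns normalized to sum to 1.
A = rng.random((5, 5))
M = A / A.sum(axis=0)

# All eigenvalue norms should be at most 1, with the largest exactly 1.
eigenvalues = np.linalg.eigvals(M)
print(np.sort(np.abs(eigenvalues))[::-1])
```

Since this \(M\) has no zero entries, the Perron-Frobenius Theorem also says the eigenvalue \(1\) appears exactly once in the output.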
5.3 Power Method
The limit calculation above suggests a more general way to calculate eigenvectors and eigenvalues. To start, let’s analyze the Markov case slightly more carefully.
Theorem 5.2 Let \(M\) be a Markov matrix, representing a map from \(V=\mathbb R^n\) to itself, and suppose it is diagonalizable. All of its eigenvalues have norm at most \(1\), and one of its eigenvalues is \(1\). If this is the unique largest eigenvalue and has associated eigenvector \(e\), then for almost all \(x\) (off a set of codimension at least \(1\)) the limit \[\lim_{n\to\infty} M^n x\] exists and is a scalar multiple of \(e\). To be precise, this is the case as long as \(x\) is not in the span of the other eigenvectors.
Proof. Apply the Gershgorin circle theorem to \(M^T\): this tells us all eigenvalues have norm at most \(1\), and we already know that \(1\) is an eigenvalue. Since \(M\) is diagonalizable, there is an eigenbasis for \(M\): \(n\) linearly independent eigenvectors \(v_1,...,v_n\) with eigenvalues \(\lambda_1,...,\lambda_n\). Assume \(\lambda_1 = 1\) is the largest, so \(v_1=e\).
Given \(x\in V\), write it in terms of the basis as \[x = \sum_{i=1}^n x_i v_i.\]
It’s easy to apply \(M\) to \(x\) in terms of this basis:
\[\begin{align*} Mx &= M\left(\sum_{i=1}^n x_i v_i\right),\\ &= \sum_{i=1}^n x_i Mv_i,\\ &= \sum_{i=1}^n x_i \lambda_iv_i. \end{align*}\]
If we apply \(M\) again, there will be another \(\lambda_i\) in the \(i\)th term. Therefore,
\[\begin{align*} M^nx &= M^n\left(\sum_{i=1}^n x_i v_i\right),\\ &= \sum_{i=1}^n x_i M^nv_i,\\ &= \sum_{i=1}^n x_i \lambda_i^nv_i. \end{align*}\]
When \(i > 1\), we have assumed \(|\lambda_i| < 1\), and so \(\lambda_i^n \to 0\) as \(n\to\infty\). Meanwhile, \(\lambda_1=1\), so the first term is unchanged and in the limit we are left with just \(x_1v_1 = x_1 e\). This is nonzero when \(x_1\) is nonzero, precisely when \(x\) is not in the span of the other eigenvectors, as was to be shown.
In fact, this is still true when \(M\) is not diagonalizable, as long as its largest eigenvalue is unique. One can use either the Jordan Normal Form or the upper triangularization trick that appeared when relating norms to spectral radius at the end of the previous chapter.
Applying the same argument to matrices with less restricted eigenvalues, we see that the largest one (and its associated eigenvector) will dominate in the limit: if \(\lambda\) is the largest eigenvalue, then \(|M^n x|\) grows like \(|\lambda|^n\). This poses some issues, though, because it means that the limit will go to zero if the largest eigenvalue has norm less than \(1\), or fail to converge if the norm is larger than \(1\). A tempting normalization would be to use \[\left(\frac{1}{\lambda} M\right)^n,\] where \(\lambda\) is the largest eigenvalue, which would put us back in a situation like the Markov case. However, it would be better to come up with an iterative process that doesn't use \(\lambda\), because calculating that value isn't easy.
The trick is to normalize the sequence \(M^n x\) in terms of itself: its terms grow, but proportionally.
Theorem 5.3 Let \(V=\mathbb R^n\) and \(M:V\to V\) a linear map. Assume that \(M\) is diagonalizable and has a unique largest (in absolute value) eigenvalue \(\lambda\), which is a positive real number. Then for almost all \(x\in V\), the following limit exists and is an eigenvector for \(\lambda\): \[\lim_{n\to\infty} \frac{M^nx}{|M^nx|}.\]
Proof. As before, pick an eigenbasis \(v_1,...,v_n\) with eigenvalues \(\lambda_1,...,\lambda_n\) and let \(\lambda_1 = \lambda\). Given \(x\in V\), write it in terms of the basis with coordinates \(x_i\). Then we can calculate:
\[M^n x = \sum_{i=1}^n x_i \lambda_i^n v_i.\]
Intuitively, and this can be made precise, \(\lambda_1^n\) is much larger than the other terms which change, and so \(|M^n x| \approx |x_1|\,|v_1|\,\lambda_1^n\). Here, positivity appears: \(|\lambda_1|=\lambda_1\), and the same for all its powers. Using this estimate,
\[\begin{align*} \frac{M^n x}{|M^n x|} &\approx \sum_{i=1}^n \frac{x_i \lambda_i^n}{|x_1|\,|v_1|\,\lambda_1^n} v_i,\\ &= \frac{x_1}{|x_1|\,|v_1|} v_1 + \sum_{i=2}^n \frac{x_i}{|x_1|\,|v_1|}\left(\frac{\lambda_i}{\lambda_1}\right)^n v_i. \end{align*}\]
By assumption, \(|\lambda_i/\lambda_1| < 1\) when \(i\neq 1\), so powers of those terms go to zero, and hence \[\frac{M^n x}{|M^n x|} \to \frac{x_1}{|x_1||v_1|} v_1\] as \(n\to \infty\). This is an eigenvector with eigenvalue \(\lambda_1\) as long as \(x_1\neq 0\), which happens for almost all vectors \(x\).
Corollary 5.1 Continue from the assumptions in Theorem 5.3. For almost all \(x\), we have \[\lim_{n\to\infty} \frac{|M^{n+1}x|}{|M^n x|} = \lambda.\]
Proof. From Theorem 5.3, we know that
\[\lim_{n\to\infty} \frac{M^nx}{|M^n x|}\]
is an eigenvector with eigenvalue \(\lambda\), and so
\[ M \left(\lim_{n\to\infty} \frac{M^nx}{|M^n x|}\right) = \lambda \lim_{n\to\infty} \frac{M^nx}{|M^n x|}. \]
Move the \(M\) inside the limit on the left by continuity, and take the norm (also continuous) of both sides:
\[ \left(\lim_{n\to\infty} \frac{|M^{n+1}x|}{|M^n x|}\right) = |\lambda| \lim_{n\to\infty} \frac{|M^nx|}{|M^n x|}. \]
The left hand side is the limit in question, while the right hand side is just \(|\lambda| = \lambda\), because \(\lambda\) is positive.
The restriction to positive real eigenvalues is required for the sequence in the main theorem to converge; if \(\lambda\) has a nonzero angular component, iteration will cause the vector to rotate around in complex coordinates even if we normalize. In an inner product space, one can account for this.
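Putting Theorem 5.3 and Corollary 5.1 together gives a practical algorithm. Here is a minimal sketch in NumPy (the function name and the fixed step count are illustrative choices, not part of the method):

```python
import numpy as np

def power_method(M, x, steps=50):
    # Iterate x -> Mx/|Mx|, tracking the ratio |M^(n+1)x| / |M^n x|.
    x = x / np.linalg.norm(x)
    lam = 0.0
    for _ in range(steps):
        y = M @ x                # one matrix-vector product per step
        lam = np.linalg.norm(y)  # = |M^(n+1)x| / |M^n x|, since x is a unit vector
        x = y / lam              # renormalize rather than letting the iterates grow
    return lam, x

# The weather matrix again: the method recovers lambda = 1
# and a unit vector proportional to (2/3, 1/3).
M = np.array([[0.6, 0.8],
              [0.4, 0.2]])
lam, v = power_method(M, np.array([1.0, 0.0]))
print(lam, v)  # approximately 1.0 and [0.894, 0.447]
```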
To use this method, it is much faster to evaluate \(Mx\), then \(M(Mx)\), and so on (as the sketch above does, one matrix-vector product per step), rather than to compute the power \(M^n\). The error term can be bounded in terms of fairly natural quantities.
Corollary 5.2 Suppose that \(M\) has a unique largest eigenvalue \(\lambda\), which is positive and real, as in Theorem 5.3. Let \(\lambda'\) be its second largest eigenvalue (in absolute value). There is some constant \(C\) such that the error term in the power method after \(n\) steps is at most \(C |\lambda'/\lambda|^n\).
Proof. The (absolute) error at step \(n\) is the norm of the difference between the limiting eigenvector \(\lim_{n\to\infty} M^nx/|M^nx|\) and the step-\(n\) vector \(M^nx/|M^nx|\). Inspecting the proof of Theorem 5.3, we see that this quantity appeared already: it's the sum which goes to zero as \(n\) goes to \(\infty\). So the error is the norm of
\[\sum_{i=2}^n \frac{x_i}{|x_1|\,|v_1|}\left(\frac{\lambda_i}{\lambda}\right)^n v_i.\]
To bound this, we use the triangle inequality and the fact that \(|\lambda_i/\lambda_1 | \leq |\lambda'/\lambda|\) when \(i\neq 1\). This becomes
\[\begin{align*} |\text{error}| &\leq \sum_{i=2}^n \frac{|x_i|\,|v_i|}{|x_1|\,|v_1|}\left|\frac{\lambda'}{\lambda}\right|^n,\\ &= \left(\sum_{i=2}^n \frac{|x_i|\,|v_i|}{|x_1|\,|v_1|}\right)\left|\frac{\lambda'}{\lambda}\right|^n. \end{align*}\]
Thus we may take \(C\) to be the sum.
Unfortunately, \(C\) can’t really be determined in practice, because it depends on \(x\) and the eigenbasis. At best, the dependence on the specific eigenbasis essentially washes out because it can only be changed by scaling the \(v_i\), which causes a commensurate rescaling of \(x_i\). But that does not actually help pin down \(C\) very well in terms of \(M\) and \(x\).
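The rate \(|\lambda'/\lambda|^n\) is easy to observe numerically, though. A sketch with a hand-built example (the matrix below has eigenvalues \(2\) and \(1/2\), with eigenvectors \((1,1)\) and \((1,-1)\), so each step should shrink the error by about \(1/4\)):

```python
import numpy as np

M = np.array([[1.25, 0.75],
              [0.75, 1.25]])                 # eigenvalues 2 and 1/2
target = np.array([1.0, 1.0]) / np.sqrt(2)   # unit eigenvector for lambda = 2

x = np.array([1.0, 0.0])
for n in range(1, 9):
    x = M @ x
    x = x / np.linalg.norm(x)                # normalized power iteration
    print(f"step {n}: error = {np.linalg.norm(x - target):.2e}")
```

Successive errors drop by roughly a factor of \(4\), matching \(|\lambda'/\lambda| = 1/4\).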