5 Power Method and Markov Matrices
5.1 Markov Matrices
We will soon discuss iterative methods for solving some problems in linear algebra. These roughly work by taking a problem and a guess as input, then carrying out some operations which transform the guess into a (hopefully) better guess.
To motivate considering this kind of method, we will take a brief detour into Markov matrices, because they furnish a natural example of iteration in linear algebra, and of how iterates can calculate certain quantities.
Definition 5.1 A matrix \(M\) is called a Markov matrix if all its entries are nonnegative real numbers, and the entries in every column sum to \(1\). That is, for all \(j\), \[(\textrm{column } j \textrm{ sum}) = \sum_{i=1}^n M_{ij} = 1.\]
In general, a linear transformation \(T:V\rightarrow V\) is said to be Markov with respect to an ordered basis \(\beta\) if the matrix \([T]_\beta\) is a Markov matrix.
The Markov property is sensitive to the choice of basis; it is almost never preserved by a change of basis.
Example 5.1 Consider a Markov matrix \(M\) and a change-of-basis matrix \(J\) as follows: \[M = \begin{bmatrix} 0.1 & 0.5\\ 0.9 & 0.5 \end{bmatrix}\ \ \ \ J = \begin{bmatrix} 2 & 1\\ 5 & 3 \end{bmatrix}\]
With respect to the new basis \(\beta\), the matrix becomes \[M_\beta = J M J^{-1} = \begin{bmatrix} -4.2 & 1.9\\ -10.4 & 4.8 \end{bmatrix},\] which is definitely not a Markov matrix!
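You can verify this conjugation quickly by machine; a minimal check with NumPy:

```python
import numpy as np

M = np.array([[0.1, 0.5],
              [0.9, 0.5]])
J = np.array([[2.0, 1.0],
              [5.0, 3.0]])

M_beta = J @ M @ np.linalg.inv(J)
print(M_beta)              # [[ -4.2  1.9], [-10.4  4.8]]
print(M_beta.sum(axis=0))  # columns no longer sum to 1
```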
Nevertheless, many important properties of \(M\) are basis-independent, and we’ve already seen that sometimes interesting properties of a matrix or linear map are easier to calculate from a better choice of basis – you just have to be careful.
The usual way to think about the entries of \(M\) is as probabilities of transitioning from one state to another: the entry \(M_{ij}\) is the probability of making the transition \(i\leftarrow j\), from state \(j\) to state \(i\). Note that the arrow goes left: with this convention, column \(j\) lists the probabilities of leaving state \(j\), which is why each column sums to \(1\).
In many cases, one can draw a nice “state diagram” to visualize this information.
Example 5.2 Consider a simple model of the weather. There are two states, sunny and rainy. If it is raining today, it is probably less likely to rain tomorrow (clouds run out of water, say), so from a rainy day there is a \(20\%\) chance of rain the next day and an \(80\%\) chance of sun. If it is currently sunny, then there might be a \(60\%\) chance of sun again, and a \(40\%\) chance of rain.
The state diagram consists of two nodes, sunny and rainy, with arrows between them (and a loop at each node) labeled by the four transition probabilities above.
Use the usual basis \(e_1 =\begin{bmatrix}1\\0\end{bmatrix}\) and \(e_2 = \begin{bmatrix}0\\1\end{bmatrix}\), where we let \(e_1\) correspond to sunny weather and \(e_2\) to rainy weather. Then, with respect to this basis, the Markov matrix is
\[M = \begin{bmatrix} 0.6 & 0.8\\ 0.4 & 0.2 \end{bmatrix}.\]
Continuing the weather example, applying \(M\) to the sun vector gives \[Me_1 = \begin{bmatrix} 0.6 \\ 0.4 \end{bmatrix} = 0.6 e_1 + 0.4 e_2,\] in other words, \(0.6\) sun and \(0.4\) rain, a prediction of the weather tomorrow, if it were sunny today.
If we repeat this process, \[M^2e_1 = \begin{bmatrix} 0.68 \\ 0.32 \end{bmatrix}, \ \ M^3e_1 = \begin{bmatrix} 0.664 \\ 0.336 \end{bmatrix},\ \ ...\] we obtain a prediction of the weather two days after a sunny day, three days after, and so on.
With a computer's help, we can compute many iterates quickly; here is a minimal sketch in NumPy:
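```python
import numpy as np

# The weather matrix from Example 5.2 (e1 = sunny, e2 = rainy).
M = np.array([[0.6, 0.8],
              [0.4, 0.2]])

x = np.array([1.0, 0.0])  # e1: today is sunny
for n in range(1, 11):
    x = M @ x
    print(f"M^{n} e1 = {x}")
```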
If we run the process longer and longer, we get a vector describing the expected distribution of sunny and rainy days in the long run. This is a limit! The numbers suggest \[\lim_{n\rightarrow \infty} M^n e_1 = \begin{bmatrix} 2/3\\1/3 \end{bmatrix},\]
so if today is a sunny day, we expect to have sun two thirds of the time and rain one third of the time.
Intuitively, the long-term distribution of the weather shouldn't depend on today's particular weather. And indeed, you can check quickly that \[\lim_{n\rightarrow \infty} M^n e_2 = \begin{bmatrix} 2/3\\1/3 \end{bmatrix}\] as well!
In fact, this is even an eigenvector for \(M\), with eigenvalue \(1\).
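Indeed, a direct check:
\[M\begin{bmatrix} 2/3\\1/3 \end{bmatrix} = \begin{bmatrix} 0.6\cdot \tfrac{2}{3} + 0.8\cdot \tfrac{1}{3}\\ 0.4\cdot \tfrac{2}{3} + 0.2\cdot \tfrac{1}{3} \end{bmatrix} = \begin{bmatrix} \tfrac{6}{15} + \tfrac{4}{15}\\ \tfrac{4}{15} + \tfrac{1}{15} \end{bmatrix} = \begin{bmatrix} 2/3\\1/3 \end{bmatrix}.\]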
So in this example, iteration calculates an eigenvector. The limit doesn't seem to depend on the starting point, and its entries are special: they're the long-term distribution of the states.
Indeed, if such a limit exists, then it must be an eigenvector:
Theorem 5.1 Let \(M\) be a matrix and \(v\) a vector. If \(\lim_{n\to\infty} M^n v\) exists, then it is either zero or an eigenvector with eigenvalue \(1\).
Proof. On one hand, the limit doesn’t change if we shift the index, so \[\lim_{n\rightarrow \infty} M^{n+1} v = \lim_{n\rightarrow \infty} M^n v.\]
But also, linear maps are continuous, so we can factor out an \(M\) on the left hand side, resulting in \[M\left(\lim_{n\rightarrow \infty} M^n v\right) = \lim_{n\rightarrow \infty} M^n v.\]
If the limit isn’t zero, then this means it is an eigenvector with eigenvalue \(1\).
So long-term behavior is an eigenvector, which motivates defining
Definition 5.2 A vector \(e\) is called an equilibrium for a linear map \(T\) if \(T(e) = e\). Equivalently, \(e\) is an eigenvector with eigenvalue \(1\).
Our probabilistic reasoning suggests that a Markov matrix \(M\) always has \(1\) as an eigenvalue. It’s possible to prove that the limit above almost always converges to such an eigenvector, but there’s also a simple (but sneaky) algebraic proof that \(1\) is an eigenvalue.
Remember that the entries in the columns of a Markov matrix sum to \(1\). In other words, if you add up all the rows (as vectors) you get the vector \((1,1,...,1)\). Row operations correspond to left multiplication, and the multiplication for “add up all the rows” is precisely multiplying by \((1,1,...,1)\). In other words: \[(1,1,...,1) M = (1,1,...,1),\] which looks just like an eigenvector with eigenvalue \(1\), but on the left of the matrix!
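For the weather matrix, for example: \[(1,1)\begin{bmatrix} 0.6 & 0.8\\ 0.4 & 0.2 \end{bmatrix} = (0.6 + 0.4,\ 0.8 + 0.2) = (1,1).\]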
Indeed, one can define left eigenvectors and eigenvalues and eigenspaces of matrices; this is the same as a right eigenvector/eigenvalue for \(M^T\), or equivalently eigenvectors/values of the dual transformation. The characteristic polynomial \(\det(xI-M)\) is unchanged by transposing, so it detects both left and right eigenvalues – every left eigenvalue is a right eigenvalue and vice versa.
Therefore, the fact that there's an easy-to-locate left eigenvector with eigenvalue \(1\) means that there must be a right eigenvector with eigenvalue \(1\). Even though the eigenvalues are the same on both sides, the eigenvectors have essentially nothing to do with each other – all Markov matrices have the same \((1,1,...,1)\) left eigenvector, but their right eigenvectors are essentially unconstrained (other than summing to \(1\)).
So there are two surprising features of this proof: first, that it depends quite strongly on the choice of basis that makes the matrix Markov, even though eigenvalues are a basis-independent quantity, and second, the translation from left eigenvalues to right eigenvalues.
5.2 Further Markov Facts
With some more advanced tools from linear algebra, it’s actually possible to prove more (on your homework, you verify the \(2\times 2\) versions of some of these by hand). I’ve included some names/references in case you’re curious. Maybe we will return to some of them later in the semester.
- all eigenvalues of a Markov matrix have norm at most \(1\) [Gershgorin Circle Theorem]
- an eigenvalue of a Markov matrix with norm \(1\) must be \(1\) if the matrix has no zeros on the diagonal [Gershgorin Circle Theorem]
- the eigenvalue \(1\) occurs with multiplicity one if \(M\) has no zero entries, or even if some power of \(M\) has no zero entries [Perron-Frobenius Theorem]
When the matrix is diagonalizable, these facts let you prove that the limit above really does converge to an eigenvector with eigenvalue \(1\) for almost all starting points (see homework). The only exceptions are starting points in the span of the other eigenvectors, in which case the limit is zero. In fact, even if the matrix is not diagonalizable, these facts can still be combined with the Jordan Normal Form to show that the limit exists.
The condition about diagonals is natural: it says that from any state you have a chance of staying put, which isn't too uncommon. When that's the case, the condition that some power of \(M\) has no zero entries is also fairly reasonable: it says that between any two states there is a path through the state diagram, following the directions of the arrows. If that fails, the diagram can often be divided up into smaller pieces that you can study individually.
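These facts are easy to test numerically. A sketch with NumPy (the size and random seed here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random 5x5 Markov matrix: nonnegative entries, columns normalized to sum to 1.
A = rng.random((5, 5))
M = A / A.sum(axis=0)

# All eigenvalue norms should be at most 1, with the largest exactly 1.
eigenvalues = np.linalg.eigvals(M)
print(np.sort(np.abs(eigenvalues))[::-1])
```

Since this \(M\) has no zero entries, the Perron-Frobenius Theorem also says the eigenvalue \(1\) appears exactly once in the output.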
5.3 Power Method
The limit calculation above suggests a more general way to calculate eigenvectors and eigenvalues. To start, let’s analyze the Markov case slightly more carefully.
Theorem 5.2 Let \(M\) be a Markov matrix, representing a map from \(V=\mathbb R^n\) to itself, and suppose it is diagonalizable. All of its eigenvalues have norm at most \(1\), and one of its eigenvalues is \(1\). If this is the unique largest eigenvalue and has associated eigenvector \(e\), then for almost all \(x\) (off a set of codimension at least \(1\)) the limit \[\lim_{n\to\infty} M^n x\] exists and is a scalar multiple of \(e\). To be precise, this is the case as long as \(x\) is not in the span of the other eigenvectors.
Proof. Apply the Gershgorin circle theorem to \(M^T\): this tells us all eigenvalues have norm at most \(1\), and we already know that \(1\) is an eigenvalue. Since \(M\) is diagonalizable, there is an eigenbasis for \(M\): \(n\) linearly independent eigenvectors \(v_1,...,v_n\) with eigenvalues \(\lambda_1,...,\lambda_n\). Assume \(\lambda_1 = 1\) is the largest, so \(v_1=e\).
Given \(x\in V\), write it in terms of the basis as \[x = \sum_{i=1}^n x_i v_i.\]
It’s easy to apply \(M\) to \(x\) in terms of this basis:
\[\begin{align*} Mx &= M\left(\sum_{i=1}^n x_i v_i\right),\\ &= \sum_{i=1}^n x_i Mv_i,\\ &= \sum_{i=1}^n x_i \lambda_iv_i. \end{align*}\]
If we apply \(M\) again, there will be another \(\lambda_i\) in the \(i\)th term. Therefore,
\[\begin{align*} M^nx &= M^n\left(\sum_{i=1}^n x_i v_i\right),\\ &= \sum_{i=1}^n x_i M^nv_i,\\ &= \sum_{i=1}^n x_i \lambda_i^nv_i. \end{align*}\]
When \(i > 1\), we have assumed \(|\lambda_i| < 1\), and so \(\lambda_i^n \to 0\) as \(n\to\infty\). Meanwhile, \(\lambda_1=1\), so the first term is unchanged and in the limit we are left with just \(x_1v_1 = x_1 e\). This is nonzero when \(x_1\) is nonzero, precisely when \(x\) is not in the span of the other eigenvectors, as was to be shown.
In fact, this is still true when \(M\) is not diagonalizable, as long as its largest eigenvalue is unique. One can use either the Jordan Normal Form or the upper triangularization trick that appeared when relating norms to spectral radius at the end of the previous chapter.
Applying the same argument to matrices with less restricted eigenvalues, we see that the largest one (and its associated eigenvector) will dominate in the limit: if \(\lambda\) is the largest eigenvalue, then \(|M^n x|\) grows like \(|\lambda|^n\). This poses some issues, though, because it means that the limit will go to zero if the largest eigenvalue has norm less than \(1\), or fail to converge if the norm is larger than \(1\). A tempting normalization would be to use \[\left(\frac{1}{\lambda} M\right)^n,\] where \(\lambda\) is the largest eigenvalue, which would put us back in a situation like the Markov case. However, it would be better to come up with an iterative process that doesn't use \(\lambda\), because calculating that value isn't easy.
The trick is to normalize the sequence \(M^n x\) in terms of itself: its terms grow, but proportionally.
Theorem 5.3 Let \(V=\mathbb R^n\) and \(M:V\to V\) a linear map. Assume that \(M\) is diagonalizable and has a unique largest (in absolute value) eigenvalue \(\lambda\), which is a positive real number. Then for almost all \(x\in V\), the following limit exists and is an eigenvector for \(\lambda\): \[\lim_{n\to\infty} \frac{M^nx}{|M^nx|}.\]
Proof. As before, pick an eigenbasis \(v_1,...,v_n\) with eigenvalues \(\lambda_1,...,\lambda_n\) and let \(\lambda_1 = \lambda\). Given \(x\in V\), write it in terms of the basis with coordinates \(x_i\). Then we can calculate:
\[M^n x = \sum_{i=1}^n x_i \lambda_i^n v_i.\]
Intuitively, and this can be made precise, \(\lambda_1^n\) is much larger than the other terms which change, and so \(|M^n x| \approx |x_1|\,|v_1|\,\lambda_1^n\). Here, positivity appears: \(|\lambda_1|=\lambda_1\), and the same for all its powers. Using this estimate,
\[\begin{align*} \frac{M^n x}{|M^n x|} &\approx \sum_{i=1}^n \frac{x_i \lambda_i^n}{|x_1|\,|v_1|\,\lambda_1^n} v_i,\\ &= \frac{x_1}{|x_1|\,|v_1|} v_1 + \sum_{i=2}^n \frac{x_i}{|x_1|\,|v_1|}\left(\frac{\lambda_i}{\lambda_1}\right)^n v_i. \end{align*}\]
By assumption, \(|\lambda_i/\lambda_1| < 1\) when \(i\neq 1\), so powers of those terms go to zero, and hence \[\frac{M^n x}{|M^n x|} \to \frac{x_1}{|x_1||v_1|} v_1\] as \(n\to \infty\). This is an eigenvector with eigenvalue \(\lambda_1\) as long as \(x_1\neq 0\), which happens for almost all vectors \(x\).
Corollary 5.1 Continue from the assumptions in Theorem 5.3. For almost all \(x\), we have \[\lim_{n\to\infty} \frac{|M^{n+1}x|}{|M^n x|} = \lambda.\]
Proof. From Theorem 5.3, we know that
\[\lim_{n\to\infty} \frac{M^nx}{|M^n x|}\]
is an eigenvector with eigenvalue \(\lambda\), and so
\[ M \left(\lim_{n\to\infty} \frac{M^nx}{|M^n x|}\right) = \lambda \lim_{n\to\infty} \frac{M^nx}{|M^n x|}. \]
Move the \(M\) inside the limit on the left by continuity, and take the norm (also continuous) of both sides:
\[ \left(\lim_{n\to\infty} \frac{|M^{n+1}x|}{|M^n x|}\right) = |\lambda| \lim_{n\to\infty} \frac{|M^nx|}{|M^n x|}. \]
The left hand side is the limit in question, while the right hand side is just \(|\lambda| = \lambda\), because \(\lambda\) is positive.
The restriction to positive real eigenvalues is required for the sequence in the main theorem to converge; if \(\lambda\) has a nonzero angular component, iteration will cause the vector to rotate around in complex coordinates even if we normalize. In an inner product space, one can account for this.
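Putting Theorem 5.3 and Corollary 5.1 together gives a practical algorithm. Here is a minimal sketch in NumPy (the function name and the fixed step count are illustrative choices, not part of the method):

```python
import numpy as np

def power_method(M, x, steps=50):
    # Iterate x -> Mx/|Mx|, tracking the ratio |M^(n+1)x| / |M^n x|.
    x = x / np.linalg.norm(x)
    lam = 0.0
    for _ in range(steps):
        y = M @ x                # one matrix-vector product per step
        lam = np.linalg.norm(y)  # = |M^(n+1)x| / |M^n x|, since x is a unit vector
        x = y / lam              # renormalize rather than letting the iterates grow
    return lam, x

# The weather matrix again: the method recovers lambda = 1
# and a unit vector proportional to (2/3, 1/3).
M = np.array([[0.6, 0.8],
              [0.4, 0.2]])
lam, v = power_method(M, np.array([1.0, 0.0]))
print(lam, v)  # approximately 1.0 and [0.894, 0.447]
```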
To use this method, it is much faster to evaluate \(Mx\), then \(M(Mx)\), and so on (as the sketch above does, one matrix-vector product per step), rather than to compute the power \(M^n\). The error term can be bounded in terms of fairly natural quantities.
Corollary 5.2 Suppose that \(M\) has a unique largest eigenvalue \(\lambda\), which is positive and real, as in Theorem 5.3. Let \(\lambda'\) be its second largest eigenvalue (in absolute value). There is some constant \(C\) such that the error term in the power method after \(n\) steps is at most \(C |\lambda'/\lambda|^n\).
Proof. The (absolute) error at step \(n\) is the norm of the difference between the limiting eigenvector \(\lim_{n\to\infty} M^nx/|M^nx|\) and the step-\(n\) vector \(M^nx/|M^nx|\). Inspecting the proof of Theorem 5.3, we see that this quantity appeared already: it's the sum which goes to zero as \(n\) goes to \(\infty\). So the error is the norm of
\[\sum_{i=2}^n \frac{x_i}{|x_1|\,|v_1|}\left(\frac{\lambda_i}{\lambda}\right)^n v_i.\]
To bound this, we use the triangle inequality and the fact that \(|\lambda_i/\lambda_1 | \leq |\lambda'/\lambda|\) when \(i\neq 1\). This becomes
\[\begin{align*} |\text{error}| &\leq \sum_{i=2}^n \frac{|x_i|\,|v_i|}{|x_1|\,|v_1|}\left|\frac{\lambda'}{\lambda}\right|^n,\\ &= \left(\sum_{i=2}^n \frac{|x_i|\,|v_i|}{|x_1|\,|v_1|}\right)\left|\frac{\lambda'}{\lambda}\right|^n. \end{align*}\]
Thus we may take \(C\) to be the sum.
Unfortunately, \(C\) can’t really be determined in practice, because it depends on \(x\) and the eigenbasis. At best, the dependence on the specific eigenbasis essentially washes out because it can only be changed by scaling the \(v_i\), which causes a commensurate rescaling of \(x_i\). But that does not actually help pin down \(C\) very well in terms of \(M\) and \(x\).
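The rate \(|\lambda'/\lambda|^n\) is easy to observe numerically, though. A sketch with a hand-built example (the matrix below has eigenvalues \(2\) and \(1/2\), with eigenvectors \((1,1)\) and \((1,-1)\), so each step should shrink the error by about \(1/4\)):

```python
import numpy as np

M = np.array([[1.25, 0.75],
              [0.75, 1.25]])                 # eigenvalues 2 and 1/2
target = np.array([1.0, 1.0]) / np.sqrt(2)   # unit eigenvector for lambda = 2

x = np.array([1.0, 0.0])
for n in range(1, 9):
    x = M @ x
    x = x / np.linalg.norm(x)                # normalized power iteration
    print(f"step {n}: error = {np.linalg.norm(x - target):.2e}")
```

Successive errors drop by roughly a factor of \(4\), matching \(|\lambda'/\lambda| = 1/4\).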