Have you ever wondered why some websites are placed over others on Google? It’s not just a random order but rather an order determined by Google’s PageRank algorithm. The basic concept of the algorithm is that more important webpages have more links on other webpages. In other words, the more links pointing to a specific page, the more important it is in the PageRank algorithm. However, the math underlying the algorithm is fascinating. It utilizes a fundamental concept in linear algebra called Markov Chains.
Like much of linear algebra, Markov Chains are closely related to statistics and probability. The basic idea is that each state within a Markov Chain is dependent on the previous state given a set of probabilities that correspond to each participant.
For example, imagine there are three islands: A, B, C. There are \( p_{A} \) birds on A, \( p_{B} \) on B, and \( p_{C} \) on C. Each year for A, 20% travel to B, 30% travel to C, and 50% stay on the island. For B, 10% travel to A, 80% travel to C, and 10% stay on B. Finally, for C, 5% travel to A, 90% travel to B, and the remaining 5% stay on C. What would be the final populations of each island after two years?
Since the population changes each year due to the migration we must use Markov Chains. We can model the probability matrix as the following equations:
\( p_{A+1} \) = 0.5\( p_{A} \) + 0.1\( p_{B} \) + 0.05\( p_{C} \)
\( p_{B+1} \) = 0.2\( p_{A} \) + 0.1\( p_{B} \) + 0.9\( p_{C} \)
\( p_{C+1} \) = 0.3\( p_{A} \) + 0.8\( p_{B} \) + 0.05\( p_{C} \)
If the populations of the islands are contained in a population vector \( p_{n} \) where the first, second, and third entries are \( p_{A} \), \( p_{B} \), and \( p_{C} \) respectively, and the probabilities in the probability square matrix A, then \( p_{n+1} \) is the matrix-vector product A\( p_{n} \). In this case, each row entry of A corresponds with the probability coefficients in the above equations. The elements of A are the following (\( a_{ij} \) = element at \( i^{th} \) row and \( j^{th} \) column): \( a_{11} \) = 0.5, \( a_{12} \) = 0.1, \( a_{13} \) = 0.05, \( a_{21} \) = 0.2, \( a_{22} \) = 0.1, \( a_{23} \) = 0.9, \( a_{31} \) = 0.3, \( a_{32} \) = 0.8, \( a_{33} \) = 0.05
To get the population of the first year, \( p_{1} =Ap_{0} \) where \( p_{0} \) represents the initial population vector. Then \( p_{2} =Ap_{1} \) or \( p_{2} =A(Ap_{0}) = (AA)p_{0}=A^{2}p_{0}\). In this case, the population vector after two years is \( p_{2} = [0.285p_{A} + 0.1p_{B} + 0.1175p_{C}; 0.39p_{A} + 0.79p_{B} + \)\( 0.145p_{C}; 0.325p_{A} + 0.15p_{B} + 0.7375p_{C}] \)
For n years after the initial population, \( p_{n} = A^{n}p_{0} \) where for large \( n, A^{n} \) will stabilize and converge to the eigenvectors A which can be determined by determining the eigenvalues from the Spectral Theorem and matrix decompositions. The Spectral Theorem makes computing large powers of A much simpler than by brute force.
In the context of PageRank, the entries of the probability matrix are the probabilities of a user moving from one page to another where the probabilities are determined by the quantity of functioning links. Using Markov Chains and the stabilized eigenvector over large iterations, the output is a probability vector that ranks the importance of each webpage.