Machine learning is growing ever more prominent in society through models such as support vector machines, neural networks, random forests, and cluster analysis. All of these models rest on a foundation of mathematics, particularly linear algebra and calculus. The simplest and most widely used model is linear regression, which takes a collection of n-dimensional data points and finds the closest linear relationship between them. But how does this work? Does it simply loop through all possible linear equations and pick whichever best fits the data? Of course not: this would be impractical, since there are infinitely many linear equations and such an algorithm would never finish.

Immediately, we can turn to mathematics and its unique properties to optimize this problem. Assuming the data contains distinct \(x\) values for 2-dimensional data points \((x, y)\), we can use linear algebra to find the line that best fits them. First, we organize the \(x\) values into a vector \(\mathbf{X}\) and the \(y\) values into a vector \(\mathbf{Y}\), ensuring that each entry of \(\mathbf{Y}\) sits in the same position as its corresponding \(x\) value in \(\mathbf{X}\). We can then find the correlation coefficient \(r\) as the cosine of the angle between \(\mathbf{X}\) and \(\mathbf{Y}\): \(r=\cos \theta=\frac{\mathbf{X} \cdot \mathbf{Y}}{\|\mathbf{X}\|\,\|\mathbf{Y}\|}\). Though this step is not necessary for finding the best-fit line, it provides useful information about the strength and direction of the relationship between the \(x\) and \(y\) values. From there we can construct a plane \(\mathrm{P}\) consisting of every combination of \(y\) values that forms a perfect line with the given \(\mathbf{X}\). Using basic knowledge of linear equations, we know that a line can be horizontal, so the vector \(\mathbf{1}\) (in which all the elements are 1) must lie in \(\mathrm{P}\). Thus \(\mathrm{P}\) is spanned by \(\mathbf{X}\) and \(\mathbf{1}\): every vector in \(\mathrm{P}\) is a linear combination of \(\mathbf{X}\) and \(\mathbf{1}\), meaning there is a linear equation that perfectly fits \(\mathbf{X}\) paired with that specific combination of \(y\) values. In reality, however, it is rarely possible to connect the data points with a single straight line. For this reason, we have to find the vector in \(\mathrm{P}\) that is closest to \(\mathbf{Y}\).
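The cosine formula above can be sketched in a few lines of NumPy. The data points here are hypothetical, chosen only for illustration; note that the value computed this way is the raw cosine of the angle between the two vectors exactly as the formula states.

```python
import numpy as np

# Hypothetical sample data (illustration only, not from the article)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Cosine of the angle between X and Y, per the formula above:
# r = (X · Y) / (||X|| ||Y||)
r = np.dot(X, Y) / (np.linalg.norm(X) * np.linalg.norm(Y))
```

For data with a strong positive linear trend such as this, \(r\) comes out close to 1.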
Though it’s possible to incorporate calculus to minimize the distance between \(\mathrm{P}\) and \(\mathbf{Y}\), we can instead rely on the basic fact that the shortest path between two objects is a straight line. In this case, we want the vector in \(\mathrm{P}\) that lies in the shadow cast by \(\mathbf{Y}\) onto \(\mathrm{P}\), and a projection accomplishes exactly this. However, to find \(\mathrm{Proj}_{\mathrm{P}}(\mathbf{Y})\) we must find an orthogonal basis for \(\mathrm{P}\). To keep things simple, we’ll just find the vector \(\hat{X}\) that lies in \(\mathrm{P}\) and is orthogonal (or perpendicular) to \(\mathbf{1}\). To do this, we solve \(\hat{X}=\mathbf{X}-\mathrm{Proj}_{\mathbf{1}}(\mathbf{X})\), which gives us \(\hat{X}=\mathbf{X}-\bar{x}\mathbf{1}\), where \(\bar{x}\) is the mean of all the \(x\) values. Now we can compute \(\mathrm{Proj}_{\mathrm{P}}(\mathbf{Y})\):
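The construction of \(\hat{X}\) can be checked numerically. This is a minimal sketch with hypothetical \(x\) values; projecting \(\mathbf{X}\) onto \(\mathbf{1}\) recovers the mean, and subtracting it leaves a vector orthogonal to \(\mathbf{1}\).

```python
import numpy as np

# Hypothetical x values (illustration only)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ones = np.ones_like(X)

# Proj_1(X) = ((X · 1) / (1 · 1)) * 1 — the scalar factor is the mean of x
x_bar = np.dot(X, ones) / np.dot(ones, ones)
X_hat = X - x_bar * ones

# X_hat is orthogonal to 1: their dot product is (numerically) zero
orthogonality = np.dot(X_hat, ones)
```

Here `x_bar` equals `np.mean(X)`, which is exactly the claim \(\hat{X}=\mathbf{X}-\bar{x}\mathbf{1}\).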

\(\mathrm{Proj}_{\mathrm{P}}(\mathbf{Y})=\mathrm{Proj}_{\hat{X}}(\mathbf{Y})+\mathrm{Proj}_{\mathbf{1}}(\mathbf{Y})=\mathrm{Proj}_{\hat{X}}(\mathbf{Y})+\bar{y}\mathbf{1}=c\hat{X}+\bar{y}\mathbf{1}\), where \(c\) is the constant obtained by computing \(\frac{\hat{X} \cdot \mathbf{Y}}{\hat{X} \cdot \hat{X}}\). Since \(\hat{X}=\mathbf{X}-\bar{x}\mathbf{1}\), we have \(\mathrm{Proj}_{\mathrm{P}}(\mathbf{Y})=c(\mathbf{X}-\bar{x}\mathbf{1})+\bar{y}\mathbf{1}=c\mathbf{X}+(\bar{y}-c\bar{x})\mathbf{1}\). If we substitute \(c\) and \(\bar{y}-c\bar{x}\) with \(m\) and \(b\) respectively, we get the linear combination representing the best-fit line: \(m\mathbf{X}+b\mathbf{1}\). In algebra, this translates to \(y=mx+b\).