Least Squares
1. Least Squares Approximation
The least squares line for a set of data points is the line that minimizes the sum of the squares of the vertical distances from each data point to the line.
Suppose we represent the problem of finding this least squares solution as a matrix system with some \(m \times n\) matrix \(A\) and \(\mathbf{b} \in \mathbb{R}^m\). Then, we want to solve:
\begin{align} A\mathbf{x} = \mathbf{b} \notag \end{align}for some vector \(\mathbf{x}\). However, it is highly likely that this system is inconsistent (e.g. no line passes exactly through all the data points). But is there a point in \(\text{Col }A\) that is closest to \(\mathbf{b}\)?
This is just the projection of \(\mathbf{b}\) onto \(\text{Col }A\), denoted \(\mathbf{\hat{b}}\). Then, we can find some \(\mathbf{\hat{x}}\) such that \(A\mathbf{\hat{x}} = \mathbf{\hat{b}}\).
In other words, we want to find \(\mathbf{\hat{x}}\) such that \(\left\| \mathbf{b} - A\mathbf{\hat{x}} \right\|\) is as small as possible. To do this, note that since \(A\mathbf{\hat{x}} = \mathbf{\hat{b}}\) is the orthogonal projection of \(\mathbf{b}\) onto \(\text{Col }A\), the vector \(\mathbf{b} - A\mathbf{\hat{x}}\) is orthogonal to \(\text{Col }A\), so \(\mathbf{b} - A\mathbf{\hat{x}}\) is in \(\text{Nul }A^T\), and:
\begin{align} A^T\left(\mathbf{b} - A\mathbf{\hat{x}}\right) = 0 \notag \end{align}Then, distributing and rearranging, we have:
\begin{align} A^TA\mathbf{\hat{x}} = A^T\mathbf{b} \notag \end{align}Since \(A^TA\) is a square matrix, we can solve for \(\mathbf{\hat{x}}\) whenever it is invertible:
\begin{align} \boxed{\mathbf{\hat{x}} = \left(A^TA\right)^{-1}A^T\mathbf{b}} \end{align}
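As a quick numerical check of (1), here is a minimal sketch assuming NumPy is available; the matrix \(A\) and vector \(\mathbf{b}\) are made-up example data, not taken from anything above.

import numpy as np

# Made-up example data: A has linearly independent columns
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Normal equations: solve (A^T A) x_hat = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)   # x_hat is [5, -3]

# NumPy's built-in least squares solver agrees
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_hat, x_lstsq)

In practice, forming \(A^TA\) explicitly can be numerically ill-conditioned, so library routines typically solve the problem through a QR or SVD factorization instead.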
1.1. Uniqueness
We know that the system \(A^TA\mathbf{\hat{x}} = A^T\mathbf{b}\) is consistent, since \(\mathbf{\hat{b}}\) is in \(\text{Col }A\). For the solution to be unique, however, \(A^TA\) must be invertible.
In order to determine if \(A^TA\) is invertible, we can consider whether the system \(A^TA\mathbf{x}=0\) has a nontrivial solution. Treating \(A\mathbf{x}\) as a single vector, we can say that:
\begin{align} A\mathbf{x} &\in \text{Nul }A^T \notag \\ \Rightarrow A\mathbf{x} &\in \left(\text{Col }A\right)^{\perp} \notag \end{align}At the same time, though, we know that \(A\mathbf{x} \in \text{Col }A\). Since the only vector in both a subspace and its orthogonal complement is \(\mathbf{0}\), this forces \(A\mathbf{x}=\mathbf{0}\).
The problem is then equivalent to asking if \(A\mathbf{x}=0\) has only the trivial solution. Thus, there is a unique solution \(\mathbf{\hat{x}}\) exactly when the columns of \(A\) are linearly independent.
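As a small illustration of this condition, again assuming NumPy, a matrix with linearly dependent columns produces a singular \(A^TA\):

import numpy as np

# Made-up example: second column is twice the first, so the columns are dependent
A_dep = np.array([[1.0, 2.0],
                  [2.0, 4.0],
                  [3.0, 6.0]])

print(np.linalg.matrix_rank(A_dep))       # 1, fewer than the number of columns
print(np.linalg.det(A_dep.T @ A_dep))     # effectively 0, so A^T A is not invertible

In this case the normal equations are still consistent, but they have infinitely many solutions \(\mathbf{\hat{x}}\), all of which give the same projection \(A\mathbf{\hat{x}} = \mathbf{\hat{b}}\).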
1.2. QR Factorization
If we know the QR factorization of \(A\), where \(Q\) has orthonormal columns (so \(Q^TQ = I\)) and \(R\) is upper triangular and invertible, we can rewrite (1) like so:
\begin{align} \mathbf{\hat{x}} &= \left[(QR)^TQR\right]^{-1}(QR)^T\mathbf{b} \notag \\ &= \left[R^TQ^TQR\right]^{-1}(QR)^T\mathbf{b} \notag \\ &= \left[R^TR\right]^{-1}R^TQ^T\mathbf{b} \notag \\ &= R^{-1}\left(R^T\right)^{-1}R^TQ^T\mathbf{b} \notag \end{align}Thus, since \(\left(R^T\right)^{-1}R^T\) cancels out, we have:
\begin{align} \boxed{\mathbf{\hat{x}} = R^{-1}Q^T\mathbf{b}} \end{align}
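Here is a minimal sketch of (2), assuming NumPy and reusing the same made-up \(A\) and \(\mathbf{b}\) as before:

import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Reduced QR factorization: Q is 3x2 with orthonormal columns, R is 2x2 upper triangular
Q, R = np.linalg.qr(A)

# x_hat = R^{-1} Q^T b, computed by solving R x_hat = Q^T b
x_hat = np.linalg.solve(R, Q.T @ b)   # again [5, -3]

Solving through \(R\) avoids forming \(A^TA\) at all, which is one reason QR-based solvers are preferred numerically.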
2. Least Squares Line
We now want to find the least squares line of the form \(y = B_0 + B_1x\) for a set of data points \(\left(x_i,y_i\right)\). Writing one equation per data point, we want to solve the system:
\begin{align} B_0 + B_1x_1 &= y_1 \notag \\ B_0 + B_1x_2 &= y_2 \notag \\ &\vdots \notag \\ B_0 + B_1x_n &= y_n \notag \end{align}Writing this as a matrix equation, we have:
\begin{align} \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} B_0 \\ B_1 \end{bmatrix} &= \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \notag \\ \notag \\ XB &= Y \end{align}where \(X\) is known as the design matrix.
This system, however, is likely inconsistent. Therefore, we can use least squares to choose \(B\) such that \(XB\) is the best approximation for \(Y\). Using (1), then, we know that:
\begin{align} B = \left(X^TX\right)^{-1}X^TY \end{align}
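A minimal sketch of (4), assuming NumPy; the \(\left(x_i, y_i\right)\) pairs below are made-up example data:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0, 4.0])

# Design matrix with rows [1, x_i]
X = np.column_stack([np.ones_like(x), x])

# B = (X^T X)^{-1} X^T Y, computed by solving the normal equations
B = np.linalg.solve(X.T @ X, X.T @ y)   # B_0 = B_1 = 0.9, so y = 0.9 + 0.9x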
2.1. Least Squares Parabola
Say we want to fit a parabola using least squares instead. Then, we want to find some function of the form \(y = B_0 + B_1x + B_2x^2\), which means our matrix system becomes:
\begin{align} \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix} \begin{bmatrix}B_0 \\ B_1 \\ B_2 \end{bmatrix} = \begin{bmatrix}y_1 \\ y_2 \\ \vdots \\ y_n\end{bmatrix} \notag \end{align}The least squares solution is still given by (4), just with this new design matrix.
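The same made-up data can be fit with a parabola by adding a quadratic column to the design matrix (again assuming NumPy):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0, 4.0])

# Design matrix with rows [1, x_i, x_i^2]
X = np.column_stack([np.ones_like(x), x, x**2])

# Same formula as (4), now returning [B_0, B_1, B_2]
B = np.linalg.solve(X.T @ X, X.T @ y)   # [1.15, 0.15, 0.25], i.e. y = 1.15 + 0.15x + 0.25x^2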