Least-squares and fitting curves through points
Given a bunch of data, how can we find a representative curve that approximates the data?
Let us understand this question through an example. Suppose that you’re out there trying to measure the resistance of a resistor by studying its Voltage vs Current using the relation \(V=I R\). Unfortunately, your equipment is not up to the standard you’d expect: it produces noisy data and you observe the following
{
"data": [
{
"function": "randn(2*x, 1)",
"xmin": 0,
"xmax": 5,
"points": 11,
"mode": "markers",
"name": "Measured Data",
"marker": {
"size": 10,
"color": "#a463bf"
}
}
],
"layout": {
"title": {
"text": "Voltage vs Current (Noisy Measurements)"
},
"xaxis": {
"title": "Current (A)"
},
"yaxis": {
"title": "Voltage (V)"
}
}
}
If only this were a nice straight line, then using \(V=I R\) you could immediately extract the resistance by finding its slope. However, the real world is messy and chaotic. So instead of wishing for what could have been, let’s try to work with what we have: what if we could extract a straight line from the data? A “line of best fit” that passes as close to the points as possible. If we have that line, then the resistance is approximately its slope. So a natural question to ask is: how can one find this line?
Let’s denote this arbitrary line, i.e. our model, as a function \(f(x;m,c)\). A model is a function of your input variables (your control variables) and some parameters you want to estimate. Usually a model is motivated by the physics and some underlying principles, but importantly it is something that you have to declare. In our case, the input variable is denoted by \(x\) (the current here) and our parameters are the familiar slope \(m\) and intercept \(c\), defining \(f(x; m,c)\) through
\[f(x;m,c)= m x +c.\]The two parameters \((m,c)\) uniquely determine a line, and therefore our special model is called a linear model. In principle, there are infinitely many values of \((m,c)\), and it is our goal to comb through this forest of \(m\) and \(c\) to find the correct \(m_0\) and \(c_0\) for which the line \(f(x;m_0,c_0)\) passes “closest” to our data points. On that note, what does “closest” even mean?
What does closest mean?
Suppose you’ve collected \(N\) data points \(y_i\), one for every input (control variable) value \(x_i\), where \(i=1,2,\dots,N\). Then for every data point we can ask for the \(m\) and \(c\) that minimize the distance between that data point and our model, given by \(\vert y_i - f(x_i; m,c)\vert\). This distance has a name: it is called the residual. We can mathematically write this process of minimizing the distance between our model and our data, over all possible values of \(m\) and \(c\), as
\[\min_{m, c} \{\vert y_i-f(x_i;m ,c)\vert\} \text{ for all } i=1,2,\dots, N.\]This is fine in theory but not in practice, as you’d have to sample all the infinitely many values of \(m\) and \(c\), which could take forever. However, you remember from Calculus 1 that an optimization problem can be written in terms of a derivative (set to \(0\)), so you might wonder if there is a way to apply this idea to our problem.
Minimization using Calculus
It turns out you can rewrite this in terms of a function whose derivative, set to \(0\), gives us the optimal solution, but we need to make a few compromises:
- Change \(\vert X \vert\) to \(X^2\), since the absolute value function is notoriously non-differentiable at \(X=0\) and
- Minimize the sum of the \(\text{distance}^2\) of all the data points instead of minimizing the distance individually.
Together, these two compromises are equivalent to minimizing the following objective function \(\mathcal{L}\) over all possible values of \(m\) and \(c\):
\[\mathcal{L}(m,c)=\sum_{i=1}^{N} [y_i- f(x_i; m,c)]^2.\]The above formalism is called the least-squares method, since you want to minimize the sum of squared distances between the measured data points and the model.
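As a concrete (if brute-force) illustration, here is a minimal sketch in Python that evaluates \(\mathcal{L}(m,c)\) on synthetic noisy data (the measurements and the “true” slope of \(2\,\Omega\) are made up for illustration) and scans a coarse grid of candidate \((m,c)\) values:

```python
import numpy as np

# Hypothetical noisy V = I*R measurements (true resistance R = 2 ohms, made up here).
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 11)                # current (A)
y = 2.0 * x + rng.normal(0, 1, x.size)   # voltage (V) with noise

def objective(m, c):
    """Sum of squared residuals L(m, c) for the linear model f(x) = m*x + c."""
    return np.sum((y - (m * x + c)) ** 2)

# A coarse brute-force scan over candidate (m, c) values:
# workable for two parameters, but hopeless as models grow.
ms = np.linspace(0, 4, 81)
cs = np.linspace(-2, 2, 81)
L = np.array([[objective(m, c) for c in cs] for m in ms])
i, j = np.unravel_index(np.argmin(L), L.shape)
print(f"grid-search estimate: m ≈ {ms[i]:.2f}, c ≈ {cs[j]:.2f}")
```

Even for just two parameters this scan evaluates thousands of candidate lines, which is exactly why the calculus shortcut that follows is so handy.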
In our case, minimizing this least-squares objective function is the same as finding the solution \((m,c)\) at which the derivatives vanish.
What kind of derivatives?
If our model had only one parameter \(\theta\), then it would be an ordinary derivative \(\frac{d}{d\theta}\). However, since we’ve got two parameters, we need to consider the partial derivatives with respect to both, i.e. we solve
\[\begin{align} \frac{\partial}{\partial m} \mathcal{L}(m,c)&=0\\ \frac{\partial}{\partial c} \mathcal{L}(m,c)&=0 \end{align}\]Solving both simultaneously, we’ll be able to find the optimum solution \(m_0\) and \(c_0\) for which our line \(f(x;m_0,c_0)=m_0 x +c_0\) is the closest to our observed data. In the special case of a linear model, we can find the unique solution exactly (try working it out!).
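If you’d like to check your answer: carrying out the two derivatives for the linear model explicitly gives
\[\begin{align} \frac{\partial \mathcal{L}}{\partial m}=-2\sum_{i=1}^{N} x_i\,[y_i-(m x_i+c)]&=0,\\ \frac{\partial \mathcal{L}}{\partial c}=-2\sum_{i=1}^{N} [y_i-(m x_i+c)]&=0, \end{align}\]and solving this pair of linear equations yields
\[m_0=\frac{N\sum_i x_i y_i-\sum_i x_i\sum_i y_i}{N\sum_i x_i^2-\left(\sum_i x_i\right)^2},\qquad c_0=\bar{y}-m_0\,\bar{x},\]where \(\bar{x}\) and \(\bar{y}\) denote the sample means of the \(x_i\) and \(y_i\).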
Usually these calculations are done in the blink of an eye through existing packages like curve_fit() from scipy.optimize in Python. Here’s what the least squares fit looks like for our noisy data:
{
"data": [
{
"function": "randn(2*x, 1)",
"xmin": 0,
"xmax": 5,
"points": 15,
"mode": "markers",
"name": "Measured Data",
"marker": {
"size": 10,
"color": "#a463bf"
}
},
{
"fitFromTrace": 0,
"points": 100,
"mode": "lines",
"name": "Least Squares Fit",
"line": {
"color": "#000101ff",
"width": 3,
"dash": "solid"
}
}
],
"layout": {
"title": {
"text": "Voltage vs Current with Least Squares Fit"
},
"xaxis": {
"title": "Current (A)"
},
"yaxis": {
"title": "Voltage (V)"
}
}
}
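For completeness, here is a minimal sketch of such a fit with curve_fit, assuming synthetic noisy measurements in place of the real ones (the “true” resistance of \(2\,\Omega\) is made up for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

def f(x, m, c):
    """Linear model f(x; m, c) = m*x + c."""
    return m * x + c

# Synthetic stand-in for the noisy V-I measurements (true slope R = 2 ohms, made up here).
rng = np.random.default_rng(1)
current = np.linspace(0, 5, 15)
voltage = 2.0 * current + rng.normal(0, 1, current.size)

# curve_fit minimizes the sum of squared residuals and returns the
# best-fit parameters along with their covariance matrix.
(m0, c0), pcov = curve_fit(f, current, voltage)
print(f"estimated resistance R ≈ {m0:.2f} ohm (intercept c ≈ {c0:.2f})")
```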
Remark
What lies beyond the linear model?
After dabbling with the simple linear model, we may wonder what lies beyond it: what if we’d like to fit a parabola through the data points? What about an inverse square? For cases where linearization is possible, always linearize: if your model is \(f(x; a,b)= a x^2 +b\), set \(X=x^2\) and you have a linear model \(g(X; a,b)= a X +b\). However, in cases where such a linearization is not possible and your model is something like \(a x^2 +b x +c\), you continue systematically and solve the set of equations
\[\begin{align} \frac{\partial}{\partial a} \mathcal{L}(a,b,c)&=0\\ \frac{\partial}{\partial b} \mathcal{L}(a,b,c)&=0\\ \frac{\partial}{\partial c} \mathcal{L}(a,b,c)&=0 \end{align}\]
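As a sketch of this quadratic case (with made-up data and coefficients): since \(a x^2 + b x + c\) is still linear in its parameters, these equations again have an exact solution, and numpy’s polyfit solves the corresponding least-squares problem for you:

```python
import numpy as np

# Synthetic data from a hypothetical quadratic law y = 1.5*x**2 - 2*x + 3, plus noise.
rng = np.random.default_rng(2)
x = np.linspace(0, 5, 20)
y = 1.5 * x**2 - 2.0 * x + 3.0 + rng.normal(0, 1, x.size)

# np.polyfit returns the least-squares coefficients, highest degree first.
a, b, c = np.polyfit(x, y, deg=2)
print(f"a ≈ {a:.2f}, b ≈ {b:.2f}, c ≈ {c:.2f}")
```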
Model with multiple parameters
Suppose you had a general model with \(M\) parameters \(\theta_1,\theta_2,\dots,\theta_M\) that we can gather into an array or vector, commonly denoted with a bold \(\boldsymbol{\theta}\),
\[\boldsymbol{\theta}=(\theta_1,\theta_2,\dots, \theta_M).\]Then, for a model \(f(x; \boldsymbol{\theta})\), we solve the set of equations
\[\begin{align} \frac{\partial}{\partial \theta_1} \mathcal{L}(\boldsymbol{\theta})&=0\\ \frac{\partial}{\partial \theta_2} \mathcal{L}(\boldsymbol{\theta})&=0\\ &\;\;\vdots \\ \frac{\partial}{\partial \theta_M} \mathcal{L}(\boldsymbol{\theta})&=0 \end{align}\]for the optimum solution \(\boldsymbol{\hat{\theta}}\), which minimizes our objective function defined by
\[\mathcal{L}(\boldsymbol{\theta})= \sum_{i=1}^{N} [y_i - f(x_i; \boldsymbol{\theta})]^2.\]Our optimal solution \(\boldsymbol{\hat{\theta}}\) brings our model \(f(x; \boldsymbol{\hat{\theta}})\) the closest to our data points.
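As a sketch of this general recipe in code, one option is scipy.optimize.least_squares, which takes the residuals \(r_i(\boldsymbol{\theta})=y_i-f(x_i;\boldsymbol{\theta})\) and a starting guess for \(\boldsymbol{\theta}\); the two-parameter exponential model and its data below are made up for illustration (for nonlinear models like this there is generally no closed-form solution, so the equations are solved numerically):

```python
import numpy as np
from scipy.optimize import least_squares

def model(x, theta):
    """A made-up two-parameter model f(x; theta) = theta_1 * exp(-theta_2 * x)."""
    return theta[0] * np.exp(-theta[1] * x)

def residuals(theta, x, y):
    """r_i(theta) = y_i - f(x_i; theta); least_squares minimizes the sum of r_i**2."""
    return y - model(x, theta)

# Synthetic data with true parameters (4.0, 0.8), chosen only for this example.
rng = np.random.default_rng(3)
x = np.linspace(0, 5, 30)
y = model(x, [4.0, 0.8]) + rng.normal(0, 0.1, x.size)

# Start from an initial guess and let the optimizer comb the parameter space.
fit = least_squares(residuals, x0=[1.0, 1.0], args=(x, y))
print("estimated parameters:", fit.x)
```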
Now you’ve figured out how to fit data and you happily move on with your life, having successfully extracted the resistance, until your friend comes along, performs the same experiment, and observes the following
{
"data": [
{
"function": "randn(2*x, 1)",
"xmin": 0,
"xmax": 5,
"points": 11,
"mode": "markers",
"name": "Measured Data",
"marker": {
"size": 10,
"color": "#a463bf"
}
}
],
"layout": {
"title": {
"text": "Voltage vs Current (Friend's)"
},
"xaxis": {
"title": "Current (A)"
},
"yaxis": {
"title": "Voltage (V)"
}
}
}
You notice that their data is not exactly the same as yours (look above!). As special as your friend is to you, their data isn’t. Indeed, anyone performing the same set of measurements (called a trial) would get different data \(y_i\). At this stage, you realize that your data set is one out of the infinitely many possible realizations, so it is natural to somehow factor that uncertainty into your measurement.
How do we do that, you say? Through something called the \(\chi^2\) test. What the heck is \(\chi^2\) anyway?