What the heck is \(\chi^2\) anyway?
How do you fit a curve through data with errors? What’s a statistic? And most importantly, what is being squared?
These are important questions that you might encounter when you look up curve fitting! This post is a continuation of how to fit data: I answered the first question, for the case where the data points carry no errors, in Least-squares and fitting curves through points. Before getting into what a statistic is, let’s first begin by answering the most common and important question here:
What is being squared?
Nothing! This is a trick question used by cruel people. Unfortunately, it is a great example of an abuse of notation: \(\chi^2\) is a variable, just like \(x\) or \(y\). We could have called it \(\chi\), but for historical reasons, Pearson called it \(\chi^2\).
Now that we’ve tackled this important question, let us motivate the statistic with the same example we used previously:
You (and your friend) are trying to measure the resistance of a resistor by studying its voltage vs current using the relation \(V=IR\). Unfortunately, your equipment isn’t the greatest and you get noisy data that is “roughly” linear. Adding to that, your two sets of measurements aren’t identical, i.e. not only do the points not fall on a line, they are different for the both of you!
{
"data": [
{
"function": "randn(2*x, 1)",
"xmin": 0,
"xmax": 5,
"points": 15,
"mode": "markers",
"name": "Trial 1",
"marker": {
"size": 10,
"color": "#a463bf"
}
}
],
"layout": {
"title": {
"text": "First Measurement"
},
"xaxis": {
"title": "Current (A)"
},
"yaxis": {
"title": "Voltage (V)"
}
}
}
{
"data": [
{
"function": "randn(2*x, 1)",
"xmin": 0,
"xmax": 5,
"points": 15,
"mode": "markers",
"name": "Trial 2",
"marker": {
"size": 10,
"color": "#a463bf"
}
}
],
"layout": {
"title": {
"text": "Second Measurement"
},
"xaxis": {
"title": "Current (A)"
},
"yaxis": {
"title": "Voltage (V)"
}
}
}
It seems that there isn’t just an uncertainty that prevents the data from landing perfectly on a line; in addition, each data point carries its own uncertainty.
Earlier we laid out the ideas behind curve fitting, but now we can answer the more physical question of fitting data with uncertainties, which is where \(\chi^2\) comes into play. The \(\chi^2\) statistic (pronounced chi-squared) is a metric used to assess a fit to a collection of data with errors. The introduction of errors changes the way we think of the optimization problem! Keep in mind the following remark before we proceed:
Remark
Unlike least squares, which was purely an optimization problem that fits curves to a given set of data, \(\chi^2\) fitting additionally accounts for the errors the data might carry. Basically, \(\chi^2\) fitting is an optimization problem with statistics.
To work with \(\chi^2\), we first need some important ideas from statistics so we’ll start here. If you want to skip to using it, then click how to use \(\chi^2\) goodness of fit.
Some statistics overview
Dr. Strange and the very many different outcomes
You put on your thinking cap and recognize that each data point you measure is really one possible outcome that you experience. Like Dr. Strange using his ability to view the various possible multiverses, you view each data point you see as a specific outcome, and the collection of all possible outcomes together is what you would get if you repeated the experiment infinitely many times. In this sense, we say that the data points are sampled from a distribution which controls the probability of each particular value showing up. Mathematically, what this means is that all the data points \(y_i\) are random variables.
What this implies is that the more data you measure, the more information you get about each data point and therefore the more accurate your results become. The true value of each data point should be the mean of all the data you have for that particular data point. The variation is its standard error of the mean, and you plot them together with error bars.
To get an intuitive feel for what this means, let’s think about it in terms of an Excel sheet. If you imagine an Excel sheet keeping track of your measured data, with each row corresponding to a trial number and each column holding the measured values for the \(i^{\rm th}\) data point \(y_i\), then isolating a column (say the first column, for \(y_1\)) over all trials gives you the different possible values that the data point \(y_i\) can take. How do we use this information? We average!
The idea is to collapse the information of multiple plots by aggregating a specific column and plotting the corresponding averages as data points \(\bar{y}_i\)
{
"data": [
{
"function": "randn(2*x, 1)",
"xmin": 0,
"xmax": 5,
"points": 15,
"mode": "markers",
"name": "Data ",
"marker": {
"size": 10,
"color": "#a463bf"
},
"error_y": {
"type": "constant",
"value": 1.0,
"visible": true,
"color": "#a463bf",
"thickness": 2,
"width": 4
}
}
],
"layout": {
"title": {
"text": "Voltage vs Current with Error Bars"
},
"xaxis": {
"title": "Current (A)"
},
"yaxis": {
"title": "Voltage (V)"
}
}
}
where the error bars of size \(\sigma_{\bar{y}_i}\) are the standard error of the mean for each column.
Population and sample data sets, standard errors and the truth!
In statistics, a lot of similar sounding terms get thrown around, which can lead to confusion. Here I would like to clarify the difference between the population and the sample data set. Quantities such as the “population mean” or “sample standard deviation” are known as measured statistics.
In a nutshell, the population data set is the true data set: everyone and everything that you seek to study is present in it. Its mean (called the population mean, denoted by \(\mu\)) and standard deviation (called the population standard deviation, denoted by \(\sigma\)) are the true or correct statistics, i.e. they are the underlying truth. However, in real life, it’s often not possible to get the whole collection, so we work with a small subset called the sample data set, i.e. your data set. Ultimately, everything we do and measure serves to estimate the true statistics. To hammer down these points, remember:
-
The sample mean \(\bar{s}\) corresponds to the mean of your column; it is what we call an “estimator” of the true population mean.
-
The sample standard deviation \(s\) is the standard deviation of your column computed using Bessel’s correction; it is also an “estimator” of the true population standard deviation.
So far, all of this is purely based on the data you collect and has nothing to do with the population mean or variance directly, other than approximating or estimating those true values. Now comes the sample standard error (or standard error for short) that acts as a bridge between your data and the whole data set:
-
The sample standard error is a sort of bridge between the population and sample data sets: it tells you roughly how far your sample mean is from the actual, true population mean. For \(N\) data points, it is defined by
\[\sigma_{\bar{s}}= \frac{s}{\sqrt{N}}\]
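To make these estimators concrete, here is a minimal NumPy sketch (the measurement values are made up for illustration) that computes the sample mean, the sample standard deviation with Bessel’s correction, and the standard error of the mean for a single column of trials:

```python
import numpy as np

# Hypothetical column of repeated measurements of one data point y_i
# (one column of the "Excel sheet" described above).
column = np.array([2.1, 1.8, 2.4, 2.0, 1.9, 2.3])
N = len(column)

sample_mean = column.mean()               # estimator of the population mean
sample_std = column.std(ddof=1)           # Bessel's correction: divide by N - 1
standard_error = sample_std / np.sqrt(N)  # sigma_s_bar = s / sqrt(N)

print(f"mean = {sample_mean:.3f} +/- {standard_error:.3f}")
```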
Now that we know of a method to encode the information from multiple trials through aggregating, we notice that for different data sets we get different means \(\bar{y}_i\), i.e. the means are “random”. This means that the means themselves are random variables. So we ask:
What distribution does the data point \(\bar{y}_i\) come from?
This is a genuinely hard question that is not possible to answer exactly. The simplest answer is to make an educated guess. However, given certain reasonable conditions, something truly spectacular happens: you can assume that the data point \(\bar{y}_i\) comes from a Normal distribution through something called the Central Limit Theorem.
Before we get into this, let’s think about the weight of this conclusion. Out of the many distributions you know viz. the humble uniform distribution, or the Poisson (discrete) or Exponential (continuous) or the many many more distributions you most likely haven’t heard of1 like the Beta distribution or the Continuous Bernoulli Distribution, we arrive at the seemingly innocent Normal distribution!
This Normal distribution or the “bell shaped curve” with mean \(\mu\) and variance \(\sigma^2\) is given by the following equation
\[\mathcal{N}(\mu,\sigma^2)(x):= \frac{1}{\sqrt{2\sigma^2 \pi}}e^{\frac{-(x-\mu)^2}{2\sigma^2}}\]and is the renowned “Gaussian distribution” but normalized
\[\int_{-\infty}^{\infty} \mathcal{N}(\mu,\sigma^2)(x)dx=1.\]The Central Limit Theorem implies that the sample mean we were measuring \(\bar{y}_i\) using \(N\) trials comes from a normal distribution represented mathematically as
\[\bar{y}_i \sim \mathcal{N}\left(\mu_i,\frac{\sigma_i^2}{N}\right),\]where \(\mu_i\) is the true population mean and \(\sigma_i\) the true population standard deviation. Mathematically, this means that our measured sample mean follows a normal distribution whose mean is the true population mean and whose standard deviation, \(\sigma_i/\sqrt{N}\), is the standard error of the mean. In practice, the population mean and standard deviation are unknown, but we can approximate them using our data.
For our fitting purposes, we only need to estimate the standard error of the mean since the “mean” for our data points is what we want our optimization to pick out.
\[\frac{\sigma^2}{N}:=\sigma^2_{\mu}\approx \sigma^2_{\bar{s}}\]
Central Limit Theorem: Why the Normal distribution?
The Central Limit Theorem (CLT) is what we used to justify the emergence of the Normal distribution. This theorem is colloquially stated along the lines of “more data means more bell shaped” but this isn’t quite the correct statement of the theorem and can lead to a lot of misunderstandings. There are a few conditions for the CLT to hold true, and they are nicely discussed in 3B1B’s “But what is the Central Limit Theorem?” that I will paraphrase below:
-
Independence: The data you collect should be independent of each other, i.e. measuring one does not affect the next measurement.
-
Same distribution: The data you collect, i.e. the data points in each column of your Excel sheet, should come from the same distribution.
-
Finite variance: Your error bars must be finite, i.e more data does not increase the variance you observe for a single column.
Denote by \(y_i^{(j)}\) the data points in the \(i^{\rm th}\) column and \(j^{\rm th}\) row. Given the above conditions, CLT states that in the limit as you perform many many many trials (i.e the number of rows \(N\) goes to infinity),
The ratio of the difference between the quantity
\[\frac{\sum_{j=1}^N y_i^{(j)}}{N},\]which is just the average of our column, and the true population mean, divided by the population standard error, follows
\[\lim_{N\to \infty} \frac{\frac{\sum_{j=1}^N y_i^{(j)}}{N}- \mu_i}{\frac{\sigma_i}{\sqrt{N}}}\sim \mathcal{N}(0,1),\]i.e. a Normal distribution with mean zero and standard deviation one!
This means that the average of a fixed column, in the limit of many data points, follows a normal distribution around the population mean \(\mu_i\) with a standard deviation which is the standard error of the mean:
\[\lim_{N\to \infty} \frac{\sum_{j=1}^N y_i^{(j)}}{N}\sim \mathcal{N}\left(\mu_i,\frac{\sigma_i^2}{N}\right)=\mathcal{N}\left(\mu_i,\sigma^2_{\mu}\right)\]All that jazz just to say that for our fitting purposes, we need to assume (with reasonable doubt) that the errors we measure or estimate are Gaussian errors, and of course, that we need to account for them.
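Here is a small simulation sketch of the CLT in action (the uniform distribution and the sample sizes are arbitrary choices): even though each individual measurement is decidedly non-Gaussian, the column averages cluster around the population mean with a spread given by the standard error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: a (very non-Gaussian) uniform distribution on [0, 1]
# with mean mu = 0.5 and standard deviation sigma = 1/sqrt(12).
mu, sigma = 0.5, 1 / np.sqrt(12)
N = 50            # rows: trials per column
repeats = 10_000  # how many times we repeat the whole experiment

# Each row of `samples` is one column of N independent trials.
samples = rng.uniform(0, 1, size=(repeats, N))
means = samples.mean(axis=1)

# CLT prediction: the means cluster around mu with spread sigma / sqrt(N).
print(means.mean(), means.std(ddof=1), sigma / np.sqrt(N))
```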
Since we’re right at the intersection of pure mathematics where everything is “perfect” and experimental data where nature is unwieldy, I want to caution you to always challenge your beliefs with a relevant quote2 from the French physicist G. Lippmann (Nobel Prize 1908) paraphrased by the prominent mathematician H. Poincaré:
“Tout le monde y croit cependant, me disait un jour M. Lippmann, car les expérimentateurs s’imaginent que c’est un théorème de mathématiques, et les mathématiciens que c’est un fait expérimental.”
“Everyone is sure of this [that errors are normally distributed], Mr. Lippmann told me one day, since the experimentalists believe that it is a mathematical theorem, and the mathematicians that it is an experimentally determined fact.” -G. Lippmann to H. Poincaré (Calcul des probabilités 1912, p. 171)
How \(\chi^2\) goodness of fit is used
But how do we account for these errors in our fit? If we didn’t have errors for each data point, then we’d stick to our usual least squares method and minimize the sum of squared residuals. Essentially, we’re trying to make the distance between our model \(f(x;\boldsymbol{\theta})\) and our data points \(y_i\) small, but with errors, what does small even mean now, i.e. small with respect to what?
In the case where data points have errors, a natural “scale” to compare against is the size of the fluctuations: the error \(\sigma_i\) of each data point. The error is estimated using multiple trials for a data point, or guesstimated based on the order of magnitude. With this scale in mind, we minimize a sort of modified or scaled least squares, which is what we call \(\chi^2\), defined as
\[\begin{equation}\chi^2 :=\sum_{i=1}^N \left(\frac{y_i- f(x_i;\boldsymbol{\theta})}{\sigma_i}\right)^2.\end{equation}\]If we’re fitting a line to our data points with errors (as in the example of \(V\) vs \(I\)), then we’d use \(f(x;m,c):=m x +c\) with the free parameters of the line, \(m\) (slope) and \(c\) (intercept), where \(\boldsymbol{\theta}=(m,c)\). Since the errors \(\sigma_i\) are just numbers (estimated or guesstimated) that enter as a multiplicative factor on the residuals, the procedure to find the optimal solution is identical to what we did in least squares and fitting curves through points. Since \(\chi^2\) is just least squares accounting for the error of each data point, in python we’d still use scipy.optimize.curve_fit() with sigma (make sure to set absolute_sigma=True). Earlier, the smaller our least squares, the better our data fit the model, so we might naturally think that a smaller \(\chi^2\) means better results; the question is whether we would still be correct.
Note: scipy.optimize.curve_fit() does not return \(\chi^2\); it returns the parameters of your model, i.e. your fit. You’d have to compute \(\chi^2\) yourself by plugging those parameters into Eq. (1), as in the sketch below.
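As a concrete sketch (with made-up current and voltage numbers), this is one way to fit the line with scipy.optimize.curve_fit and then compute \(\chi^2\) from Eq. (1) by hand:

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, m, c):
    """Linear model f(x; m, c) = m*x + c for the V vs I fit."""
    return m * x + c

# Hypothetical data: currents, measured voltages and their errors sigma_i.
I = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
V = np.array([1.1, 2.3, 2.8, 4.2, 5.1, 5.8, 7.2, 7.9])
sigma = np.full_like(V, 0.5)

# sigma passes the per-point errors; absolute_sigma=True treats them as
# absolute errors rather than relative weights.
theta_hat, cov = curve_fit(line, I, V, sigma=sigma, absolute_sigma=True)

# curve_fit does not return chi^2, so compute it from Eq. (1).
residuals = (V - line(I, *theta_hat)) / sigma
chi2 = np.sum(residuals**2)
print("fit parameters:", theta_hat, "chi^2 =", chi2)
```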
Reduced \(\chi^2\) and what it means to be a “good” fit
\(\chi^2\) by itself for a particular fit does not tell us much: should it be very small? Should it be very big? A common (but understandable) misconception is that since \(\chi^2\) comes from least squares, obtaining the smallest value of \(\chi^2\) should correspond to a good fit, since that worked for least squares. However, this is not true! The quantity you need to check against is the so-called reduced chi-squared, defined by
\[\chi^2_{\nu} := \frac{\chi^2}{\nu}\]where
\[\begin{equation}\nu=N-p\end{equation}\]is called the number of degrees of freedom, which you can generally find by subtracting the number of independent parameters in the model from the number of data points. For example: if you had 10 data points that you’re trying to fit a line to, then \(\nu=10-2=8\). The degrees of freedom are crucial to our fit statistic since they are what allow us to decide how good our fit is after evaluating the fit’s \(\chi^2\).
What value of \(\chi^2_{\nu}\) is good? It turns out that the expected value (average denoted by \(\mathbb{E}\)) of \(\chi^2_{\nu}\) for a good fit is unity, i.e
\[\mathbb{E}[\chi^2_{\nu}] \approx 1\](with equality for a linear model in the parameters) meaning that a fit is “good” if
\[\chi^2_{\nu} \approx 1\]We will talk about why this is what we should expect under certain conditions later down the section.
Now that we have answered how to use and implement \(\chi^2\), we must address the elephant in the room: when do we use it?
When \(\chi^2\) goodness of fit is used
From what we did above, it might seem like \(\chi^2\) is a very well defined and ready-to-use statistic; however, it turns out that it is not a universally good fit statistic! More precisely, our assumption that a model \(f\) with \(p\) parameters fit to \(N\) data points has \(N-p\) degrees of freedom is not always true, and in that case our comparison against \(\chi^2_\nu\approx 1\) would be void.
Example: Independent parameters?
Consider the following simple three parameter model
\[f(x;A,B,C)= A+ (B+C) x\]After fitting our parameters and evaluating \(\chi^2\) at those parameters, we evaluate the degrees of freedom \(\nu=N-3\) to eventually check the value of \(\chi^2_{\nu}\). But wait! Is \(p=3\)? Does our model really have three independent parameters? Can your fit really separate \(B\) from \(C\) from the data? How could it? It turns out that it can’t: we say that \(B\) and \(C\) are not independent of each other. The maximum number of independent parameters for this kind of model is two, as in \(f(x;M,C)=C+ M x\). In hindsight, we intuitively know that this model should only have two parameters: it’s just the general equation of a line! A quick numerical check of this degeneracy is sketched below.
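Here is a quick sketch of the degeneracy with made-up data; depending on your SciPy version you may see a warning that the covariance could not be estimated, or simply enormous uncertainties on \(B\) and \(C\):

```python
import numpy as np
from scipy.optimize import curve_fit

def degenerate(x, A, B, C):
    # Only the combination B + C matters, so B and C are not independent.
    return A + (B + C) * x

rng = np.random.default_rng(1)
x = np.linspace(0, 5, 15)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=x.size)

theta_hat, cov = curve_fit(degenerate, x, y, p0=[1.0, 1.0, 1.0])
# The data only constrains B + C, so SciPy typically cannot estimate the
# covariance (expect infs and/or an OptimizeWarning).
print(theta_hat)
print(np.sqrt(np.diag(cov)))
```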
Even if we’re careful with the problem of independent parameters, we are still not guaranteed \(\nu=N-p\). For instance, Andrae et al. in Do’s and don’ts of reduced chi-squared [Eq. (10)]3 consider a pathological example that is particularly illuminating:
Example: Fitting to noise
Consider the three parameter model
\[f(x; A,B,C)= A \cos(Bx+C).\]It turns out that this model is capable of fitting any data, provided no two data points have identical \(x\) values. Have linear data? Use Eq. (10). Have exponential data? Use Eq. (10) … see what I mean? Why is that intuitively so, you ask? Because given any data, you can increase the amplitude so the cosine spans the full range of \(y\) values and then increase the frequency so that it wiggles densely enough to pass near every point. This means that \(\chi^2\) can be made arbitrarily close to zero if you let the optimizer search over unbounded \(A,B,C\).
These examples illustrate the common pitfalls of \(\chi^2\) and are a cautionary tale in our quest to fit data. However, in most cases (polynomial, exponential, Gaussian & Lorentzian fits), under most reasonable circumstances (viz. the parameters are finite and sometimes non-zero), \(\chi^2\) is a pretty reasonable and useful fit statistic. I hope that this adventure has taught you a great deal about fitting and given you a chance to appreciate the nuances behind curve fitting. I will conclude with an optional section on why \(\chi^2_\nu\approx 1\).
Why do we expect \(\chi^2_{\nu}\approx 1\)?
The question can be answered in two parts. For a general non-linear model with \(p\) independent parameters that behaves nicely (i.e. avoids the issues of \(\chi^2\) above), we can Taylor expand the model around the truth, denoted by \(\boldsymbol{\theta}_0\), and we get a linear model. If we assume that our model is well behaved, then our solution from the optimization \(\boldsymbol{\hat{\theta}}\) should be close to the truth and therefore it makes sense to expand about it to linear order. Linearization therefore gives us a linear model locally, and allows us to work with a linear model. If we show that for a linear model \(\mathbb{E}[\chi^2]=N-p\), then we should expect our non-linear model to have \(\mathbb{E}[\chi^2]\approx N-p\) given that \(\boldsymbol{\hat{\theta}}\approx \boldsymbol{\theta}_0\).
Given these conditions, it is sufficient to work with a general linear model in the \(p\) parameters
\[f(x;\boldsymbol{\theta})= \boldsymbol{a}(x)\cdot \boldsymbol{\theta}=a_1(x) \theta_1+ a_2(x) \theta_2+\dots+a_p(x)\theta_p,\]where the \(a_j(x)\) are arbitrary nice functions of \(x\), possibly different for different \(j\).
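For instance, the straight line from the \(V\) vs \(I\) example fits this template with \(p=2\) parameters \(\boldsymbol{\theta}=(m,c)\):
\[f(x;m,c)= \underbrace{x}_{a_1(x)}\, m+\underbrace{1}_{a_2(x)}\, c.\]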
Consider now the residual vector \(\boldsymbol{r}\) with components \(i=1,\dots,N\), where \(N\) is the number of data points you collect; then \(\chi^2\) can be written in terms of \(\boldsymbol{r}\) as
\(\chi^2(\boldsymbol{\theta})=\boldsymbol{r}(\boldsymbol{\theta})\cdot \boldsymbol{r}(\boldsymbol{\theta}),\) with components
\[r_i= \frac{y_i- f(x_i;\boldsymbol{\theta})}{\sigma_i}.\]Since each component involves a sum over the \(a_j(x_i)\), we can express the whole vector \(\boldsymbol{r}\) in terms of a rectangular matrix \(A\) of size \(N\times p\) as
\[\boldsymbol{r}(\boldsymbol{\theta})= \boldsymbol{z}-A \boldsymbol{\theta}\]where
\[\begin{align} z_i&=\frac{y_i}{\sigma_i},\\ A_{ij}&= \frac{a_j(x_i)}{\sigma_i}. \end{align}\]Therefore, minimizing \(\chi^2\) is equivalent to the minimization problem
\[\chi^2(\boldsymbol{\theta})= (\boldsymbol{z}-A \boldsymbol{\theta})\cdot(\boldsymbol{z}-A \boldsymbol{\theta})= \vert\vert \boldsymbol{z}-A\boldsymbol{\theta}\vert\vert^2.\]How do we minimize this? Use calculus! Taking the partial derivative with respect to each of the \(p\) parameters \(\theta_j\) and setting them to \(0\), we have
\[0=-2 A^T(\boldsymbol{z}-A \boldsymbol{\theta})\]which implies that the optimal solution is given explicitly as
\[\boldsymbol{\hat{\theta}}= (A^T A)^{-1} A^T\boldsymbol{z}\]provided \(A^T A\), which is a \(p\times p\) matrix, is invertible. Plugging this back into \(\chi^2\), we see that the minimum value of \(\chi^2\) for a fixed data set, evaluated at the optimal solution, is
\[\chi^2(\hat{\boldsymbol{\theta}})= \boldsymbol{z}^T(I_N-A (A^T A)^{-1} A^T)^T(I_N-A (A^T A)^{-1} A^T)\boldsymbol{z}=\boldsymbol{z}^T(I_N-A (A^T A)^{-1} A^T)\boldsymbol{z},\]where the last equality holds because the matrix \(I_N-A (A^T A)^{-1} A^T\) is symmetric and squares to itself. Notice that the right hand side does not explicitly depend on \(\boldsymbol{\hat{\theta}}\) but rather on the matrix \(A\), which is constructed from your data set. Now imagine fitting to another data set; because it’s a different data set, we’d get a different \(\boldsymbol{\hat{\theta}}^\prime\), and so on and so forth. Taking all those data sets and averaging the minimal \(\chi^2\) over all of them is equivalent to taking the expectation value \(\mathbb{E}[\chi^2]\), which gives
\[\mathbb{E}[\chi^2]= \mathbb{E}[\boldsymbol{z}^T(I_N-A (A^T A)^{-1} A^T)\boldsymbol{z}].\]What now? It would seem like we’re stuck since we’ve got no information about \(\boldsymbol{z}\). But fret not! It turns out that if we know the truth, then the data points we collect fluctuate as
\[z_i=\frac{y_i}{\sigma_i}= \frac{f(x_i;\boldsymbol{\theta}_0)}{\sigma_i}+ \epsilon_i\]where \(\epsilon_i\sim \mathcal{N}(0,1)\) which follows from our assumption that the errors are Gaussian. In our linear case,
\[\frac{f(x_i;\boldsymbol{\theta}_0)}{\sigma_i}=(A\boldsymbol{\theta}_0)_i.\]In vector form,
\[\boldsymbol{z}= A\boldsymbol{\theta}_0+ \boldsymbol{\epsilon}\]where \(\boldsymbol{\epsilon}\) is a “random” vector with components centered around \(0\).
Thus, our \(\chi^2\) is written as
\[\chi^2(\hat{\boldsymbol{\theta}})=\boldsymbol{z}^T(I_N-A (A^T A)^{-1} A^T)\boldsymbol{z}=\boldsymbol{\epsilon}^T(I_N-A (A^T A)^{-1} A^T)\boldsymbol{\epsilon},\]since \((I_N-A (A^T A)^{-1} A^T)A\boldsymbol{\theta}_0=A\boldsymbol{\theta}_0-A\boldsymbol{\theta}_0=0\) and the \(A\boldsymbol{\theta}_0\) piece drops out. Now let’s do some mathmagic! Writing \(M:=I_N-A (A^T A)^{-1} A^T\) and expanding the product \(\boldsymbol{\epsilon}^T M \boldsymbol{\epsilon}\), we get
\[\mathbb{E}[\boldsymbol{\epsilon}^T M \boldsymbol{\epsilon}] = \mathbb{E}\left[\sum_{i,j=1}^N M_{ij}\epsilon_i\epsilon_j\right].\]We use the linearity of the expectation value and the key property that the expectation of the product of two different (independent) standard Gaussian random variables is zero, while the expectation of one squared is 1,
\[\mathbb{E}[\epsilon_i \epsilon_j]= \begin{cases}1 & \text{if } i=j,\\ 0 & \text{if } i\neq j.\end{cases}\]which gives
\[\mathbb{E}\left[\sum_{i,j=1}^N M_{ij}\epsilon_i\epsilon_j\right]=\sum_{i=1}^NM_{ii}:= {\rm tr}(M),\]i.e. the trace of the matrix. Using the linearity of the trace, \({\rm tr}(B+D)= {\rm tr}(B)+{\rm tr}(D)\), and its cyclic property, \({\rm tr}(BD)={\rm tr}(DB)\), we have
\[\mathbb{E}[\chi^2]=\mathbb{E}[\boldsymbol{\epsilon}^T(I_N-A (A^T A)^{-1} A^T)\boldsymbol{\epsilon}]= {\rm tr}(I_N)-{\rm tr}\left((A^T A)^{-1} A^TA\right)=N-{\rm tr}(I_p)=N-p.\]For non-linear models, we linearize around the optimal solution \(\boldsymbol{\hat{\theta}}\) and the argument follows identically to yield
\[\mathbb{E}[\chi^2]\approx N-p.\]
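To make this result tangible, here is a small numerical sketch (using an arbitrary straight-line design matrix) that checks the trace identity and estimates \(\mathbb{E}[\chi^2]\) by averaging over many simulated Gaussian noise vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 15, 2  # data points and parameters (a straight line: m*x + c)

x = np.linspace(0, 5, N)
sigma = np.full(N, 1.0)

# Design matrix A_ij = a_j(x_i) / sigma_i for the linear model m*x + c.
A = np.column_stack([x, np.ones(N)]) / sigma[:, None]

# M = I_N - A (A^T A)^{-1} A^T; its trace should equal N - p.
M = np.eye(N) - A @ np.linalg.inv(A.T @ A) @ A.T
print("tr(M) =", np.trace(M), " N - p =", N - p)

# Monte Carlo check of E[chi^2] = N - p: chi^2 = eps^T M eps with eps ~ N(0, 1).
eps = rng.normal(size=(20_000, N))
chi2 = np.einsum("ti,ij,tj->t", eps, M, eps)
print("mean chi^2 over simulated data sets:", chi2.mean())
```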