What the heck is \(\chi^2\) anyway?

How do you fit a curve through data with errors? What’s a statistic? And, most importantly, what is being squared?

These are important questions that you might encounter when you look up curve fitting! This is a continuation of how to fit data: I answer the first of these questions in Least-squares and fitting curves through points when the data is fixed. Before getting into what a statistic is, let’s begin by answering the most common and important question here:

What is being squared?

Nothing! This is a trick question used by cruel people. Unfortunately, this is a great example of an abuse of notation: \(\chi^2\) is a variable, just like \(x\) or \(y\). We could have called it \(\chi\), but for historical reasons K. Pearson, a giant in statistics, called it \(\chi^2\) probably to indicate that this quantity is \(\geq 0\).

Now that we’ve tackled this important question, let us motivate the statistic with the same example we used previously:

You (and your friend) are trying to measure the resistance of a resistor by studying its Voltage vs Current using the relation \(V=I R\). Unfortunately, your equipment isn’t the greatest and you get noisy data that is “roughly” linear. On top of that, your two sets of measurements aren’t identical, i.e. not only do the points not fall on a line, they differ between the two of you!

{
  "data": [
    {
      "function": "randn(2*x, 1)",
      "xmin": 0,
      "xmax": 5,
      "points": 15,
      "mode": "markers",
      "name": "Trial 1",
      "marker": {
        "size": 10,
        "color": "#a463bf"
      }
    }
  ],
  "layout": {
    "title": {
      "text": "First Measurement"
    },
    "xaxis": {
      "title": "Current (A)"
    },
    "yaxis": {
      "title": "Voltage (V)"
    }
  }
}
{
  "data": [
    {
      "function": "randn(2*x, 1)",
      "xmin": 0,
      "xmax": 5,
      "points": 15,
      "mode": "markers",
      "name": "Trial 2",
      "marker": {
        "size": 10,
        "color": "#a463bf"
      }
    }
  ],
  "layout": {
    "title": {
      "text": "Second Measurement"
    },
    "xaxis": {
      "title": "Current (A)"
    },
    "yaxis": {
      "title": "Voltage (V)"
    }
  }
}

It seems that there isn’t just an uncertainty that prevents the data from landing perfectly on a line; each data point also carries its own uncertainty.

Earlier we laid out the ideas behind curve fitting, but now we can answer the more physical question of fitting data with uncertainties, which is where \(\chi^2\) comes into play. The \(\chi^2\) statistic (pronounced “chi-squared”) is a metric used to assess a fit of a collection of data with errors. The introduction of errors changes the way we think of the optimization problem! Keep in mind the following remark before we proceed:

Remark

Unlike least squares which was purely an optimization problem that fits curves to a given set of data, \(\chi^2\) fitting additionally accounts for the errors the data carries. Basically, \(\chi^2\) fitting is an optimization problem with statistics.

To work with \(\chi^2\), we first need some important ideas from statistics so we’ll start here. If you want to skip to using it, then click how to use \(\chi^2\) goodness of fit.

Some statistics overview

Dr. Strange and the very many different outcomes

You put on your thinking cap and recognize that each data point you measure is really one possible outcome that you experience. Like Dr. Strange using his ability to view various possible multiverses, you view each data point you see as a specific outcome, and the collection of all possible outcomes together is what you would get if you repeated the experiment infinitely many times. In this sense, we say that the data points are sampled from a distribution which controls the probability of each measurement showing up as a specific value. Mathematically, what this means is that all the data points \(y_i\) are random variables.

What this implies is that the more data you measure, the more information you get about each data point and therefore the more accurate your results become. The true value of each data point should be approximately the mean of all the data you have for that particular data point; its variation is the standard error of the mean, and you plot the two together using error bars.

To get an intuitive feel for what this means, let’s think about it in terms of an Excel sheet. If you imagine an Excel sheet keeping track of your measured data, with each row corresponding to its trial number and each column being the measured values of the \(i^{\rm th}\) data point \(y_i\), then isolating a column (say the first column, for \(y_1\)) over all trials gives you the different possible values that the data point \(y_i\) can take. How do we use this information? We average!

The idea is to collapse the information of multiple plots by aggregating a specific column and plotting the corresponding averages as data points \(\bar{y}_i\)

{
  "data": [
    {
      "function": "randn(2*x, 1)",
      "xmin": 0,
      "xmax": 5,
      "points": 15,
      "mode": "markers",
      "name": "Data ",
      "marker": {
        "size": 10,
        "color": "#a463bf"
      },
      "error_y": {
        "type": "constant",
        "value": 1.0,
        "visible": true,
        "color": "#a463bf",
        "thickness": 2,
        "width": 4
      }
    }
  ],
  "layout": {
    "title": {
      "text": "Voltage vs Current with Error Bars"
    },
    "xaxis": {
      "title": "Current (A)"
    },
    "yaxis": {
      "title": "Voltage (V)"
    }
  }
}

where we call the error bar of size \(\sigma_{\bar{y}_i}\) the standard error of the mean for that column.

Population and sample data sets, standard errors and the truth!

In statistics, a lot of similar-sounding terms get thrown around, which can lead to confusion. Here I would like to clarify the difference between the population and sample data sets. Quantities like the “population mean” and “sample standard deviation” are known as statistics.

In a nutshell, the population data set is the true data set: everyone and everything that you seek to study is present in it. Its mean (called the population mean, denoted by \(\mu\)) and standard deviation (called the population standard deviation, denoted by \(\sigma\)) are the true or correct statistics, i.e. they are the underlying truth. However, in real life it’s often not possible to get the whole collection, so we work with a small subset called the sample data set, i.e. your data set. Ultimately, everything we do and measure is to estimate the true statistics. Keep in mind:

  • The sample mean \(\bar{s}\) corresponds to the mean of your column, it is what we call an “estimator” of the true population mean

  • The sample standard deviation \(s\) is the standard deviation of your column computed using Bessel’s correction, it is also an “estimator” of the true population standard deviation.

  • The sample standard error, which is the standard deviation of the mean given by

    \[\sigma_{\bar{s}}= \frac{s}{\sqrt{N}}.\]

The sample mean and standard deviation are pretty intuitive; the standard error, however, is often misunderstood due to its confusing notation. Measuring a bunch of noisy data gives you an average, which is known as the sample mean. The spread of the data is characterized by the sample standard deviation. When computing averages, you might have noticed that the averages of different data sets are “close” but not the same, i.e. the averages themselves also spread. This spread, the spread of the means across different data sets, is called the standard error of the mean, and is given by

\[\sigma_{\mu}= \frac{\sigma}{\sqrt{N}}\]

if you know the true population standard deviation \(\sigma\); in practice we approximate it by the sample standard error \(\sigma_{\bar{s}}\).
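To make these three estimators concrete, here is a minimal numpy sketch for one simulated column of repeated readings (the true mean 2.0 and spread 0.5 are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated repeated readings of one data point (one Excel column)
column = rng.normal(loc=2.0, scale=0.5, size=50)

sample_mean = column.mean()              # estimator of the population mean
sample_std = column.std(ddof=1)          # Bessel's correction: divide by N-1
sem = sample_std / np.sqrt(len(column))  # sample standard error of the mean

print(sample_mean, sample_std, sem)
```

Note the `ddof=1` argument: numpy divides by \(N\) by default, so Bessel’s correction has to be requested explicitly.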

Now that we know a method to encode the information from multiple trials through aggregating, we notice that for different data sets we get different means \(\bar{y}_i\), i.e. the means are “random”. In other words, the means themselves are random variables. So we ask:

What distribution does the data point \(\bar{y}_i\) come from?

This is a genuinely hard question that is difficult to answer with absolute certainty. The simplest answer is to make an educated guess. However, given certain reasonable conditions, something truly spectacular happens: you can assume that the data point \(\bar{y}_i\) comes from a Normal distribution \(\mathcal{N}(\mu_i,\sigma_i)\) through something called the Central Limit Theorem.

Before we venture into this, let’s think about the weight of this conclusion. Out of the many distributions you know, viz. the humble uniform distribution, the Poisson distribution (discrete), the exponential distribution (continuous), or the many, many more distributions you most likely haven’t heard of1, like the Beta distribution or the Continuous Bernoulli distribution, we arrive at the seemingly innocent Normal distribution!

This Normal distribution or the “bell shaped curve” with mean \(\mu\) and variance \(\sigma^2\) is given by the following equation

\[\mathcal{N}(\mu,\sigma^2)(x):= \frac{1}{\sqrt{2\sigma^2 \pi}}e^{\frac{-(x-\mu)^2}{2\sigma^2}}\]

and is the renowned “Gaussian distribution”, normalized so that

\[\int_{-\infty}^{\infty} \mathcal{N}(\mu,\sigma^2)(x)dx=1.\]

The equation for this distribution describes the probability density of observing a value \(x\). If a random variable is governed by the normal distribution, then the probability that this variable takes a value between two numbers \(a<b\) is exactly the area under the curve!
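The area-under-the-curve statement is easy to check numerically. A short Python sketch (the standard normal and the interval \([-1,1]\) are arbitrary choices for illustration):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma = 0.0, 1.0   # arbitrary choice: the standard normal
a, b = -1.0, 1.0       # probability of landing within one sigma of the mean

# Area under the density between a and b, by numerical integration
area, _ = quad(lambda x: norm.pdf(x, loc=mu, scale=sigma), a, b)

# The same probability via the cumulative distribution function
exact = norm.cdf(b, mu, sigma) - norm.cdf(a, mu, sigma)

print(area, exact)     # both ≈ 0.6827, the familiar "68%" rule
```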

The Central Limit Theorem implies that the sample mean we were measuring \(\bar{y}_i\) using \(N\) trials comes from a normal distribution represented mathematically as

\[\bar{y}_i \sim \mathcal{N}\left(\mu_i,\frac{\sigma_i^2}{N}\right),\]

where \(\mu_i\) is the true population mean and \(\sigma_i\) the true population standard deviation. Mathematically, this means that our measured sample mean follows a normal distribution whose mean is the true population mean and whose standard deviation is the standard error of the mean, \(\sigma_i/\sqrt{N}\). In the voltage vs current scenario, the population mean is the “true” reading of voltage at a fixed current, and the standard deviation describes the width of the fluctuations around that value. In practice, the population mean and standard deviation are unknown, but we can approximate them using our data. Taking more and more data, we can take averages, and if the errors are Gaussian, then the average of our voltages at a fixed current converges to a number while its width, the standard error, becomes small.

For our fitting purposes however, we only estimate the standard deviation of each data point since the “mean” for our data points is what we want our optimization to pick out.

Central Limit Theorem: Why the Normal distribution?

The Central Limit Theorem (CLT) is what we used to justify the emergence of the Normal distribution. This theorem is colloquially stated along the lines of “more data means more bell shaped” but this isn’t quite the correct statement of the theorem and can lead to a lot of misunderstandings. There are a few conditions for the CLT to hold, and they are nicely discussed in 3B1B’s “But what is the Central Limit Theorem?” that I will paraphrase below:

  1. Independence: The data you collect should be independent of each other, i.e measuring one does not affect the next measurement.

  2. Same distribution: The data you collect, i.e the data points in the columns in your excel sheet should come from the same distribution.

  3. Finite variance: Your error bars must be finite, i.e more data does not increase the variance you observe for a single column.

Denote by \(y_i^{(j)}\) the data point in the \(i^{\rm th}\) column and \(j^{\rm th}\) row. Given the above conditions, the CLT states that in the limit as you perform many, many trials (i.e. the number of rows \(N\) goes to infinity),

the ratio of the difference between the column average

\[\frac{\sum_{j=1}^N y_i^{(j)}}{N}\]

and the true population mean \(\mu_i\), to the population standard error \(\sigma_i/\sqrt{N}\), follows

\[\lim_{N\to \infty} \frac{\frac{\sum_{j=1}^N y_i^{(j)}}{N}- \mu_i}{\frac{\sigma_i}{\sqrt{N}}}\sim \mathcal{N}(0,1),\]

i.e. a Normal distribution with mean zero and standard deviation one!

This means that the average of a fixed column, in the limit of many data points, follows a normal distribution around the population mean \(\mu_i\) with a variance that is the square of the standard error of the mean:

\[\lim_{N\to \infty} \frac{\sum_{j=1}^N y_i^{(j)}}{N}\sim \mathcal{N}\left(\mu_i,\frac{\sigma_i^2}{N}\right)=\mathcal{N}\left(\mu_i,\sigma^2_{\mu}\right)\]
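The theorem is easy to watch in action. A small simulation sketch (the uniform distribution and the sizes are arbitrary choices): draw many data sets from a decidedly non-normal distribution and check that the sample means cluster around \(\mu\) with spread \(\sigma/\sqrt{N}\):

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials = 100, 20000
# Draw from a decidedly non-normal distribution: uniform on [0, 1]
samples = rng.uniform(0.0, 1.0, size=(trials, N))
means = samples.mean(axis=1)        # one sample mean per simulated data set

mu, sigma = 0.5, np.sqrt(1 / 12)    # population mean and std of U(0, 1)
print(means.mean())                 # ≈ mu
print(means.std(ddof=1))            # ≈ sigma / sqrt(N), the standard error
```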

All that jazz just to say: for our fitting purposes, we need to assume (with reasonable doubt) that the errors we measure or estimate are Gaussian, and of course, that we need to account for them.

Since we’re right at the intersection of pure mathematics where everything is “perfect” and experimental data where nature is unwieldy, I want to caution you to always challenge your beliefs with a relevant quote2 from the French physicist G. Lippmann (Nobel Prize 1908) paraphrased by the prominent mathematician H. Poincaré:

“Tout le monde y croit cependant, me disait un jour M. Lippmann, car les expérimentateurs s’imaginent que c’est un théorème de mathématiques, et les mathématiciens que c’est un fait expérimental.”


“Everyone is sure of this [that errors are normally distributed], Mr. Lippmann told me one day, since the experimentalists believe that it is a mathematical theorem, and the mathematicians that it is an experimentally determined fact.” -G. Lippmann to H. Poincaré (Calcul des probabilités 1912, p. 171)

How \(\chi^2\) goodness of fit is used

But how do we account for these errors in our fit? If we didn’t have errors on our data points, we’d stick to our usual method and minimize the least squares. Essentially, we’re trying to make the distance between our model \(f(x;\boldsymbol{\theta})\) and our data points \(y_i\) small; but with errors, what does “small” even mean now, i.e. small with respect to what?

In the case where data points have errors, a natural “scale” to compare against is the fluctuations: the error \(\sigma_i\) of each data point. The error is estimated using the sample standard deviation from multiple trials of a data point, or guesstimated based on the order of magnitude. With this scale in mind, we minimize a sort of modified or scaled least squares, which is what we call \(\chi^2\), defined as

\[\begin{equation}\chi^2 :=\sum_{i=1}^N \left(\frac{y_i- f(x_i;\boldsymbol{\theta})}{\sigma_i}\right)^2.\end{equation}\]

If we’re fitting a line to our data points with errors (as in the example of \(V\) vs \(I\)), then we’d use \(f(x;m,c):=m x +c\) with the free parameters of the line, the slope \(m\) and intercept \(c\), where \(\boldsymbol{\theta}=(m,c)\). Since the errors \(\sigma_i\) are just numbers (estimated or guesstimated) and come in as a multiplicative factor on the residuals, the procedure to find the optimal solution is identical to what we did in least squares and fitting curves through points. Since \(\chi^2\) is just “least squares with errors for each data point”, we’d still use scipy.optimize.curve_fit() in Python, now passing sigma (and making sure to set absolute_sigma=True). Earlier, the smaller our least squares, the better our data fit our model, so we might naturally think that a smaller \(\chi^2\) means better results; the question now is whether this assumption still holds.

Note: scipy.optimize.curve_fit() does not return \(\chi^2\), it returns the parameters of your model i.e your fit. You’d have to compute \(\chi^2\) by using those parameters in Eq. (1)
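Here is a hedged sketch of that workflow on simulated data (the true slope 2.0, intercept 1.0, and per-point errors of 0.3 are made up for illustration): fit with scipy.optimize.curve_fit, then compute \(\chi^2\) from Eq. (1) yourself:

```python
import numpy as np
from scipy.optimize import curve_fit

def f(x, m, c):
    # The line V = m*I + c
    return m * x + c

rng = np.random.default_rng(2)
x = np.linspace(0.5, 5.0, 15)
sigma = np.full_like(x, 0.3)              # assumed per-point errors
y = 2.0 * x + 1.0 + rng.normal(0, sigma)  # simulated noisy voltages

# curve_fit returns the parameters and their covariance, not chi^2
theta, cov = curve_fit(f, x, y, sigma=sigma, absolute_sigma=True)

chi2 = np.sum(((y - f(x, *theta)) / sigma) ** 2)  # Eq. (1) at the optimum
nu = len(x) - 2                                   # degrees of freedom
print(theta, chi2 / nu)                           # chi2/nu should be ~ 1
```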

Reduced \(\chi^2\) and what it means to be a “good” fit

\(\chi^2\) by itself for a particular fit does not tell us much: should it be very small? should it be very big? A common (but understandable) misconception is that since \(\chi^2\) comes from least squares, obtaining the smallest value of \(\chi^2\) should correspond to a good fit since that worked for least squares. However, this is not true! The quantity you need to check against is the so called reduced chi-squared defined by

\[\chi^2_{\nu} := \frac{\chi^2}{\nu}\]

where

\[\begin{equation}\nu=N-p\end{equation}\]

is called the degrees of freedom, which you can generally find by subtracting the number of independent parameters of the model from the number of data points. For example: if you have 10 data points that you’re trying to fit a line to, then \(\nu=10-2=8\). The degrees of freedom are crucial to our fit statistic since they are what allow us to decide how good our fit is after evaluating the fit’s \(\chi^2\).

What value of \(\chi^2_{\nu}\) is good? It turns out that the expected value (average denoted by \(\mathbb{E}\)) of \(\chi^2_{\nu}\) for a good fit is unity, i.e

\[\mathbb{E}[\chi^2_{\nu}] \approx 1\]

(with equality for a linear model in the parameters) meaning that a fit is “good” if

\[\chi^2_{\nu} \approx 1\]

We will talk about why this is what we should expect, under certain conditions, later in this section.

Now that we have answered how to use and implement \(\chi^2\), we must address the elephant in the room: when do we use it?

When \(\chi^2\) goodness of fit is used

From what we did above, it might seem like \(\chi^2\) is a well-defined and ready-to-use statistic; however, it turns out that it is not a universally good fit statistic! More precisely, our assumption that a model \(f\) with \(p\) parameters and \(N\) data points has \(N-p\) degrees of freedom is not always true, and in that case our comparison \(\chi^2_\nu\approx 1\) would be void.

Example: Independent parameters?

Consider the following simple three parameter model

\[f(x;A,B,C)= A+ (B+C) x\]

After fitting our parameters and evaluating \(\chi^2\) at those parameters, we evaluate the degrees of freedom \(\nu=N-3\) to eventually check the value of \(\chi^2_{\nu}\) and draw our conclusions from it. But wait! Is \(p=3\)? Does our model really have three independent parameters? Can your fit really separate \(B\) from \(C\) using the data? How could it? It turns out that it can’t: we say that \(B\) and \(C\) are not independent of each other. The maximum number of independent parameters for this kind of model is two, as in \(f(x;M,C)=C+ M x\). In hindsight, we intuitively know that this model should only have two parameters; it’s just the general equation of a line!
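One quick way to catch this kind of degeneracy in practice is to look at the rank of the design matrix whose columns multiply each parameter; a short numpy sketch (the grid of \(x\) values is arbitrary):

```python
import numpy as np

x = np.linspace(0, 5, 10)
# Design matrix columns for the parameters (A, B, C) of f = A + (B + C) x
A_mat = np.column_stack([np.ones_like(x), x, x])

# Rank 2, not 3: the columns for B and C are identical, so the
# fit cannot distinguish the two parameters
print(np.linalg.matrix_rank(A_mat))
```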

Even if we’re careful with the problem of independent parameters, we are still not guaranteed \(\nu=N-p\). For instance, Andrae et al. in Do’s and don’ts of reduced chi-squared [Eq. (10)] consider a pathological example that is particularly illuminating3

Example: Fitting to noise

Consider the three parameter model

\[f(x; A,B,C)= A \cos(Bx+C).\]

It turns out that this model is capable of fitting any data, provided no two data points have identical \(x\) values. Have linear data? Use Eq. (10). Have exponential data? Use Eq. (10) … see what I mean? Why is that intuitively so, you ask? Because given any data, we can tune the amplitude to capture all the points that range over a fixed \(y\) interval and then increase the frequency to densely wiggle near them. This means that \(\chi^2\) can be arbitrarily close to zero if you let the optimizer search over unbounded \(A,B,C\).

These examples illustrate the common pitfalls of \(\chi^2\) and serve as a cautionary tale in our quest to fit data. However, in most cases (polynomial, exponential, Gaussian & Lorentzian fits), under most reasonable circumstances (viz. the parameters are finite and sometimes non-zero), \(\chi^2\) is a pretty reasonable and useful fit statistic. I hope that this adventure has taught you a great deal about fitting and given you a chance to appreciate the nuances behind curve fitting. I will conclude this article with an optional section on why \(\chi^2_\nu\approx 1\).

Why do we expect \(\chi^2_{\nu}\approx 1\)?

The question can be answered in two parts. For a general non-linear model with \(p\) independent parameters that behaves nicely (i.e. avoids the issues of \(\chi^2\)), we can Taylor expand the model around the truth, denoted \(\boldsymbol{\theta}_0\), and we get a linear model. If we assume that our model is well behaved, then our solution from the optimization, \(\boldsymbol{\hat{\theta}}\), should be close to the truth, and therefore it makes sense to expand about the truth to linear order after fitting, since \(\vert\vert \boldsymbol{\theta}_0-\boldsymbol{\hat{\theta}}\vert\vert^2\approx 0\). For this local linear model, if we show that \(\mathbb{E}[\chi^2]=N-p\), then we should expect our non-linear model to have \(\mathbb{E}[\chi^2]\approx N-p\) given that \(\boldsymbol{\hat{\theta}}\approx \boldsymbol{\theta}_0\).

Given these conditions, it is sufficient to work with a general linear model in the \(p\) parameters

\[f(x;\boldsymbol{\theta})= \boldsymbol{a}(x)\cdot \boldsymbol{\theta}=a_1(x) \theta_1+ a_2(x) \theta_2+\dots+a_p(x)\theta_p,\]

where the \(a_j(x)\) are arbitrary nice functions of \(x\), possibly different for different \(j\).

Consider now the residual vector \(\boldsymbol{r}\) with components \(i=1,\dots,N\), where \(N\) is the number of data points you collect; then \(\chi^2\) can be written in terms of \(\boldsymbol{r}\) as

\[\chi^2(\boldsymbol{\theta})=\boldsymbol{r}(\boldsymbol{\theta})\cdot \boldsymbol{r}(\boldsymbol{\theta}),\]

with components

\[r_i= \frac{y_i- f(x_i;\boldsymbol{\theta})}{\sigma_i}.\]

Since the components are summed over \(a_j(x_i)\) for each \(i\), we express the whole vector \(\boldsymbol{r}\) in terms of a rectangular matrix \(A\) of size \(N\times p\) as

\[\boldsymbol{r}(\boldsymbol{\theta})= \boldsymbol{z}-A \boldsymbol{\theta}\]

where

\[\begin{align} z_i&=\frac{y_i}{\sigma_i},\\ A_{ij}&= \frac{a_j(x_i)}{\sigma_i}. \end{align}\]

Therefore, minimizing \(\chi^2\) is equivalent to the minimization problem

\[\chi^2(\boldsymbol{\theta})= (\boldsymbol{z}-A \boldsymbol{\theta})\cdot(\boldsymbol{z}-A \boldsymbol{\theta})= \vert\vert \boldsymbol{z}-A\boldsymbol{\theta}\vert\vert^2.\]

How do we minimize this? Use calculus! Taking the partial derivatives with respect to \(\theta_i\) for all \(p\) parameters and setting them to \(0\), we have

\[0=-2 A^T(\boldsymbol{z}-A \boldsymbol{\theta})\]

which implies that the optimal solution is given explicitly as

\[\boldsymbol{\hat{\theta}}= (A^T A)^{-1} A^T \boldsymbol{z}\]

if \(A^T A\), which is a \(p\times p\) matrix, is invertible. This means that

\[\boldsymbol{z}- A \boldsymbol{\hat{\theta}}= \boldsymbol{z}- A (A^T A)^{-1} A^T \boldsymbol{z}=\left(I_N-A (A^T A)^{-1} A^T\right)\boldsymbol{z}.\]

Plugging this back in, we see that the minimum value of \(\chi^2\) for a fixed data set, evaluated at the optimal solution, is

\[\chi^2(\hat{\boldsymbol{\theta}})= \boldsymbol{z}^T(I_N-A (A^T A)^{-1} A^T)^T(I_N-A (A^T A)^{-1} A^T)\boldsymbol{z}=\boldsymbol{z}^T(I_N-A (A^T A)^{-1} A^T)\boldsymbol{z}.\]

where \(I_N\) is the \(N\times N\) identity matrix, and the last equality holds because \(I_N-A (A^T A)^{-1} A^T\) is symmetric and idempotent (it is a projection matrix). Notice that the right-hand side does not explicitly depend on \(\boldsymbol{\hat{\theta}}\) but rather on the matrix \(A\), which is constructed from your data set. Now imagine fitting to another data set; because it’s a different data set, we’d get a different \(\boldsymbol{\hat{\theta}}^\prime\), and so on and so forth. Taking all those data sets and averaging the minimal \(\chi^2\) over all of them is equivalent to taking the expectation value \(\mathbb{E}[\chi^2]\), written as

\[\mathbb{E}[\chi^2]= \mathbb{E}[\boldsymbol{z}^T(I_N-A (A^T A)^{-1} A^T)\boldsymbol{z}].\]
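Before moving on, the closed-form solution \(\boldsymbol{\hat{\theta}}=(A^TA)^{-1}A^T\boldsymbol{z}\) is easy to sanity check numerically; a sketch with made-up data (the line \(2x+1\) with errors of 0.3), compared against a generic least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 15
x = np.linspace(0.5, 5.0, N)
sigma = np.full(N, 0.3)

# A_ij = a_j(x_i) / sigma_i for the line model a_1 = x, a_2 = 1
A = np.column_stack([x, np.ones(N)]) / sigma[:, None]
# z_i = y_i / sigma_i with simulated noisy data
z = (2.0 * x + 1.0 + rng.normal(0, sigma)) / sigma

# Normal-equations solution (A^T A)^{-1} A^T z
theta_hat = np.linalg.solve(A.T @ A, A.T @ z)
# Should agree with numpy's generic least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(A, z, rcond=None)
print(theta_hat)
```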

What now? It would seem like we’re stuck since we’ve got no information about \(\boldsymbol{z}\). But fret not! The data points we collect are, by assumption, fluctuating as

\[z_i=\frac{y_i}{\sigma_i}= \frac{f(x_i;\boldsymbol{\theta}_0)}{\sigma_i}+ \varepsilon_i\]

where \(\varepsilon_i\sim \mathcal{N}(0,1)\) which follows from our assumption that the errors are Gaussian. In our linear case,

\[\frac{f(x_i;\boldsymbol{\theta}_0)}{\sigma_i}=(A\boldsymbol{\theta}_0)_i.\]

In vector form,

\[\boldsymbol{z}= A\boldsymbol{\theta}_0+ \boldsymbol{\varepsilon}\]

where \(\boldsymbol{\varepsilon}\) is a “random” vector with components centered around \(\boldsymbol{0}\).

Thus, our \(\chi^2\) is written as

\[\chi^2=\boldsymbol{z}^T(I_N-A (A^T A)^{-1} A^T)\boldsymbol{z}=\boldsymbol{\varepsilon}^T(I_N-A (A^T A)^{-1} A^T)\boldsymbol{\varepsilon},\]

where we have used the fact that

\[(I_N- A (A^T A)^{-1} A^{T})(A \boldsymbol{\theta}_0+\boldsymbol{\varepsilon})= (I_N- A (A^{T}A)^{-1}A^{T}) \boldsymbol{\varepsilon}\]

Now let’s do some mathmagic! Writing \(M:=I_N-A (A^T A)^{-1} A^T\) and using the definition of the quadratic form \(\boldsymbol{x}^TM\boldsymbol{x}\), we expand \(\boldsymbol{\varepsilon}^T M \boldsymbol{\varepsilon}\) to get

\[\mathbb{E}[\boldsymbol{\varepsilon}^T M \boldsymbol{\varepsilon}] = \mathbb{E}\left[\sum_{i,j=1}^N M_{ij}\varepsilon_i\varepsilon_j\right].\]

We use the linearity of the expectation value and the key property that the expectation of the product of two independent standard Gaussian random variables is zero, while that of one squared is 1, to get

\[\mathbb{E}[\varepsilon_i \varepsilon_j]= \begin{cases}1 & \text{if } i=j,\\ 0 & \text{if } i\neq j.\end{cases}\]

which gives

\[\mathbb{E}\left[\sum_{i,j=1}^N M_{ij}\varepsilon_i\varepsilon_j\right]=\sum_{i=1}^NM_{ii}:= {\rm tr}(M),\]

i.e. the trace of a matrix. Using the linearity of the trace and its cyclic property \({\rm tr}(XY)={\rm tr}(YX)\), so that

\[{\rm tr}\left(A (A^T A)^{-1} A^T\right)= {\rm tr}\left((A^T A)^{-1} A^T A\right)= {\rm tr}(I_p)=p,\]

we have

\[\mathbb{E}[\chi^2]=\mathbb{E}[\boldsymbol{\varepsilon}^T(I_N-A (A^T A)^{-1} A^T)\boldsymbol{\varepsilon}]= {\rm tr}(I_N)-{\rm tr}(A (A^T A)^{-1} A^T)=N-p\]
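This expectation can also be verified by brute force. A Monte Carlo sketch (the design matrix and \(\boldsymbol{\theta}_0\) are arbitrary choices, with \(\sigma_i=1\)): draw \(\boldsymbol{\varepsilon}\) many times and average \(\boldsymbol{z}^T(I_N-A (A^T A)^{-1} A^T)\boldsymbol{z}\):

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, trials = 15, 2, 20000
x = np.linspace(0.5, 5.0, N)
A = np.column_stack([x, np.ones(N)])   # linear model, sigma_i = 1
H = A @ np.linalg.inv(A.T @ A) @ A.T   # projection ("hat") matrix, tr(H) = p
theta0 = np.array([2.0, 1.0])          # arbitrary "truth"

eps = rng.normal(size=(trials, N))     # standard Gaussian noise, per trial
z = theta0 @ A.T + eps                 # z = A theta0 + eps for each trial
# z^T (I - H) z, evaluated row by row
chi2 = np.sum((z @ (np.eye(N) - H)) * z, axis=1)
print(chi2.mean())                     # ≈ N - p = 13
```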

For non-linear models, we linearize about the optimal solution \(\boldsymbol{\hat{\theta}}\), which should lie near the truth \(\boldsymbol{\theta}_0\), and use the arguments presented above to approximate

\[\mathbb{E}[\chi^2]\approx N-p.\]

  1. at least I certainly hadn’t until I wrote this 

  2. which I have seen in many places incorrectly stated or incorrectly attributed 

  3. The paper is a fantastic read and I would highly recommend checking it out when you can!