Why there is a Minus One in Standard Deviations

Introduction

Standard deviations are so often calculated when averaging data that functions for them have been standard features of scientific calculators for years but there is, confusingly, a choice of two to use. On the calculator buttons, these are typically labelled "σ_n" and "σ_n-1" [for those of you without Greek fonts, that's "sigma subscript n" & "sigma subscript n-1"]. Looking in the manual does not help much, as it probably just tells you that they are called "population standard deviation" & "sample standard deviation". Even if it gives you the formulae, that only tells you that the former is the root mean square difference of the individual data from the mean of the data whereas the latter is the same thing except that, instead of dividing by n to get the root mean square, n-1 is used. So which should one use & why?
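For readers who prefer code to calculator buttons, here is a minimal Python sketch of the two formulae as described in words above. The function names and the readings are made up for illustration. Python's standard library happens to offer the same pair as statistics.pstdev (the n version) and statistics.stdev (the n-1 version), so the sketch prints both for comparison.

```python
import math
import statistics


def sigma_n(data):
    """Root mean square difference from the data mean, dividing by n."""
    mean = sum(data) / len(data)
    return math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))


def sigma_n_minus_1(data):
    """The same sum of squared differences, but dividing by n - 1."""
    mean = sum(data) / len(data)
    return math.sqrt(sum((x - mean) ** 2 for x in data) / (len(data) - 1))


readings = [9.8, 10.1, 10.4, 9.9, 10.3]  # made-up measurements

# Each line prints the hand-written version next to the library version.
print(sigma_n(readings), statistics.pstdev(readings))
print(sigma_n_minus_1(readings), statistics.stdev(readings))
```

The n-1 version always comes out slightly larger than the n version for the same data, because the same sum is divided by a smaller number.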

Which to use

The first question is easier to answer. If you know the mean value of the data from somewhere else, use the n version but if you are calculating the mean value of the data from the data itself (by summing the data & dividing by n or using the button on the calculator) use the n-1 version. Normally with a calculator you will be finding averages and standard deviations from a set of measurements so the n-1 one is the one to use. Frankly, if you have lots of data (e.g. 10 or more values) and are using standard deviations for their normal use of "standard error" reporting then just use either because the difference between them is negligible compared to other inaccuracies in calculating standard error values.
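To put a rough number on "negligible": for the same n data values the two versions differ only by the factor sqrt(n/(n-1)), which follows directly from the two formulae. The short sketch below (the function name is just for illustration) prints how much larger the n-1 version is for a few values of n.

```python
import math


def ratio(n):
    """sigma_(n-1) divided by sigma_n for the same n data values."""
    return math.sqrt(n / (n - 1))


for n in (2, 5, 10, 100):
    percent_larger = (ratio(n) - 1) * 100
    print(n, round(percent_larger, 1), "% larger")
```

At n = 10 the n-1 version is about 5% larger and at n = 100 about 0.5% larger, whereas at n = 2 the difference is over 40%.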

Okay, so that is which one to use but why?

Quick Explanation

When the mean is calculated from the n data, there are only n-1 'degrees of freedom' left to calculate the spread of the data. It would be cheating to use the other degree of freedom again.

Essentially, the mean used is not the real mean but an estimate of the mean based on the samples of data found in the experiment. This estimate is, of course, biased towards fitting the found data because that is how you got it. Therefore estimating how widely spread around the real mean the data samples are by calculating how spread around the estimated mean they are is going to give a value biased slightly too low. Using n-1 instead of n compensates.
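This bias can be seen in a small simulation. The sketch below assumes a hypothetical Gaussian rule with a real mean of 50 & a real standard deviation of 4, draws many small data sets from it and averages three mean squared differences: about the real mean, about the estimated mean dividing by n, and about the estimated mean dividing by n-1. The comparison is done on the squared values (before square rooting) because that is where the n-1 compensation is exact on average.

```python
import random

rng = random.Random(1)            # fixed seed so the run is repeatable
REAL_MEAN, REAL_SD = 50.0, 4.0    # hypothetical hidden rule
n, trials = 5, 20000

total_real = total_est_n = total_est_n1 = 0.0
for _ in range(trials):
    data = [rng.gauss(REAL_MEAN, REAL_SD) for _ in range(n)]
    est_mean = sum(data) / n
    ss_real = sum((x - REAL_MEAN) ** 2 for x in data)  # about the real mean
    ss_est = sum((x - est_mean) ** 2 for x in data)    # about the estimate
    total_real += ss_real / n
    total_est_n += ss_est / n
    total_est_n1 += ss_est / (n - 1)

mean_sq_real = total_real / trials
mean_sq_est_n = total_est_n / trials
mean_sq_est_n1 = total_est_n1 / trials

print("real squared spread       :", REAL_SD ** 2)
print("about real mean, / n      :", round(mean_sq_real, 2))
print("about estimated mean, / n :", round(mean_sq_est_n, 2))
print("about estimated mean, /n-1:", round(mean_sq_est_n1, 2))
```

With n = 5 the "estimated mean, divide by n" figure should come out near 4/5 of the real value, while the other two should both land close to the real value of 16.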

Longer Explanation

For this, one has to go back two levels: back beyond the use of standard deviation and the derivation of the formulae and think why it is used at all. As with all applied mathematics, there is nothing intrinsically magical about the definitions. They are simply chosen because they are useful. Neither standard deviations as error estimations nor means as averages are the only options available for their tasks. However their combination of ease of calculation, ease of algebraic manipulation & easily understandable connection to reality makes them so popular & ubiquitous that many users do not realise there are alternatives.

People want to predict things. That is what all science, engineering & indeed any human thought & intention is for. People want. They want things to happen, want states of mind, want to know, even sometimes want not to want. To get what they want they need to perform actions (even if those actions are purely mental as in some spiritual philosophies or the results are expected away from the physical universe as in some religious philosophies). But what action to perform? There are many to choose from. This choice needs prediction. If one can predict the probable outcome of actions then one has a basis for choosing between actions to get what one wants.

The normal way of prediction is looking for rules that have applied well in the past and assuming they will work into the future. It is true that the standard proof of such a method, that it has worked well in the past, is circular but at least it is not self-contradictory. It also has the advantage of simplicity (which, whilst maximal simplicity may be difficult to prove as a fundamental requirement of theories, is obviously useful if one is to make use of the resulting theories). Unlike its only common rival as a basis for choosing actions, "gut feeling", it has the ability to make unambiguous predictions and can be systematically refined as needed. At this point it is tempting to go into a discussion of the overlap of the two, whereby the past-rules method uses gut feeling for generating a priori probabilities for hypotheses and the gut-feeling method is largely past-rule based anyway because brains are naturally brilliant at finding & acting on correlations in past data, but I am probably losing some engineer readers with this metaphysics so I will get back to the maths & lose the social scientists instead! Anyway, science is just the systematic refinement of the process of finding rules from past data to predict the future to decide on actions so that people get what they want. Finding those rules needs many tools and one of those tools is the maths of statistics, of which averaging is probably the most common action and estimating uncertainties not far behind.

There is a big intrinsic problem here: we are trying to find general rules but we only have specific examples, the data from our measurements. Let's start with a simpler case by pretending we already have access to the rule.

In this case the rule is that there is a distribution of possible values that can come out of a measurement with some particular probability of getting each one. The full rule, a function relating each value to the probability of getting it, is too much hassle for us so we will simplify it to a rule with just two parameters: an indication of the most likely value (the average) & an indication of how far one can expect values typically to spread from the most likely value (the deviation). Yes, this is losing a huge amount of detail but if one wants to quantify it with more than two parameters one can; I am sticking with two here because this essay is about standard deviation and two parameters is the simplest case to show it in.

There are many options for both the average & the deviation. The commonest ones are the 'mean' (add up all the possible values weighted by probability) for the average & the 'standard deviation' (add up the squares of all the possible differences from the mean weighted by probability then square root) for the deviation because they work well (i.e. usefully) in most situations and are mathematically easy. By mathematically easy, I do not necessarily mean that they are easy to calculate. Alternative averages called the 'mode' (the value with highest probability) & 'median' (the value in the middle of the probability distribution) are easier to calculate in some cases, as can be taking the average unsigned value of the difference instead of squaring & square rooting for the deviation. Instead I mean that the algebra for manipulating the formulae is easier. This enables there to be lots of extra tricks, uses, features etc. if one uses the mean & standard deviation. Typically for algebra, in contrast to computer calculations, things work more easily if there are no sudden cut-offs or changes in the action of the formulae. For example the sharp change of negating all the negative differences but not the positive differences in the alternative deviation measure above actually makes the algebra for manipulating the formula much messier than the smooth squaring & square rooting method of removing the signs in the standard deviation.
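As a concrete illustration of these options, the sketch below takes a made-up rule (a small table of values and their probabilities, purely hypothetical) and works out the mean, standard deviation, average unsigned difference, mode & median exactly as described in words above.

```python
import math

# Hypothetical rule: value -> probability (the probabilities sum to 1).
rule = {1: 0.1, 2: 0.2, 3: 0.4, 4: 0.2, 5: 0.1}

# Mean: add up all the possible values weighted by probability.
mean = sum(v * p for v, p in rule.items())

# Standard deviation: probability-weighted squared differences, then root.
std_dev = math.sqrt(sum((v - mean) ** 2 * p for v, p in rule.items()))

# Alternative deviation: probability-weighted unsigned differences.
mean_abs_dev = sum(abs(v - mean) * p for v, p in rule.items())

# Mode: the value with the highest probability.
mode = max(rule, key=rule.get)


def median_of(rule):
    """First value at which the running total of probability reaches 1/2."""
    cumulative = 0.0
    for v in sorted(rule):
        cumulative += rule[v]
        if cumulative >= 0.5:
            return v


median = median_of(rule)
print(mean, std_dev, mean_abs_dev, mode, median)
```

For this symmetric rule the mean, mode & median all agree at 3, but the two deviation measures differ (about 1.10 against 0.8), which is a reminder that they are different definitions rather than two routes to the same number.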

An example of the utility is the most commonly used probability distribution, the 'Gaussian' or 'Normal' distribution. The Gaussian distribution is so common because it is the only distribution shape with a finite standard deviation that stays the same, just rescaling, when convolved with itself. Hence if the distribution resulting from the combined effect of many independent distributions from different factors is going to tend to anything, it will generally tend to a Gaussian. A Gaussian distribution can be completely specified by just its mean & its standard deviation. A commonly used corollary of this is that, for a Gaussian distribution, there is approximately a 2/3 probability of getting a value within 1 standard deviation of the mean, 19/20 within 2 & 399/400 within 3. It is not magical that it works so nicely. It is the result of sticking to easy maths so that there are plenty of opportunities for routes to coincide so usefully. Indeed the link to the Gaussian is so useful that people often forget that its results don't always apply and use them as unthinkingly as they do means & standard deviations. For example many biological scientists aim to get their experiments to give results with >95% confidence limits and psychological researchers need to confirm their findings on at least 20 people to satisfy journal referees for publication. Many do this without realising, or caring, that the standards they work almost religiously to were really just the 2 standard deviation positions on Gaussian distributions, chosen for convenience in the days before electronic calculators, and they even arbitrarily apply them to distributions that are obviously not Gaussian.
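Those fractions are easy to check. For a Gaussian, the probability of landing within k standard deviations of the mean is erf(k/sqrt(2)), where erf is the standard error function available as math.erf in Python. The helper name below is just for illustration.

```python
import math


def prob_within(k):
    """Probability that a Gaussian value lies within k standard
    deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

This gives roughly 0.6827, 0.9545 & 0.9973, so 2/3, 19/20 & 399/400 are convenient roundings rather than exact values.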

So we now have the mean & standard deviation as simple useful measures of the distribution. Now to remove the assumption of a known distribution. From measurements we don't have the distribution, only n values sampled randomly from the distribution. With these data we can certainly use the same formulae to calculate a mean & standard deviation for the data but what is usually really required is the mean & standard deviation for the distribution in the underlying rule. This subtlety is normally ignored. Actually, one cannot literally get the required values but one can estimate them from the data. To generate formulae for these estimations needs some moderately involved algebra but it is simple to get an idea of what they should be. As more & more data is collected, it will become more & more unlikely that the distribution of data values will tend to anything other than a duplicate of the probability distribution of the underlying rule. Therefore, for sufficiently large data sets, one can simply use the mean & standard deviation calculated from the data as estimates of the mean & standard deviation of the underlying rule. That is the same result as if one had sloppily ignored the distinction between the measurements & the hidden rule in the first place.

What if one has less data and so needs the full versions of the formulae, not the limiting cases? I am not going to type a load of maths here in HTML (see a maths textbook like Mathews & Walker for the details) but here is an outline of the argument for the simple Gaussian case. Start by calculating the probability of getting a particular measurement value, leaving everything as algebraic variables so one has an equation in terms of the value, the rule mean & the rule standard deviation. For a Gaussian distribution, this will be of the form of a constant raised to the negative power of the square of the ratio of the difference of the value from the mean to the standard deviation. Now work out the probability of getting a whole set of n data values by multiplying n such terms together. The most probable values of the mean & standard deviation can then be found by maximising this probability as a function of the rule mean & rule standard deviation, with the n data values held fixed. (Pedantic note: Philosophers may object to me casually switching the direction of the probability calculation here because I am implicitly assuming equal a priori probabilities for the possible mean & standard deviation values but some hypothesis is needed & this is the simplest.) The maximisation is eased by grouping all the exponents into one then taking logs so the only thing left to maximise is a polynomial of order 2, which is trivially handled by differential calculus. The results, once rearranged, are formulae for the most likely rule mean & standard deviation. The mean one is identical to that for the data. The standard deviation one is almost the same as the one for the data except that the denominator in the averaging is not n but n-1. The 1 is lost because the rule mean is itself unknown and has to be allowed for; maximising over both parameters at once, as if the estimated mean were exact, would leave the n in place.
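For those who do want to see some of the maths, here is a hedged sketch of that outline in symbols. It is one way of filling in the steps, not a transcription of the textbook derivation. The symbols are choices made here for illustration: x_i are the n data values, \mu & \sigma are the rule mean & standard deviation, \bar{x} is the data mean and S is the sum of squared differences from \bar{x}. The 1/\sigma normalising factor of each Gaussian is written out explicitly, and equal a priori probabilities for \mu & \sigma are assumed, as in the pedantic note.

```latex
\begin{align*}
P(x_1,\dots,x_n \mid \mu,\sigma)
  &= \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}
     \exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right) \\
\ln P
  &= -n\ln\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2 + \text{const} \\
\frac{\partial \ln P}{\partial \mu} = 0
  &\;\Rightarrow\; \hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \\
\sum_{i=1}^{n}(x_i-\mu)^2
  &= S + n(\bar{x}-\mu)^2, \qquad S = \sum_{i=1}^{n}(x_i-\bar{x})^2 \\
\int_{-\infty}^{\infty} P \, d\mu
  &\propto \sigma^{-(n-1)} \exp\!\left(-\frac{S}{2\sigma^2}\right) \\
\frac{\partial}{\partial \sigma}
  \left[-(n-1)\ln\sigma - \frac{S}{2\sigma^2}\right] = 0
  &\;\Rightarrow\; \hat{\sigma}^2 = \frac{S}{n-1}
\end{align*}
```

The fifth line is where the n turns into n-1: allowing for every possible value of the unknown rule mean (the integration over \mu) uses up one power of \sigma. Setting the derivative of \ln P with respect to \sigma to zero with \mu simply fixed at \bar{x}, without that integration, would give S/n instead.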

The distribution in the hidden underlying rule is conventionally called the "population" & the collected data the "sample" so the two formulae for the standard deviation are called the "population standard deviation" (the n version) & the "sample standard deviation" (the n-1 version) respectively.

Addendum: Divide by Square Root of n?

Whilst I am at it, there is another related thing that causes great confusion. Does one divide the standard deviation by the square root of n or not? Sometimes one sees this done and sometimes not. Which is correct depends on what the standard deviation applies to. It is a point that is often ignored but "standard deviation" can be used for several purposes & there is a third common one in addition to the two above. The two above were both the standard deviations of values, whether recorded data or probabilities in the underlying rule, about the mean. The third one is the "standard deviation of the mean" and is a measure not of how the values are distributed about a mean but of how one mean, the one from the data, is distributed about another, the hidden one in the rule. Of course, it is rather meaningless to ask for the distribution of a single item but if one did the whole test multiple times then one would get slightly different values for the mean of the data each time and each can be treated as an estimate for the mean in the underlying rule with a random inaccuracy in each. It is actually possible to estimate that standard deviation without having to collect multiple data sets by simply dividing the sample standard deviation by the square root of n.
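The "do the whole test multiple times" thought experiment is cheap to run in simulation. The sketch below assumes a hypothetical Gaussian rule (mean 20, standard deviation 3), repeats a 25-value experiment many times and compares the actual scatter of the data means about the rule mean with the single-experiment estimate obtained by dividing the sample standard deviation by the square root of n.

```python
import math
import random
import statistics

rng = random.Random(2)            # fixed seed so the run is repeatable
RULE_MEAN, RULE_SD = 20.0, 3.0    # hypothetical hidden rule
n, repeats = 25, 4000

means = []       # the data mean from each repeat of the experiment
estimates = []   # each repeat's own estimate: sample sd / sqrt(n)
for _ in range(repeats):
    data = [rng.gauss(RULE_MEAN, RULE_SD) for _ in range(n)]
    means.append(statistics.fmean(data))
    estimates.append(statistics.stdev(data) / math.sqrt(n))

# How the data means actually scatter about the hidden rule mean.
spread_of_means = statistics.pstdev(means, mu=RULE_MEAN)
# What a typical single experiment would have estimated for that scatter.
typical_estimate = statistics.fmean(estimates)
# What the rule itself says it should be.
expected = RULE_SD / math.sqrt(n)

print(round(spread_of_means, 3), round(typical_estimate, 3), expected)
```

All three numbers should come out close to 0.6, i.e. a fifth of the rule's standard deviation of 3, because n is 25 here.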

This also resolves the conflict between the intuitive feeling that the inaccuracy represented by the standard deviation should decrease as one gets more data and the feeling that the standard deviation, being the width of the underlying distribution, should be a constant. Both are correct because those are standard deviations of two different things that just happen to be calculated from the same data and are usually sloppily given the same name.

One can derive the formulae by churning through the same maths as before but doing it this time for the probability of the mean of a sample of the distribution being a certain value (yes, it is messier maths) & out pops the answer that the probability of the mean also forms a Gaussian & its standard deviation is the same formula as before but with that extra division by the square root of n.

Whereas the difference between the other two standard deviations is normally negligible, this one is normally much smaller than the other two so you do have to remember whether you are using the standard deviation of the mean or not.

References

J.Mathews, R.L.Walker, 1970, Mathematical Methods of Physics; Second Edition, pub. Addison-Wesley, ISBN 0-8053-7002-1.