Standard deviations are so often calculated when averaging data that
functions for them have been standard features of scientific calculators for
years but there is, confusingly, a choice of two to use. On the calculator
buttons, these are typically labelled "*σ_{n}*" and "*σ_{n-1}*". Which one
should you use & why are there two?

The first question is easier to answer. If you know the mean value of the
data from somewhere else, use the *n* version but if you are calculating
the mean value of the data from the data itself (by summing the data &
dividing by *n* or using the button on the calculator) use the *n-1*
version. Normally with a calculator you will be finding averages and standard
deviations from a set of measurements so the *n-1* one is the one to use.
Frankly, if you have lots of data (e.g. 10 or more values) and are using standard
deviations for their normal use of "standard error" reporting then
just use either because the difference between them is negligible compared to
other inaccuracies in calculating standard error values.
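
As a rough illustration of how small the difference is, here is a minimal Python sketch using the standard library's `statistics` module, where `pstdev` divides by *n* & `stdev` divides by *n-1*. The readings are invented purely for the example.

```python
import math
import statistics

# Hypothetical measurements, invented purely for illustration.
readings = [9.8, 10.1, 10.0, 9.7, 10.4, 10.2, 9.9, 10.3, 10.0, 9.6]
n = len(readings)

sd_n = statistics.pstdev(readings)          # divides by n   (the "n" button)
sd_n_minus_1 = statistics.stdev(readings)   # divides by n-1 (the "n-1" button)

# The two versions always differ by exactly the factor sqrt(n / (n-1)).
ratio = sd_n_minus_1 / sd_n
print(f"n version:   {sd_n:.4f}")
print(f"n-1 version: {sd_n_minus_1:.4f}")
print(f"ratio:       {ratio:.4f}  (sqrt(n/(n-1)) = {math.sqrt(n / (n - 1)):.4f})")
```

For 10 values the two differ by about 5%, and the gap shrinks as *n* grows.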

Okay, so that is which one to use but why?

When the mean is calculated from the *n* data, there are only
*n-1* degrees of freedom left to calculate the spread of the data. It
would be cheating to use the other degree of freedom again.

Essentially, the mean used is not the real mean but an estimate of the mean
based on samples of data found in the experiment. This estimate is, of course,
biased towards fitting the found data because that is how you got it. Therefore
estimating how widely spread around the real mean the data samples are by
calculating how spread around the estimated mean they are is going to give a
value biased slightly too low. Using *n-1* instead of *n*
compensates.
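
The bias can be checked exactly on a toy case. The sketch below assumes a made-up "real" distribution (the values 0 and 1, equally likely) and averages over every possible sample of 3 values. It works with the squared spread (the variance) so that the arithmetic stays in exact fractions; the function name and the choice of distribution are illustrative only.

```python
from fractions import Fraction
from itertools import product

# A toy "real" distribution, assumed for illustration: the values 0 and 1,
# equally likely. Its real mean is 1/2 and its real variance is 1/4.
values = [Fraction(0), Fraction(1)]
true_variance = Fraction(1, 4)
n = 3

def spread_about_own_mean(sample, divisor):
    """Sum of squared differences from the sample's own mean, over divisor."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / divisor

# Every possible sample of size n is equally likely here, so a plain average
# over all of them is the exact expected value.
samples = list(product(values, repeat=n))
avg_with_n = sum(spread_about_own_mean(s, n) for s in samples) / len(samples)
avg_with_n_minus_1 = (
    sum(spread_about_own_mean(s, n - 1) for s in samples) / len(samples)
)

print(avg_with_n)          # 1/6, lower than the real variance of 1/4
print(avg_with_n_minus_1)  # 1/4, matching the real variance
```

Dividing by *n* comes out too low on average, by the factor (*n-1*)/*n*, and dividing by *n-1* removes that shortfall for the variance.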

For this, one has to go back two levels; back beyond the use of standard deviation and the derivation of the formulae and think about why it is used at all. As with all applied mathematics, there is nothing intrinsically magical about the definitions. They are simply chosen because they are useful. Neither standard deviations as error estimations nor means as averages are the only options available for their tasks. However, their combination of ease of calculation, ease of algebraic manipulation & easily understandable connection to reality makes them so popular & ubiquitous that many users do not realise there are alternatives.

People want to predict things. That is what all science, engineering & indeed any human thought & intention is for. People want. They want things to happen, want states of mind, want to know, even sometimes want not to want. To get what they want they need to perform actions (even if those actions are purely mental as in some spiritual philosophies or the results are expected away from the physical universe as in some religious philosophies). But what action to perform? There are many to choose from. This choice needs prediction. If one can predict the probable outcome of actions then one has a basis for choosing between actions to get what one wants.

The normal way of prediction is looking for rules that have applied well in
the past and assuming they will work into the future. It is true that the standard
proof of such a method, that it has worked well in the past, is circular but at
least it is not self-contradictory. It also has the advantage of simplicity
(which, whilst maximal simplicity may be difficult to prove as a fundamental
requirement of theories, is obviously useful if one is to make use of the
resulting theories). Unlike its only common rival as a basis for choosing
actions, "gut feeling", it has the ability to make unambiguous
predictions and can be systematically refined as needed. At this point it is
tempting to go into a discussion of the overlap of the two whereby the
past-rules method uses gut-feeling for generating *a priori* probabilities
for hypotheses and the gut-feeling method is largely past-rule based anyway
because brains are naturally brilliant for finding & acting on correlations
in past data but I am probably losing some engineer readers with this
metaphysics so I will get back to the maths & lose the social scientists!
Anyway; science is just the systematic refinement of the process of finding
rules from past data to predict the future to decide on actions so that people
get what they want. Finding those rules needs many tools and one of those tools
is the maths of statistics of which averaging is probably the most common
action and estimating uncertainties not far behind.

There is a big intrinsic problem here: we are trying to find general rules but we only have specific examples, the data from our measurements. Let's start with a simpler case by pretending we already have access to the rule.

In this case the rule is that there is a distribution of possible values that can come out of a measurement with a probability of getting each one. The full rule, a function relating each value to the probability of getting it, is too much hassle for us so we will simplify it to a rule with just two parameters: an indication of the most likely value (the average) & an indication of how far one can expect values typically spread from the most likely value (the deviation). Yes, this is losing a huge amount of detail but if one wants to quantify it with more than two parameters one can; I am sticking with two here because this essay is about standard deviation and two parameters is the simplest case to show it in.

There are many options for both the average & the deviation. The commonest ones are the 'mean' (add up all the possible values weighted by probability) for the average & the 'standard deviation' (add up the squares of all the possible differences from the mean weighted by probability then square root) for the deviation because they work well (i.e. usefully) in most situations and are mathematically easy. By mathematically easy, I do not necessarily mean that they are easy to calculate. Alternative averages called the 'mode' (the value with highest probability) & 'median' (the value in the middle of the probability distribution) are easier to calculate in some cases, as can be taking the average unsigned value of the difference instead of squaring & square rooting for the deviation. Instead I mean that the algebra for manipulating the formulae is easier. This enables there to be lots of extra tricks, uses, features etc. if one uses the mean & standard deviation. Typically for algebra, in contrast to computer calculations, things work easier if there are no sudden cut-offs or changes in the action of the formulae. For example, the sharp change of negating all the negative differences but not the positive differences in the alternative deviation measure above actually makes the algebra for manipulating the formula much more messy than the smooth squaring-square-rooting method of removing the signs in the standard deviation.
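
To make the definitions concrete, here is a small Python sketch that applies them to a known rule. The fair six-sided die is an assumed example, not something from the discussion above, and the variable names are illustrative.

```python
import math

# A fair six-sided die as an assumed example of a fully known rule.
outcomes = [1, 2, 3, 4, 5, 6]
probabilities = [1 / 6] * 6

# Mean: add up all the possible values weighted by probability.
mean = sum(p * x for x, p in zip(outcomes, probabilities))

# Standard deviation: add up the squared differences from the mean weighted
# by probability, then take the square root.
variance = sum(p * (x - mean) ** 2 for x, p in zip(outcomes, probabilities))
std_dev = math.sqrt(variance)

# The alternative deviation: the average unsigned difference from the mean.
mean_abs_dev = sum(p * abs(x - mean) for x, p in zip(outcomes, probabilities))

print(f"mean                    = {mean:.4f}")
print(f"standard deviation      = {std_dev:.4f}")
print(f"mean absolute deviation = {mean_abs_dev:.4f}")
```

The two deviation measures give different numbers for the same rule (about 1.71 against 1.5 here), which is a reminder that the choice between them is one of convenience.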

An example of the utility is in the most common probability distribution, the 'Gaussian' or 'Normal' distribution. The Gaussian distribution is so common because it is the only distribution shape of finite variance that stays the same, just rescaling, when convolved with itself. Therefore, if the distribution resulting from the combined effect of many independent factors is going to tend to anything, it will tend to Gaussian. A Gaussian distribution can be completely specified by just its mean & standard deviation. A commonly used corollary of this is that, for a Gaussian distribution, there is approximately a 2/3 probability of getting a value within 1 standard deviation of the mean, 19/20 within 2 & 399/400 within 3. It is not magical that it works so nicely. It is the result of sticking to easy maths so that there are plenty of opportunities for routes to coincide so usefully. Indeed the link to Gaussian is so useful that people often forget that its results don't always apply and use them as unthinkingly as they do means & standard deviations. For example, many biological scientists aim to get their experiments to give results with >95% confidence limits, and psychological researchers need to confirm their findings on at least 20 people to satisfy journal referees for publication, without realising, or caring, that the standards they work almost religiously to were really just the 2 standard deviation positions on Gaussian distributions, chosen for convenience in the days before electronic calculators. They then arbitrarily apply them to distributions that are obviously not Gaussian.
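
The fractions quoted for the Gaussian are roundings, and the underlying values are easy to check. This sketch uses the standard identity that the probability of lying within *k* standard deviations of the mean is erf(*k*/√2); the function name `prob_within` is made up for the example.

```python
import math

def prob_within(k):
    """Probability that a Gaussian value lies within k standard deviations
    of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
```

This gives roughly 0.6827, 0.9545 & 0.9973, close to the 2/3, 19/20 & 399/400 rules of thumb.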

So we now have the mean & standard deviation as simple useful measures
of the distribution. Now to remove the assumption of a known distribution. From
measurements we don't have the distribution, only *n* values sampled
randomly from the distribution. With these data we can certainly use the
same formulae to calculate a mean & standard deviation for the data but
what is usually really required is the mean & standard deviation for the
distribution in the underlying rule. This subtlety is normally ignored. Actually,
one cannot literally *get* the required values but one can
*estimate* them from the data. To generate formulae for these
estimations needs some moderately involved algebra but it is simple to get an
idea of what they should be. As more & more data is collected, it will
become more & more unlikely that the distribution of data values differs
noticeably from the probability distribution of the
underlying rule. Therefore, for sufficiently large data sets, one can simply
use the mean & standard deviation calculated from the data as estimates of
the mean & standard deviation of the underlying rule. That is the same
result as if one had sloppily ignored the distinction between the measurements
& the hidden rule in the first place.

What if one has less data and so needs the full versions of the formulae not
the limiting cases? I am not going to type a load of maths here in HTML (see a
maths textbook like Mathews & Walker for the details) but here is an
outline of the argument for the simple Gaussian case. Start by calculating the
probability of getting a particular measurement value leaving everything as
algebraic variables so one has an equation in terms of the value, the rule mean
& rule standard deviation. This will be of the form of a constant raised to
the negative power of the square of the ratio of the difference of the value
from the mean to the standard deviation for a Gaussian. Now work out the
probability of getting a whole set of *n* data values by multiplying
*n* such terms together. The most probable values of the mean &
standard deviation can then be found by maximising the probability as a
function of them for the given *n* values. (Pedantic note: Philosophers may object to me
casually switching the direction of the probability calculation here because I
am implicitly assuming equal *a priori* probabilities for the possible
mean & standard deviation values but some hypothesis is needed & this
is the simplest.) The maximisation is eased by grouping all the exponents into
one then taking logs so that what is left to maximise is a simple expression,
quadratic in the mean, which is trivially handled by differential calculus. The
results, once rearranged, are formulae for the most likely rule mean & standard
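deviation. The mean one is identical to that for the data. For the standard
deviation, maximising over both parameters at once simply gives back the data
formula with *n*; it is once one allows for the mean being itself only an
estimate from the same data (by averaging over its uncertainty) that the
standard deviation comes out almost the same as the one for the data except
that the denominator in the averaging is not *n* but *n-1*.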
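
One way the outline above can be written out, as a sketch rather than the textbook's own derivation. The symbols (x_i for the data values, \bar{x} for their mean, \mu and \sigma for the rule mean & standard deviation) are introduced here for illustration, and integrating out the mean with equal *a priori* probabilities is an assumed route to the *n-1*.

```latex
% Probability of the n data values given the rule mean \mu and standard deviation \sigma
P(x_1,\dots,x_n \mid \mu,\sigma) \propto \sigma^{-n}
  \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right)

% Taking logs
\ln P = -n\ln\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2 + \text{const}

% Maximising over \mu gives the ordinary mean of the data
\frac{\partial \ln P}{\partial \mu} = 0
  \;\Rightarrow\; \hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

% Allowing for the uncertainty in \mu by integrating it out, using
% \sum_i (x_i-\mu)^2 = \sum_i (x_i-\bar{x})^2 + n(\mu-\bar{x})^2
\int P \,\mathrm{d}\mu \propto \sigma^{-(n-1)}
  \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\bar{x})^2\right)

% Maximising that over \sigma
\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2
```

The integral over \mu removes one power of \sigma, which is where the one "used up" degree of freedom shows in the algebra.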

The distribution in the hidden underlying rule is conventionally called the "population" & the collected data the "sample". The *n* formula, which is the exact one when the values in hand are the whole population, is therefore called the "population standard deviation" & the *n-1* formula, for when the values are only a sample, the "sample standard deviation".

Whilst I am at it, there is another related thing that causes great
confusion. Does one divide the standard deviation by the square root of
*n* or not? Sometimes one sees this done and sometimes not. Which is
correct depends on to what the standard deviation applies. It is a point
that is often ignored but standard deviations can be used for several purposes
& there is a third common one in addition to the two above. The two above
were both the standard deviations of values, whether recorded data or
probabilities in the underlying rule, about the mean. The third one is the
"standard deviation of the mean" and is a measure not of how the
values are distributed about a mean but how one mean, the one from the data, is
distributed about another, the hidden one in the rule. Of course, it is rather
meaningless to ask for the distribution of a single item but if one did the
test multiple times then one would get slightly different values for the mean
of data each time and each can be treated as an estimate for the mean in the
underlying rule with a random inaccuracy in each. It is actually possible to
estimate that standard deviation without having to collect multiple data sets
by simply dividing the sample standard deviation by the square root of *n*.
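
A minimal Python sketch of that last step, using the standard library's `statistics.stdev` for the *n-1* sample standard deviation. The readings are hypothetical, invented for the example.

```python
import math
import statistics

# Hypothetical measurements, invented purely for illustration.
readings = [9.8, 10.1, 10.0, 9.7, 10.4, 10.2, 9.9, 10.3, 10.0, 9.6]
n = len(readings)

mean = statistics.mean(readings)
sample_sd = statistics.stdev(readings)   # spread of the individual values (n-1)
sd_of_mean = sample_sd / math.sqrt(n)    # estimated spread of the mean itself

print(f"spread of values: {sample_sd:.3f}")
print(f"result:           {mean:.2f} +/- {sd_of_mean:.2f}")
```

With 10 readings the standard deviation of the mean is about a third of the sample standard deviation, so quoting the wrong one changes the stated uncertainty considerably.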

This also resolves the problem with the intuitive feeling that the inaccuracy represented by the standard deviation should decrease as one gets more data but that the standard deviation, being the width of the underlying distribution, should be a constant. Both are correct because those are standard deviations of two different things that just happen to be calculated from the same data and are usually sloppily given the same name.

One can derive the formulae by churning through the same maths as before but
doing it this time for the probability of the mean of a sample from the
distribution being a certain value (yes, it is messier maths) & out pops the
answer that the probability of the mean also forms a Gaussian & its
standard deviation is the same formula as before but with that extra division
by the square root of *n*.
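
For the width alone there is a shortcut that avoids the full probability calculation. It is a sketch assuming only that the *n* values are independent draws from the same rule with standard deviation \sigma; the symbols are introduced here for illustration.

```latex
% Variances of independent values add, and scaling a value by 1/n scales
% its variance by 1/n^2
\operatorname{Var}(\bar{x})
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)
  = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(x_i)
  = \frac{\sigma^2}{n}

% so the standard deviation of the mean is
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
```

The division is by *n* for the variance & by the square root of *n* for the standard deviation.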

Whereas the difference between the other two standard deviations is normally negligible, this one is normally much smaller than the other two so you do have to remember whether you are using the standard deviation of the mean or not.

J.Mathews, R.L.Walker, 1970, Mathematical Methods of Physics; Second Edition, pub. Addison-Wesley, ISBN 0-8053-7002-1.