Thursday, 7 March 2013

Measures of variability

Actually: estimates of the dispersion of a distribution.  Everyone knows the standard deviation:
sigma = sqrt(1/N * sum((x_i - mean)^2))
That's pretty much the one everyone uses.  

Another strategy is to use the interquartile distance:
IQD = x_75% - x_25%
For a Gaussian distribution (which is the assumption the standard deviation makes), the cumulative distribution function of the normal distribution puts the quartiles at about +/-0.6745 sigma, so IQD ~= 1.349 * sigma.  Inverting that gives you a "sigma" from the IQD:
sigma_IQD ~= 0.7413 * IQD
This measure is somewhat more robust against outliers, as you can pile a bunch of points outside the inner half of the distribution, and they won't change the IQD much.  However, if your outliers are biased so that they all sit on one side (say, all greater than the median of the distribution you care about), then the IQD can only tolerate about 25% of your points being outliers.  Beyond that, the upper quartile itself lands among the outliers.
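As a rough sketch of where the 0.7413 comes from and how the estimate could be computed, here is a small Python version that uses only the standard library.  The names IQD_TO_SIGMA and sigma_iqd are just illustrative, and the "inclusive" quantile method is an assumption on my part; other quartile conventions give slightly different answers for small samples.

```python
from statistics import NormalDist, quantiles

# For a unit Gaussian the quartiles sit at +/- z75, so IQD = 2 * z75 * sigma.
Z75 = NormalDist().inv_cdf(0.75)       # about 0.6745
IQD_TO_SIGMA = 1.0 / (2.0 * Z75)       # about 0.7413


def sigma_iqd(samples):
    """Estimate a Gaussian-equivalent sigma from the interquartile distance."""
    # quantiles(..., n=4) returns the 25%, 50%, and 75% cut points.
    q25, _, q75 = quantiles(samples, n=4, method="inclusive")
    return IQD_TO_SIGMA * (q75 - q25)
```

Calling sigma_iqd on a list of samples drawn from a clean Gaussian should return something close to the true width.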

One way to fix this is to use the median absolute deviation:
MAD = median( abs(x_i - median(x_i)))
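For a symmetric distribution (Gaussian, or a Gaussian + unbiased outlier distribution), the quartiles sit at equal distances d on either side of the median.  Half of the points then have abs(x_i - median) <= d, so MAD = d while IQD = 2 * d, which gives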
MAD_symmetric = 0.5 * IQD_symmetric
However, if the outliers are biased, MAD is able to tolerate up to 50% outliers before it really breaks down.  As above, you can create an effective sigma:
sigma_MAD ~= 1.4826 * MAD
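A minimal sketch of the MAD-based sigma in Python, again with only the standard library.  The names MAD_TO_SIGMA and sigma_mad are illustrative, not from any particular package.

```python
from statistics import NormalDist, median

# For a unit Gaussian, half the points lie within z75 of the median,
# so MAD = z75 * sigma and sigma = MAD / z75.
MAD_TO_SIGMA = 1.0 / NormalDist().inv_cdf(0.75)   # about 1.4826


def sigma_mad(samples):
    """Estimate a Gaussian-equivalent sigma from the median absolute deviation."""
    center = median(samples)
    mad = median([abs(x - center) for x in samples])
    return MAD_TO_SIGMA * mad
```

For example, sigma_mad([1, 2, 3, 4, 100]) has a MAD of 1, so the wild point at 100 leaves the estimate at about 1.48, while the ordinary standard deviation of the same list is dominated by that one point.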

Finally, you can histogram all the data, and do a Gaussian fit to that histogram to determine the width.  This has the problem that you need to have a decent number of samples before the fit will converge well.  It also can be more computationally expensive, since you have to fit a function.

OK, now let's look at a bunch of plots.  For these, I created 10000 random samples, with some fraction "outliers."  The real distribution is a Gaussian of mean 0 and stddev 1.0.  The outliers are drawn from a flat distribution from -5 to 5 (for the unbiased case) or from 0 to 10 (for the biased case).
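The test data described above could be generated along these lines.  This is only a sketch under assumptions: the function name contaminated_sample is made up, and it treats each point as an outlier with probability outlier_fraction rather than forcing an exact outlier count, which may differ from how the plotted samples were actually built.

```python
import random
from statistics import pstdev


def contaminated_sample(n, outlier_fraction, biased, seed=0):
    """Draw n points from a Gaussian(0, 1) core mixed with flat outliers.

    Outliers are uniform on [0, 10] in the biased case and on [-5, 5]
    in the unbiased case.  Each point is independently an outlier with
    probability outlier_fraction (an assumption of this sketch).
    """
    rng = random.Random(seed)
    lo, hi = (0.0, 10.0) if biased else (-5.0, 5.0)
    return [
        rng.uniform(lo, hi) if rng.random() < outlier_fraction
        else rng.gauss(0.0, 1.0)
        for _ in range(n)
    ]


# Plain standard deviation on a clean sample and on a 20% biased one.
clean = contaminated_sample(10000, 0.0, biased=False)
dirty = contaminated_sample(10000, 0.2, biased=True)
print(pstdev(clean), pstdev(dirty))
```

Feeding the same samples to an IQD- or MAD-based sigma, over a range of outlier fractions, gives curves of the kind compared in the plots.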
[Figure: two panels, "Biased" and "Unbiased", N = 10000]
From these, you can see that:

  1. For the biased outlier distribution, IQD fails around 25% contamination.  MAD is better until around 50%.  Histogram fits are generally better, but fail catastrophically around 60%.
  2. For the unbiased outliers, IQD and MAD are identical, as expected from MAD = 0.5 * IQD for a symmetric distribution.  Histograms are still better.
  3. Standard deviation is universally bad when any outliers are present.
[Figure: two panels, "Biased" and "Unbiased", N = 100]
This is the same set of plots, now done for N = 100.  Histograms are now good only in the unbiased case with less than 50% contamination.  For the biased case, they're not better than the standard deviation above 20% contamination.

So that's largely why I like using a sigma based on the MAD.  It's pretty much universally better than the regular standard deviation when outliers are present.  Fitting Gaussians to histograms can do a better job, but that uses more complicated math, so it may not be useful if you don't have fitting code ready to go.