Wednesday, 15 January 2014

I am irrationally concerned with good statistics

OK, statistics again.  The problem is that I saw this article today, which basically complains that "no one really means to use standard deviation, as people intrinsically want to use the mean absolute deviation," which is, of course, completely dumb.

First, no one would ever do mean absolute deviation in their head.  Here are some numbers: {-1 2 3 -5 1 400}.  If you had to guess another number that would belong to this set, you're going to guess like "dunno, zero maybe?"  You know that 400 is probably wrong, so you cut it out.  People don't do real means when they filter data.  It's some combination of a mode and median.  Choose a number that doesn't seem crazy.
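To put numbers on that, here's a quick sketch using the set above and nothing but Python's standard library.  The variable names are mine, and this only illustrates how far one outlier drags the mean-based stats compared to the median-based ones.

```python
import statistics

samples = [-1, 2, 3, -5, 1, 400]

# Central values: the mean gets dragged toward 400, the median does not.
mean = statistics.mean(samples)
median = statistics.median(samples)

# Spread about each central value.
mean_ad = statistics.mean([abs(x - mean) for x in samples])        # mean absolute deviation
median_ad = statistics.median([abs(x - median) for x in samples])  # median absolute deviation
sigma = statistics.pstdev(samples)                                 # population standard deviation

print(f"mean = {mean:.2f}, median = {median}")
print(f"mean abs dev = {mean_ad:.2f}, median abs dev = {median_ad}, sigma = {sigma:.2f}")
```

The median lands at 1.5, which is about what you'd guess by eye, while the mean lands near 67, a value that nothing in the set is anywhere close to.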

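Second, for a Gaussian this mean absolute deviation is sqrt(2/pi) sigma, or about 0.80 sigma, which tells you about where the 57% point falls (the 50% point is the median absolute deviation, at about 0.67 sigma).  Why that point?  The standard deviation is more inclusive, as it tells you that most (68.27%) samples fall within one sigma of the central value.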
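If you want to check which fraction of a Gaussian each of these scales encloses, here is a small sketch.  It assumes a pure Gaussian, and the function name is made up for this example.

```python
import math
from statistics import NormalDist

def gaussian_fraction_within(k):
    """Fraction of a Gaussian that lies within k sigma of the mean."""
    return math.erf(k / math.sqrt(2))

# Each scale expressed in units of sigma, for a pure Gaussian.
mean_ad_in_sigma = math.sqrt(2 / math.pi)          # E|x - mu| / sigma
median_ad_in_sigma = NormalDist().inv_cdf(0.75)    # half the samples lie within this distance

for name, k in [("sigma", 1.0),
                ("mean abs dev", mean_ad_in_sigma),
                ("median abs dev", median_ad_in_sigma)]:
    print(f"{name:>14}: {k:.4f} sigma encloses {100 * gaussian_fraction_within(k):.2f}%")
```

The median absolute deviation encloses exactly half the samples by construction.  The mean absolute deviation ends up a bit above that, and one sigma encloses the most of the three.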

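Third, all that obvious stuff about moments analysis: the variance is the second central moment, so variances of independent variables simply add, and the mean absolute deviation has no such property.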

Anyway, time for plots.  These are the same idea as the ones from the previous post, just remade with more samples and different stats.  The horizontal lines are the true uncontaminated distribution sigma and the true fully contaminated sigma (sigma_uniform = sqrt((b - a)^2 / 12), because math).

First thing to note:  the actual sigma moves cleanly between the two extremes, as it really should.  Gaussian fits are best, but IQD (interquartile distance) and MAD (median absolute deviation) are comparable up to the 50% contamination point.  MeanAD (mean absolute deviation) doesn't seem particularly good.  The full contamination end is biased, as I'm using a parametric model (that it's a Gaussian distribution).

Biased samples.  This nicely shows that IQD fails before MAD, and that Gaussian fits are reasonable up to 60% contamination.  MeanAD is again off kind of doing its own thing.  Median >>> mean for outlier rejection.
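
For anyone who wants to poke at this without the plots, here is a rough sketch of the same kind of experiment.  Everything specific in it is an assumption of mine and not the setup behind the plots: a unit Gaussian, uniform contamination on [-10, 10] centered on the Gaussian, 5000 samples per point, and made-up function names.  It also skips the Gaussian fit and only compares the closed-form estimators, each scaled so that it returns sigma for a pure Gaussian.

```python
import math
import random
import statistics

def sigma_estimates(samples):
    """Several spread estimates, each scaled to equal sigma for a pure Gaussian."""
    mean = statistics.fmean(samples)
    median = statistics.median(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4)  # quartile cut points
    return {
        "sigma": statistics.pstdev(samples),
        "IQD": (q3 - q1) / 1.349,
        "MAD": 1.4826 * statistics.median([abs(x - median) for x in samples]),
        "MeanAD": math.sqrt(math.pi / 2) * statistics.fmean([abs(x - mean) for x in samples]),
    }

def contaminated_sample(n, fraction, a=-10.0, b=10.0, rng=random):
    """n draws: uniform on [a, b] with probability `fraction`, otherwise N(0, 1)."""
    return [rng.uniform(a, b) if rng.random() < fraction else rng.gauss(0.0, 1.0)
            for _ in range(n)]

a, b = -10.0, 10.0                             # assumed contamination range
sigma_uniform = math.sqrt((b - a) ** 2 / 12)   # fully contaminated sigma
rng = random.Random(0)                         # fixed seed so reruns match

print(f"sigma_uniform = {sigma_uniform:.3f}")
for fraction in (0.0, 0.1, 0.3, 0.5, 0.7, 1.0):
    est = sigma_estimates(contaminated_sample(5000, fraction, a, b, rng))
    print(fraction, {name: round(value, 2) for name, value in est.items()})
```

With centered contamination like this the median-based estimators should stay near 1 for longer than sigma and MeanAD do.  Shifting the uniform range off-center would be the analog of the biased case.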
