Tuesday, 19 November 2013

Histogram binsize

Given a list of values you want to take a histogram of, what is the best binsize to use?  For data that's basically flat, you can just set the end points at the minimum and maximum values, and choose how many points you want per bin.  Assuming that the noise per bin is basically Poissonian, the S/N you want for the bins fixes the average number of points k you need per bin, i.e. k = (S/N)^2:
S/N = sqrt(k)
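
As a rough sketch of the flat case (the name flat_binsize is just made up for illustration, and it assumes the data really is close to uniform between its minimum and maximum):

```python
def flat_binsize(values, snr):
    """Bin width giving about snr**2 points per bin for roughly flat data."""
    k = snr ** 2                             # Poisson noise: S/N = sqrt(k)
    n_bins = max(1, int(len(values) // k))   # whole number of bins, at least one
    return (max(values) - min(values)) / n_bins
```

For example, about 1000 evenly spread values with a target S/N of 10 need k = 100 points per bin, so you get 10 bins across the range.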

That's the easy case.  How do you do the same thing if the data is drawn from a Gaussian?  It's basically the same issue, except you need to choose where you want to optimize the S/N.  For the case from work, we largely ignore everything outside of 2\sigma, so I've used that as the critical point.  Now, for N total data points, how large a bin do you need centered on the critical point to achieve that S/N value?  Conveniently, this is just an exercise in error functions (with critical and binsize both measured in units of \sigma from the mean):
k/N = normcdf(critical + binsize/2.0) - normcdf(critical - binsize/2.0)
k/N = 0.5 * (erf((critical + binsize/2.0) / sqrt(2.0)) - erf((critical - binsize/2.0) / sqrt(2.0)))
So, you choose the S/N value for the critical point, which gives k = (S/N)^2 and hence the k/N value, and then find the binsize that achieves that fraction (such as via the following figure).
This is for critical = 2\sigma.
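
If you'd rather not read the binsize off a plot, the same thing can be solved numerically.  This is only a sketch: the function names, the bisection approach, and the max_width cap are my own choices for illustration, and critical and binsize are assumed to be in units of \sigma.  It relies on normcdf(x) = 0.5 * (1 + erf(x / sqrt(2))).

```python
import math

def normcdf(x):
    """Standard normal CDF written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bin_fraction(binsize, critical=2.0):
    """Expected fraction k/N of points in a bin centered on `critical`."""
    return normcdf(critical + binsize / 2.0) - normcdf(critical - binsize / 2.0)

def gaussian_binsize(snr, n_total, critical=2.0, max_width=20.0, tol=1e-10):
    """Bisect for the binsize whose expected count is k = snr**2."""
    target = snr ** 2 / n_total              # required k/N
    if target >= bin_fraction(max_width, critical):
        raise ValueError("not enough data points for that S/N")
    lo, hi = 0.0, max_width
    # bin_fraction grows monotonically with binsize, so bisection is safe.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bin_fraction(mid, critical) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Calling gaussian_binsize(10, 10000) asks for S/N = 10 at 2\sigma with 10000 points, i.e. a bin that holds 1% of the data.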