Tuesday, 19 November 2013

Histogram binsize

Given a list of values you want to take a histogram of, what is the best binsize to use?  For data that's basically flat, you can just set the end points at the minimum and maximum values and choose how many points you want per bin.  Assuming that the noise per bin is basically Poissonian, you can pick the S/N you want for the bins, which sets the average number of points per bin, k:
S/N = sqrt(k)

That's the easy case.  How do you do the same thing if the data is drawn from a Gaussian?  It's basically the same kind of issue, except you need to choose where you want to optimize the S/N.  For the case from work, we largely ignore everything outside of 2\sigma, so I've used that as the critical point.  Now, for N total data points, how large a bin do you need centered on the critical point to achieve that S/N value?  Conveniently, this is just an exercise in error functions:
k/N = normcdf(critical + binsize/2.0) - normcdf(critical - binsize/2.0)
k/N = 0.5 * (erf((critical + binsize/2.0)/sqrt(2.0)) - erf((critical - binsize/2.0)/sqrt(2.0)))
So, you choose the S/N value for the critical point, determine the k/N value, and then find the binsize that achieves that (such as via the following figure).
This is for critical = 2\sigma.
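If it's useful, here's a quick Python/scipy sketch of that procedure (the function names, S/N target, and sample size are just my examples):

import numpy as np
from scipy.special import erf
from scipy.optimize import brentq

def bin_fraction(binsize, critical=2.0):
    # Fraction of samples, k/N, landing in a bin of width binsize
    # centered on the critical point (in units of sigma).
    hi = (critical + binsize / 2.0) / np.sqrt(2.0)
    lo = (critical - binsize / 2.0) / np.sqrt(2.0)
    return 0.5 * (erf(hi) - erf(lo))

def best_binsize(n_points, target_snr, critical=2.0):
    # The critical bin needs k = (S/N)^2 points on average, so solve
    # bin_fraction(binsize) = k/N for binsize.
    target_fraction = target_snr**2 / float(n_points)
    return brentq(lambda b: bin_fraction(b, critical) - target_fraction, 1e-6, 10.0)

print(best_binsize(n_points=10000, target_snr=5.0))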

Friday, 11 October 2013

Linear algebra problem.

It's not immediately obvious, but any sort of solver, fitting, or minimization is basically just linear algebra.  This week's work-related math problem was in the polynomial fitting code.  It only arises when the x-data is closely spaced (separations of order 1e-3).  How is this linear algebra?

Here's how you fit a polynomial.  First, define the Vandermonde matrix for a vector x as
VM[x] = [x.**n x.**(n-1) ... x.**2 x.**1 x.**0]
using matlab/octave style .-operators for element by element operations.  Basically, this has expanded the vector x into an array that has the various power terms in the columns.  You can then define a system of polynomials as VM[x] p, where the vector p contains the polynomial coefficients.  In the fitting form, this vector is what you're trying to find.  To do this, you do basically the same thing as regular fitting:
A p = b
(VM[x]^T VM[x]) p = (VM[x]^T y)
using some prior knowledge of how least squares fitting usually goes.  This is basically a matrix inversion (of A), but that's hard to do, so usually this kind of thing is solved with an LU decomposition or the Gauss-Jordan method.  However, as mentioned above, if the x-data is closely spaced, this fails, as our GJ implementation has some logic to prevent the construction of row operations that have large scaling factors.
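Here's a simplified numpy sketch of the fit itself (not the actual fitting code; np.linalg.solve stands in for the LU/GJ solver):

import numpy as np

def polyfit_normal_equations(x, y, degree):
    # Build the Vandermonde matrix (columns x^n ... x^1 x^0) and solve
    # the normal equations (V^T V) p = V^T y for the coefficients p.
    V = np.vander(x, degree + 1)
    A = V.T @ V
    b = V.T @ y
    return np.linalg.solve(A, b)   # stands in for the LU / Gauss-Jordan step

x = np.linspace(0.0, 1.0, 20)
y = 3.0 * x**2 - 2.0 * x + 0.5
print(polyfit_normal_equations(x, y, 2))   # ~[3, -2, 0.5]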

This leads to the idea of pre-scaling the data in x to prevent this from being a problem.  Define a new vector
x' = f x
where f is some scalar.  This means that
VM[x'] = VM[x] .* VM[f]
so
A' = (VM[x]^T VM[x]) .* (VM[f]^T VM[f])/length(x)
b' = (VM[x] .* VM[f])^T y
which has the solution
p = p' .* sqrt(diag(VM[f]^T VM[f])/length(x))
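To make the bookkeeping concrete, here's a numpy sketch with f treated as already known (the example values are arbitrary):

import numpy as np

def polyfit_scaled(x, y, degree, f):
    # Fit against the scaled data x' = f*x ...
    V = np.vander(f * x, degree + 1)
    p_scaled = np.linalg.solve(V.T @ V, V.T @ y)
    # ... then undo the scaling: p_i = p'_i * f^(power of term i),
    # which is the sqrt(diag(VM[f]^T VM[f])/length(x)) factor above.
    powers = np.arange(degree, -1, -1)
    return p_scaled * f**powers

x = 1.0e-3 * np.arange(20)          # the closely spaced case
y = 3.0 * x**2 - 2.0 * x + 0.5
print(polyfit_scaled(x, y, 2, f=1.0e3))   # ~[3, -2, 0.5]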

Great.  What's f?  This is where things start to get complicated.  Why does the GJ solver fail?  It basically boils down to the matrix A being ill-conditioned.  The degree of ill-conditioning for a matrix can be quantified by the condition number (with the most ill-conditioned matrices being non-invertible, where c_N = infinity).  The fastest way to estimate this is to calculate
c_N' = max(diag(A))/min(diag(A))
If this is large relative to some threshold (like 1e7), the matrix is ill-conditioned and should have some scaling.  Unfortunately, this is just a decent lower limit, so to do this correctly, you'd need to do something like
c_N = abs(max(eig(A)) / min(eig(A)))
or, even better:
[U S V] = svd(A)
c_N = max(diag(S))/min(diag(S))
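For comparison, here's what those three estimates look like in numpy, on a toy ill-conditioned A built from some closely spaced example x-data:

import numpy as np

x = 1.0e-3 * np.arange(20)          # closely spaced x-data
V = np.vander(x, 4)                 # cubic fit
A = V.T @ V

diag_estimate = np.max(np.diag(A)) / np.min(np.diag(A))   # the fast lower-limit version
w = np.linalg.eigvalsh(A)           # eigenvalues of the symmetric A, ascending
eig_estimate = np.abs(w[-1] / w[0])
s = np.linalg.svd(A, compute_uv=False)   # singular values, descending
svd_estimate = s[0] / s[-1]

print(diag_estimate, eig_estimate, svd_estimate)
print(np.linalg.cond(A))            # numpy's built-in, as a sanity check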

In any case, I have the suspicion that the "optimum" f value is something like
f = sqrt(max(max(diag(S)), 1/min(diag(S))))
The idea being that you want to reduce the dynamic range of A enough that you no longer break the threshold.  One complication is that checking the actual matrix that's breaking things suggests that the "scaled" form that works is actually even more ill-conditioned than the unscaled one, but since the test in the current code is one-directional, this doesn't throw an error.

That was kind of hard to go through, so here's a Jigglypuff mocking Clefairies.

Monday, 23 September 2013

Possible solution to a major problem.

Allowing me to do this (left: original; right: modified reconstruction).
Given an image i, such that i = a + n + c, where a is the astronomical signal, n is the random noise, and c is the detector-specific corruption, define a new image i' = i - a' = (a - a') + n + c, where a' is an estimate of the astronomical signal.  This has a Fourier transform F(i') = (A - A') + N + C, using the fact that Fourier transforms are linear.  Assuming a' is a sufficiently good proxy for a, and that n is small (specifically that ||N|| << ||F(i')||), this leaves F(i') ~ C.  Constructing a Fourier mask that removes C and transforming back gives t ~ (a - a') + n.  The difference of i' and t then provides a clean estimate of c, which can be removed from i: o = i - (i' - t).
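As a rough numpy sketch of that procedure (the image, the model a', and the boolean mask of bad (U,V) cells all come from elsewhere, and the names are mine):

import numpy as np

def remove_corruption(image, model, bad_freq_mask):
    # image:  i = a + n + c;  model:  a', an estimate of the astronomy;
    # bad_freq_mask: True at the (U,V) cells carrying the corruption C.
    residual = image - model                  # i' = (a - a') + n + c
    spectrum = np.fft.fft2(residual)          # F(i') ~ C if a' is good and n is small
    spectrum[bad_freq_mask] = 0.0             # the Fourier mask
    cleaned = np.fft.ifft2(spectrum).real     # t ~ (a - a') + n
    corruption = residual - cleaned           # i' - t ~ c
    return image - corruption                 # o = i - (i' - t)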

I need to test that photometric properties are retained by this transformation, and I'd like to speed it up some and sort out some of the edge case issues (visible around the blank bands above), but I think this solves the bulk of the issues with this problem.

Wednesday, 26 June 2013

The Supreme Court

Specifically, killing them.

Hang on there, Secret Service.  We're doing it statistically, so it's just a math problem.  No one is actually getting hurt, and I mostly did it to see how soon it's likely that there will be a vacancy due to death.  Someone could resign at any time, and that would make a vacancy, but that's not something that can really be predicted.  Death is a bit tricky too, but despite the best attempts of some people, we still have a government that collects piles of data, so we can use the cohort life expectancy data from the Social Security Administration.

From that data, we're going to use the "High-cost" columns, and use the "at age 65" data.  The logic behind this is based on two facts.  First, the majority of the difference between the "at birth" and "at age 65" expectancies is due to infant mortality and things like that.  If you survive to 65, you didn't die as a baby.  The Supreme Court is universally composed of people who did not die as babies.  The choice of "High-cost" is that being a Supreme Court justice is a pretty good job, so they're likely to have good health care.  The final assumption I have to make is that people born before 1940 have life expectancies similar to those born in 1940.  The table only goes back so far, and I don't really want to extrapolate it.

The next assumption is that the probability of being alive at a given age follows a logistic function, largely due to a lack of better options.  The form I chose was P(age) = 1 / (1 + exp(-(expectancy - age))).  This sets the 50% probability at the expectancy age.  I could scale the width, but I don't have any good idea of how to do that, so I'm just letting it be 1.0.
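In Python, that assumption is just the following (the expectancy value is a made-up example, not one from the SSA table):

import numpy as np

def prob_alive(age, expectancy):
    # Logistic survival curve: 50% at the expectancy age, scale fixed at 1.0 year.
    return 1.0 / (1.0 + np.exp(-(expectancy - age)))

ages = np.arange(60, 101)
print(prob_alive(ages, expectancy=85.0))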

Using this table of birth dates and ages, we can make the following plot:
This shows the probability that each justice is alive.  I'm sorry that this claims Ruth Bader Ginsburg is 52% dead.
Ok, using these probabilities, we can plot up the number of justices statistically alive by simply summing these curves:

This suggests that (barring the Senate being jerks), Obama is likely to appoint two justices.  His successor falls into a generational gap between Breyer and Thomas, and isn't likely to appoint anyone, even in an eight-year term.  It's fairly linear over the next five terms, with about one appointment each term.  This assumes that all newly appointed justices are younger than the current youngest, and that we're only concerned with the current set.

So again, barring the Senate being terrible, we should have a significantly more liberal court within the next four years.

Batman and Robin are happy about that fact, too.

Sunday, 16 June 2013

Because fundamentally misunderstanding physics annoys me

I just read an article where some moron is writing that people should build their own anti-satellite system to blind spy satellites using laser pointers.


Thinkgeek sells this laser pointer with a beam divergence of 1.2 milli-radians.  


Ok, fine, this page lists a few others, and it looks like the best they can come up with is 0.5 mrad.  Geosynchronous orbit is 35786 kilometers up.  Using the power of tan(x) ~ x for small x (and noting that 5e-4 radians is really small), this gives a beam diameter (assuming a point source at the laser) of 17.893 kilometers.  This works out to the irradiance at the satellite being something like a factor of 3.1234e-15 smaller than it is at the source (the arithmetic is sketched at the end of this post).   In conclusion,

Thank you, Lois.
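For the record, here's that arithmetic as a quick Python sketch; the 1 mm beam size leaving the pointer is my assumption, picked because it reproduces the factor quoted above:

divergence = 0.5e-3          # radians, full angle
distance = 35786.0e3         # meters, geosynchronous altitude
source_diameter = 1.0e-3     # meters, assumed beam size at the pointer

spot_diameter = distance * divergence            # tan(x) ~ x for small x
irradiance_ratio = (source_diameter / spot_diameter)**2

print(spot_diameter / 1000.0)    # ~17.9 km
print(irradiance_ratio)          # ~3.1e-15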


Sunday, 26 May 2013

More Fourier transform stuff.

Consider this image:
This is just a sine wave tipped at 45 degrees.  Now, look at the power spectrum:
Ok, the two spots make sense, as those correspond to the point in the (U,V) plane that has that frequency, along with its reflection (the negative frequency).  But what's up with all that other crap?

The FFT assumes the signal is periodic, and since it only has the image to work with, the default thing to do is to calculate a circular FFT, where the left and right edges wrap around, and the top and bottom do as well (making a video game torus).  However, since the sine signal doesn't match up nicely at the edges (most obvious at the top and bottom), this introduces a lot of extra Fourier components, which correspond to those + shapes.

The solution to this problem is to taper the image down so that it is effectively periodic.  The easiest way to do this is to just multiply by a Gaussian so that the image drops down to zero at the edges, like this:
Ok, what's this look like in (U,V)-space?
The two dots we expect.  There's some lower level noise, but this has the same colormap as the previous spectrum.  The dots aren't perfect points, which I think tells us something about the size of the Gaussian we used.

This is important, because the science stuff I've been playing with has odd edge effects that impose a strong sinc^2 signal onto the power spectrum.  This makes it difficult to isolate bad frequencies.  Remembering this trick solves that problem, and should make cleaning those images much easier.
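Here's a small numpy sketch of the trick, using a tilted sine whose period deliberately doesn't fit the array evenly:

import numpy as np

n = 256
y, x = np.mgrid[0:n, 0:n]
image = np.sin(2.0 * np.pi * (x + y) / 15.0)    # 45-degree sine that doesn't wrap cleanly

# Gaussian taper that pulls the image down toward zero at the edges.
r2 = (x - n / 2.0)**2 + (y - n / 2.0)**2
taper = np.exp(-0.5 * r2 / (n / 6.0)**2)

power_raw = np.abs(np.fft.fftshift(np.fft.fft2(image)))**2
power_tapered = np.abs(np.fft.fftshift(np.fft.fft2(image * taper)))**2
# On a log scale, power_raw shows the +-shaped wrap-around artifacts, while
# power_tapered is basically just the two dots (slightly broadened by the taper).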

Tuesday, 14 May 2013

Another reason why Fourier transforms are magical.

Let's pretend we have an image, and that image is just a bunch of dots:
Here are three, with different numbers and positions for the dots (randomly distributed), except for the centered single dot case.
I've chosen Gaussians for the shape of the dots, because those are simple to work with.  Now, take the 2D Fourier transform of these three images.  What does the power spectrum look like?
Those curly-cue shapes are ~20 orders of magnitude smaller than the peak.  That's what I get for plotting logarithms.
To within the floating point accuracy of my computer, these power spectra are all identical.  No matter where the dots are on the image, the power spectrum is identical.  How does that work?  Basically, the power spectrum just says "the image is this big, and the things that show up on it look like this."  All of the position information is stored in the phase information between the real and imaginary components:

The single dot has a constant phase image.  The other two have phase images that look like random noise, but are actually encoding the position information for each dot.  Next interesting thing:  what happens when we increase the size of the dots:

Bigger dots have a smaller dot in the power spectrum, as wider things are dominated by lower frequency terms (also from the fact that you can directly calculate that F(Gaussian(sigma)) = Gaussian(1/sigma)).
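Here's a quick numpy check of the single-dot version of this (the positions and sizes are arbitrary):

import numpy as np

def dot_image(n, x0, y0, sigma):
    # An n x n image containing a single Gaussian dot at (x0, y0).
    y, x = np.mgrid[0:n, 0:n]
    return np.exp(-0.5 * ((x - x0)**2 + (y - y0)**2) / sigma**2)

a = np.fft.fft2(dot_image(128, 64.0, 64.0, sigma=3.0))   # centered dot
b = np.fft.fft2(dot_image(128, 20.0, 90.0, sigma=3.0))   # same dot, moved

print(np.allclose(np.abs(a)**2, np.abs(b)**2))   # True: same power spectrum
print(np.allclose(np.angle(a), np.angle(b)))     # False: the position lives in the phase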

Thursday, 7 March 2013

Measures of variability

Actually: estimates of the dispersion of a distribution.  Everyone knows the standard deviation
sigma = sqrt(1/N * sum((x_i - mean)^2))
That's pretty much the one everyone uses.  

Another strategy is to use the interquartile distance:
IQD = x_75% - x_25%
If you look at this for a Gaussian distribution (which is the assumption the standard deviation makes), the cumulative distribution function for the normal distribution shows that you can get a "sigma" from this where
sigma_IQD ~= 0.7413 * IQD
This measure is somewhat more robust against outliers, as you can pile a bunch of points outside the inner half of the distribution and they won't change the IQD much.  However, if you have a bias where all your outliers are on one side (say, all greater than the median of the distribution you care about), then the IQD only holds up while fewer than 25% of your points are outliers.

One way to fix this is to use the median absolute deviation:
MAD = median( abs(x_i - median(x_i)))
It's not too difficult to directly prove that for a symmetric distribution (Gaussian, or a Gaussian + unbiased outlier distribution)
MAD_symmetric = 0.5 * IQD_symmetric
However, if the outliers are biased, MAD is able to accept up to 50% outliers before it really breaks down.  As above, you can create an effective sigma:
sigma_MAD ~= 1.4826 * MAD
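In numpy terms, the three estimates so far are roughly the following (my own sketch, not the code used for the plots below):

import numpy as np

def sigma_stddev(x):
    return np.std(x)

def sigma_iqd(x):
    # Interquartile distance, converted to an effective Gaussian sigma.
    q25, q75 = np.percentile(x, [25.0, 75.0])
    return 0.7413 * (q75 - q25)

def sigma_mad(x):
    # Median absolute deviation, converted to an effective Gaussian sigma.
    return 1.4826 * np.median(np.abs(x - np.median(x)))

rng = np.random.default_rng(42)
good = rng.normal(0.0, 1.0, 9000)          # the real distribution
outliers = rng.uniform(0.0, 10.0, 1000)    # 10% biased outliers
data = np.concatenate([good, outliers])
print(sigma_stddev(data), sigma_iqd(data), sigma_mad(data))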

Finally, you can histogram all the data and do a Gaussian fit to that histogram to determine the width.  This has the problem that you need a decent number of samples before the fit will converge well.  It can also be more computationally expensive, since you have to fit a function.  Ok, now let's look at a bunch of plots.  For these, I created 10000 random samples, with some fraction of them being "outliers."  The real distribution is a Gaussian of mean 0 and stddev 1.0.  The outliers are drawn from a flat distribution from -5 to 5 (for the unbiased case) or from 0 to 10 (for the biased case).
Plots for the biased and unbiased cases (N = 10000).
From these, you can see that:

  1. For the biased outlier distribution, IQD fails around 25% contamination.  MAD is better until around 50%.  Histogram fits are generally better, but fail catastrophically around 60%.
  2. For the unbiased outliers, IQD and MAD are identical as suggested above.  Histograms are still better.
  3. Standard deviation is universally bad when any outliers are present.
Plots for the biased and unbiased cases (N = 100).
This is the same set of plots, now done for N = 100.  Histograms are now good only in the unbiased case with less than 50% contamination.  For the biased case, they're not better than the standard deviation above 20% contamination.

So that's largely why I like using a sigma based on the MAD.  It's pretty much universally better than the regular standard deviation when outliers are present.  Fitting Gaussians to histograms can do a better job, but that uses more complicated math, so it may not be useful if you don't have fitting code ready to go.