Tuesday, 15 March 2016

I didn't really do any of the improvements I discussed two years ago.

Basketball

Basically my solution this time was:

  1. Check with Julie that no team went undefeated this year.  That was my big problem last time.
  2. Run the 2016, 2015, and 2014 game solutions to determine the relative rankings for all of the teams in each of those years.
  3. Rank things based on the 2016 solution, letting the 2015 solutions break ties (see the sketch after this list).  Also use this information (and the 2014 solutions) for rule 4:
  4. Any time a #12 sport-ranked team is ranked substantially above a #1-4 sport-ranked team, assume something is off with the model, because for some reason I have a lot of #12-ranked teams ranked really high.
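For concreteness, a minimal Python sketch of the step-3 tie-break (the three team tuples are pulled from the table below; everything else is illustrative):

    # Order teams by 2016 model score, falling back to 2015 (then 2014)
    # when the later years are tied.
    teams = [
        ("Kansas",      23.4375,  28.125, 40.625),   # (name, 2014, 2015, 2016)
        ("Villanova",   37.5000,  46.875, 37.500),
        ("Austin Peay", -9.3750, -21.875,  1.5625),
    ]

    ranked = sorted(teams, key=lambda t: (t[3], t[2], t[1]), reverse=True)
    for name, s14, s15, s16 in ranked:
        print(f"{name:12s} {s16:9.6f}")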
So the result table is:


#2014.score     2015.score      2016.score    group   game  sport-rank  2016.score      Team.name
23.437500       28.125000       40.625000       1       1       1       40.625000       Kansas
-9.375000       -21.875000      1.562500        1       1       16      1.562500        Austin Peay
18.750000       -3.125000       17.187500       1       2       8       17.187500       Colorado
29.687500       7.812500        20.312500       1       2       9       20.312500       Connecticut
3.125000        32.812500       26.562500       1       3       5       26.562500       Maryland
9.375000        20.312500       29.687500       1       3       12      29.687500       S Dakota St
10.937500       4.687500        20.312500       1       4       4       20.312500       California
14.062500       14.062500       34.375000       1       4       13      34.375000       Hawaii
40.625000       43.750000       26.562500       1       5       6       26.562500       Arizona
NAN     NAN     NAN     1       5       11      NAN     VAN/WICH
1.562500        18.750000       28.125000       1       6       3       28.125000       Miami FL
14.062500       21.875000       9.375000        1       6       14      9.375000        Buffalo
12.500000       15.625000       17.187500       1       7       7       17.187500       Iowa
-20.312500      23.437500       15.625000       1       7       10      15.625000       Temple
37.500000       46.875000       37.500000       1       8       2       37.500000       Villanova
3.125000        -1.562500       17.187500       1       8       15      17.187500       UNC Asheville
21.875000       20.312500       34.375000       2       1       1       34.375000       North Carolina
NAN     NAN     NAN     2       1       16      NAN     FGCU/FDU
-15.625000      -12.500000      14.062500       2       2       8       14.062500       USC
18.750000       17.187500       20.312500       2       2       9       20.312500       Providence
3.125000        10.937500       28.125000       2       3       5       28.125000       Indiana
4.687500        18.750000       37.500000       2       3       12      37.500000       Chattanooga
20.312500       53.125000       26.562500       2       4       4       26.562500       Kentucky
18.750000       17.187500       31.250000       2       4       13      31.250000       Stony Brook
-3.125000       37.500000       15.625000       2       5       6       15.625000       Notre Dame
NAN     NAN     NAN     2       5       11      NAN     MICH/TULSA
1.562500        21.875000       28.125000       2       6       3       28.125000       West Virginia
45.312500       39.062500       34.375000       2       6       14      34.375000       SF Austin
29.687500       42.187500       12.500000       2       7       7       12.500000       Wisconsin
25.000000       6.250000        15.625000       2       7       10      15.625000       Pittsburgh
14.062500       12.500000       34.375000       2       8       2       34.375000       Xavier
12.500000       -6.250000       26.562500       2       8       15      26.562500       Weber St
21.875000       25.000000       34.375000       3       1       1       34.375000       Oregon
NAN     NAN     NAN     3       1       16      NAN     HC/SOUTH
23.437500       -7.812500       29.687500       3       2       8       29.687500       St Joseph's PA
32.812500       18.750000       18.750000       3       2       9       18.750000       Cincinnati
20.312500       23.437500       17.187500       3       3       5       17.187500       Baylor
7.812500        18.750000       25.000000       3       3       12      25.000000       Yale
28.125000       40.625000       20.312500       3       4       4       20.312500       Duke
-21.875000      6.250000        28.125000       3       4       13      28.125000       UNC Wilmington
20.312500       10.937500       12.500000       3       5       6       12.500000       Texas
1.562500        42.187500       15.625000       3       5       11      15.625000       Northern Iowa
3.125000        14.062500       29.687500       3       6       3       29.687500       Texas A&M
26.562500       23.437500       17.187500       3       6       14      17.187500       WI Green Bay
0.000000        4.687500        10.937500       3       7       7       10.937500       Oregon St
28.125000       26.562500       23.437500       3       7       10      23.437500       VA Commonwealth
21.875000       18.750000       28.125000       3       8       2       28.125000       Oklahoma
-9.375000       -7.812500       25.000000       3       8       15      25.000000       CS Bakersfield
34.375000       40.625000       29.687500       4       1       1       29.687500       Virginia
7.812500        -1.562500       17.187500       4       1       16      17.187500       Hampton
-6.250000       -9.375000       10.937500       4       2       8       10.937500       Texas Tech
-4.687500       18.750000       17.187500       4       2       9       17.187500       Butler
-3.125000       14.062500       29.687500       4       3       5       29.687500       Purdue
-3.125000       -7.812500       37.500000       4       3       12      37.500000       Ark Little Rock
29.687500       26.562500       15.625000       4       4       4       15.625000       Iowa St
17.187500       26.562500       18.750000       4       4       13      18.750000       Iona
0.000000        1.562500        26.562500       4       5       6       26.562500       Seton Hall
34.375000       46.875000       29.687500       4       5       11      29.687500       Gonzaga
14.062500       25.000000       28.125000       4       6       3       28.125000       Utah
4.687500        -3.125000       25.000000       4       6       14      25.000000       Fresno St
20.312500       26.562500       28.125000       4       7       7       28.125000       Dayton
34.375000       7.812500        9.375000        4       7       10      9.375000        Syracuse
28.125000       18.750000       35.937500       4       8       2       35.937500       Michigan St
23.437500       3.125000        23.437500       4       8       15      23.437500       MTSU
-1.562500       10.937500       9.375000        5       1       11      9.375000        Vanderbilt
53.125000       37.500000       25.000000       5       1       11      25.000000       Wichita St
14.062500       17.187500       10.937500       5       2       16      10.937500       FL Gulf Coast
-17.187500      -20.312500      6.250000        5       2       16      6.250000        F Dickinson
26.562500       0.000000        15.625000       5       3       11      15.625000       Michigan
14.062500       18.750000       14.062500       5       3       11      14.062500       Tulsa
9.375000        -3.125000       -7.812500       5       4       16      -7.812500       Holy Cross
9.375000        1.562500        15.625000       5       4       16      15.625000       Southern Univ

So, using this, I can answer the following questions I ran across while doing the research of figuring out what "FGCU" means.

  1. I saw a thing asking if Holy Cross was underrated.  My analysis says "no," and concludes with a "holy crap, no."
  2. Kansas is probably going to win it all.
  3. Julie was right, Michigan State should have been ranked higher than Virginia.
  4. I correctly called the two group 5 games that have already been played.

Using the ESPN clicky thing to apply these rules (I bent rule #4 to also cover #13-ranked teams):

Group 1 and 2.

Group 3 and 4.

Final stuff.  I don't know how to call the final score.  The two finalists are separated by ~4 points in my scores, or about 10%.  So maybe a 10-point margin, since basketball is a "log10(score) ~ 2" kind of game?

For the remaining play-in games, I have Michigan and Southern University winning (in addition to the correctly called Wichita State and Florida Gulf Coast).

Monday, 1 February 2016

Normalization followup to last week

Remember last week, when I got too lazy and bored to correctly normalize things in the newspaper column detection stuff?  This is probably the correct normalization: v1 = col_mean / col_sigma; v2 = (v1 - median(v1)) / madsig(v1).  This essentially centers and whitens the tracer signal from last week (v1 here).
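In numpy form, that would look something like the following (assuming madsig is the usual MAD scaled by 1.4826 to match a Gaussian sigma):

    import numpy as np

    def tracer(img):
        """Per-column gap tracer for a 2D grayscale page image."""
        col_mean = img.mean(axis=0)
        col_sigma = img.std(axis=0)
        v1 = col_mean / col_sigma            # spikes at column gaps

        med = np.median(v1)
        madsig = 1.4826 * np.median(np.abs(v1 - med))
        return (v1 - med) / madsig           # centered and whitened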

Wednesday, 27 January 2016

I just spent a large chunk of the evening fighting something, even after I'd come to an answer I was happy with.

The problem is fairly easy to state: given a printed page, can you reliably determine the column boundaries?  The answer has to be "easily, duh," right?  So, I found a bunch of newspaper images with different numbers of columns, and did some tests. 

First, I wrote a program that did regular and robust statistics on a column-by-column basis, and plotted these up.

The test image.
The normalizations are simply the mean of the given statistic.

So that kind of looks like garbage.  Without the normalization, it becomes obvious that at column gaps, the signal (the mean or median) goes up, as the intensity is brighter.  However, the variance (sigma or MAD-sigma) goes down, as the column becomes more uniform.  Therefore, if you divide the one by the other, these two features should reinforce, and give a much clearer feature:

Which it does.
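As a sketch of how the combined feature could be turned into actual gap detections (the 3x-the-baseline threshold here is a guess, not a tuned value):

    import numpy as np

    def find_gaps(img, threshold=3.0):
        """Return (start, end) column runs where mean/sigma sits well
        above the page baseline; those runs are the candidate gutters."""
        ratio = img.mean(axis=0) / np.maximum(img.std(axis=0), 1e-12)
        flagged = ratio > threshold * np.median(ratio)

        gaps, start = [], None
        for i, f in enumerate(flagged):
            if f and start is None:
                start = i
            elif not f and start is not None:
                gaps.append((start, i - 1))
                start = None
        if start is not None:
            gaps.append((start, len(flagged) - 1))
        return gaps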

So, applying it to the image corpus:
http://freepages.genealogy.rootsweb.ancestry.com/~mstone/27Feb1895.jpg

http://www.thevictorianmagazine.co.uk/1856%20nov%2022%20web%20size/1856%20nov%2022%20p517.jpg

http://www.angelfire.com/pa5/gettysburgpa/23rdpaimages/cecilcountydare2.jpg

http://malvernetheatre.org/wp-content/uploads/2012/07/Malverne-Community-Theatre-Newspaper-Reviews-14-page-0.jpg

https://upload.wikimedia.org/wikipedia/commons/b/b8/New_York_Times_Frontpage_1914-07-29.png

http://www.americanwx.com/bb/uploads/monthly_01_2011/post-382-0-70961700-1295039609.jpg

http://s.hswstatic.com/gif/blogs/vip-newspaper-scan-6.jpg

I probably should have kept the normalization factors, but that would have required calculating them and sticking them in the appropriate places in the plot script, and I was just too lazy to do that.  Still, this looks like a fairly decent way to detect column gaps.  The robust statistics do a worse job, probably because they attempt to clean up exactly the kind of deviations that are interesting here.  Compared to the baseline level, the gaps show a factor of about 10 difference.  It really only fails when the columns aren't clearly separated (or it might work even then, if I bothered normalizing the plots and clipping the range better).

Thursday, 10 December 2015

IRLS, but without having a line to fit.

My boss told me to think about applying iteratively reweighted least squares (IRLS) to fitting a static data set.  So it's not really least squares so much as iteratively reweighted weighted means.  In the one-dimensional test case I knocked together in about ten minutes, it converges in about five iterations to the correct solution for basically all contamination rates, up to the point where the contaminating sample is 100% the size of the real sample; past that, it switches over to fitting the contaminating sample.  Makes sense.
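A minimal sketch of what I mean, with a Tukey bisquare weight standing in for whatever weight function was actually used:

    import numpy as np

    def irls_mean(x, c=4.685, iters=10, tol=1e-6):
        """Robust mean: start from the plain mean, then repeatedly
        downweight points far from the current estimate and re-average."""
        mu = x.mean()
        for _ in range(iters):
            r = x - mu
            sigma = 1.4826 * np.median(np.abs(r - np.median(r)))  # robust scale
            u = r / (c * sigma)
            w = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)
            mu_new = np.sum(w * x) / np.sum(w)
            if abs(mu_new - mu) < tol:
                break
            mu = mu_new
        return mu

    # Toy contamination test: real sample at 0, contaminant at 5.
    rng = np.random.default_rng(0)
    real = rng.normal(0.0, 1.0, 1000)
    bad = rng.normal(5.0, 1.0, 300)   # contaminant 30% the size of the real sample
    print(irls_mean(np.concatenate([real, bad])))   # lands near 0, not the raw mean ~1.15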

Tuesday, 30 June 2015

weighting functions

For IRLS fitting.
This clarifies why some of the weights were coming out the same, and why I was overfitting in a previous iteration.

It generally looks like all of these are largely equivalent.  Except, obviously, for the one that blows up at x=0, which was my typo weight.
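The post doesn't say which weight functions were plotted, but the usual IRLS suspects look like this (x is residual over scale; the tuning constants are the standard 95%-efficiency values):

    import numpy as np

    def huber(x, c=1.345):
        ax = np.maximum(np.abs(x), 1e-300)   # dodge the division at x=0
        return np.minimum(1.0, c / ax)

    def cauchy(x, c=2.385):
        return 1.0 / (1.0 + (x / c) ** 2)

    def bisquare(x, c=4.685):
        w = (1.0 - (x / c) ** 2) ** 2
        return np.where(np.abs(x) <= c, w, 0.0)

    # All three stay near 1 for small residuals and fall off for large
    # ones; a 1/x**2-style typo instead blows up as x -> 0.
    x = np.linspace(-6.0, 6.0, 241)
    for f in (huber, cauchy, bisquare):
        print(f.__name__, f(x)[::60])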

Sunday, 22 March 2015

How many people are left?

I had time this evening to finally sit down and code up a simulator to answer the question I keep having when I watch this Last Man on Earth show (which I don't think is very good, since half the cast play horrible people, and because I can use the phrase "half the cast" about a show with that title).  Given that one person survived whatever killed everyone, how likely is it that there's another person alive as well?  This is basically a Bernoulli trial problem, where we don't know k or p.  I'm taking N = 300e6, which is close enough to the population of the US.  The results:


Normalized to the probability that everybody died, with curves for logarithmically spaced p values from 10**-8.4 to 10**-7.8.

If you think you're the only one alive, then you estimate that lp = -8.4.  However, once you find out that another person is alive, lp >= -8.3 becomes the minimum (at this resolution), and those two cases have similar probabilities.  If you assume that k = 2 is the most likely count, this moves lp to -8.1, and you have strong evidence that a third and a fourth survivor are also likely.  Finding the third again bumps things up, and you can start expecting up to 8.  A fourth?  You've reached the other end of the simulation, and although the dynamic range of the probability ratios increases, it's not excessive.
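For p this tiny, the Binomial(N, p) counts are effectively Poisson, so curves like these can be reproduced with something along these lines (normalized to the everybody-died case, as in the plot):

    import numpy as np
    from math import factorial

    N = 300e6                     # roughly the US population

    # P(k survivors) / P(0 survivors) = lam**k / k!  for lam = N * p.
    for lp in np.linspace(-8.4, -7.8, 7):
        lam = N * 10.0 ** lp
        ratios = [lam ** k / factorial(k) for k in range(9)]
        print(f"lp={lp:+.1f}: " + " ".join(f"{r:.3g}" for r in ratios))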

Basically, once the guy found Kristen Schaal, he was no longer the last man on Earth, and the likelihood of finding a lot more people jumped.  I will say that his strategy of travelling around writing the name of the city he's living in seems like a good idea, assuming he also wrote on a sign there that he's still driving around looking for people if they don't meet him.  That way, people will stay there and wait instead of assuming that city is empty as well.  If he did find a city with survivors, he could just change that sign to redirect everyone to survivor-town.

Sunday, 15 March 2015

I've spent way too much time on this stupid problem: Endor vs Death Star.

Today's main point is a continuation of thoughts on Ewoks.  Earlier this week, I saw this image in my RSS stuff:


And searching online for it again led me to this, which is an overly long discussion of a bunch of nonsense about how Endor would have been incinerated by the debris from the Death Star.  After thinking about it for a few minutes, I came to the conclusion that that didn't make much sense: pre-explosion, the DS had to have been moving fast enough not to crash into the planet, so post-explosion, only a small fraction of the debris should hit it.  Right?

So let's throw physics at the problem, and see.  According to the Star Wars wiki page, Endor has a radius of 2450 km.  If it has the same average density as the Earth, this gives it a mass of 3.39e23 kg.  We're trying to slam stuff into it, however, so let's go with an alternate calculation where the surface gravity is the same as on Earth (the Earth-density Endor has only about half of Earth's surface gravity).  This gives about twice the mass, at 8.82e23 kg.
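Both numbers check out with a few lines of Python (constants are standard SI values):

    from math import pi

    G = 6.674e-11        # m^3 kg^-1 s^-2
    R = 2450e3           # m, Endor's radius
    rho_earth = 5514.0   # kg/m^3, Earth's mean density
    g_earth = 9.81       # m/s^2

    M_density = (4.0 / 3.0) * pi * R ** 3 * rho_earth   # ~3.39e23 kg
    M_gravity = g_earth * R ** 2 / G                    # ~8.82e23 kg, from g = GM/R^2
    print(f"{M_density:.3g} {M_gravity:.3g}")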

Now we know what the target looks like, but what about the DS?  It apparently has a radius of 80km.  What does that mean in terms of mass?  No idea.  It's made of "quadanium steel," so we have to make something up.  Is it steel?  That's 7.75 g/cm^3.  Maybe quadanium makes it lighter, so maybe something like aluminum at 2.70 g/cm^3?  Plus, a lot of the DS is empty, because otherwise you couldn't do stuff inside it.  So there's a fill factor to deal with.  Let's say all the rooms look like this one:

This one.
That's an imperial shuttle, and those are 20m long, and roughly square-ish.  Stamp that footprint around, and I come up with something like 120m * 120m * 60m = 864000 m^3.  If the walls are 5m thick, the fill factor for the DS made up of these rooms is 33%.  In the first movie, the walls tend to look thin like a regular wall.  This yields a fill factor of about 1%.  This results in less mass for the DS, which turns out to be super important.  The high end mass is steel with fill factor=1.0, or 1.66e19 kg.  The low end is aluminum with a 1% fill factor, at 5.8832e16 kg.
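The fill-factor arithmetic, as a sketch (the 0.2 m "regular wall" thickness is back-solved to land near 1%, since the post doesn't give it; this simple version gives ~30% for 5 m walls, in the same ballpark as the 33% above):

    # Solid fraction of a 120 x 120 x 60 m block of rooms: subtract the
    # hollow interior left after walls of the given thickness.
    def fill_factor(w=120.0, d=120.0, h=60.0, wall=5.0):
        outer = w * d * h
        inner = (w - 2 * wall) * (d - 2 * wall) * (h - 2 * wall)
        return (outer - inner) / outer

    print(fill_factor(wall=5.0))   # ~0.30, the thick-walled case
    print(fill_factor(wall=0.2))   # ~0.013, the regular-wall case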

Why does the mass matter?  The DS blows up.  It's a good "kablooey" kind of explosion.  Assuming this is going to completely disperse all the mass to infinity, the explosion energy is comparable to the gravitational binding energy, U = (3/5) * G * M^2 / R.  For our 80 km radius, the high and low mass estimates give binding energies of 1.38e23 and 1.73e18 J.  This has to be converted to kinetic energy of all the debris particles, so assuming Np equal-mass particles, each particle gets U = 1/2 * (M/Np) * v^2, so v = sqrt(Np) * [129, 7.67] m/s.
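A quick check of those numbers (the sqrt(2U/M) speed here is the case where U is shared as the total kinetic energy; multiply by sqrt(Np) for the per-particle version in the formula above):

    from math import sqrt

    G = 6.674e-11
    R = 80e3                          # m, DS radius

    for M in (1.66e19, 5.8832e16):    # solid steel vs 1%-fill aluminum
        U = 0.6 * G * M ** 2 / R      # uniform-sphere binding energy
        v = sqrt(2.0 * U / M)         # ejecta speed if all of U goes to KE
        print(f"M={M:.3g} kg  U={U:.3g} J  v={v:.3g} m/s")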

What's left to know?  Where the DS is located relative to Endor.  I remembered this scene:

Zap!
But that turns out not to help.  I had planned to measure the length of the chord cut on the arc of Endor, and use that to work out how far behind the DS it sits.  However, since this is a CGI composite of two models on a stage, that doesn't work.  Using that chord to work out the projected scale of Endor, given the known "real" radii of the two objects, implies that Endor is in front of the DS, as the model Endor wasn't big enough to create the correct arc.

This isn't a big deal, as I remembered that the DS is orbiting Endor because there's a shield generator on the moon that's protecting it.  This means the DS has to be in a geostationary orbit.  The wiki page for Endor claims an 18-hour rotation period.  So the DS orbits at 18433 km, and has an orbital velocity of 1.79 km/s (which was yesterday's plot).
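That follows from Kepler's third law, using the Earth-gravity mass from above:

    from math import pi

    G = 6.674e-11
    M = 8.82e23                # kg, Earth-gravity Endor
    T = 18.0 * 3600.0          # s, rotation period

    # a^3 = G M T^2 / (4 pi^2) for a stationary orbit.
    a = (G * M * T ** 2 / (4.0 * pi ** 2)) ** (1.0 / 3.0)
    v = 2.0 * pi * a / T
    print(f"a = {a / 1e3:.0f} km, v = {v / 1e3:.2f} km/s")   # ~18433 km, ~1.79 km/s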

Put this all together, and run it through the N-body simulator to see what happens, and you get:

It freaking explodes.
How much does it explode?  It explodes all the way.  Of the randomly placed particles (they all start on the surface of the sphere at 80 km), one triggered the "collision" flag in the simulation.  This makes sense: at the distance to Endor, it subtends 0.13 radians, which translates to 0.435% of the full sky, close to the 1/200 simulation particles hitting.  This is still 2.94e14 kg being dumped on the planet.  That's about a third the size of the Chicxulub impactor, although it'd be spread out somewhat instead of being a single impact.
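The sky-fraction arithmetic, for the record (this spherical-cap version lands at ~0.44%, close to the 0.435% quoted):

    from math import asin, cos

    R_endor = 2450e3
    a = 18433e3                    # m, the DS orbit radius

    theta = asin(R_endor / a)      # angular radius of Endor seen from the DS
    frac = (1.0 - cos(theta)) / 2.0
    print(f"{theta:.3f} rad, {frac:.3%} of the sky")   # ~0.133 rad, ~0.44%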

The other fun thing I tried was reducing the explosion strength by slowing the particle velocity:

Red is 100 times smaller and blue is 1000 times smaller.
In these cases, the particles continue to largely follow the original DS orbit.  In the lowest energy case, Endor is spared, and the extra energy goes to kicking particles out.