Saturday 25 October 2014

Orbital mechanics

Julie mentioned a hack-mo-tron thing about asteroids last week.  I was reasonably sure that the math wasn't super difficult, and so after waking up today, I decided to sit down and see if that was true or not.  First, you need to see the diagram of things from wikipedia:

Then you need to note that the orbit is elliptical, and finally remember Kepler's second law.

And then you do math at it.
It's harder than I thought, mostly because step two doesn't have an analytic solution.  The methodology here is as follows.

  1. First, work in a coordinate system in which the origin is at one focus of the elliptical orbit, and write down the equation for the distance to the orbit from that origin as a function of the true anomaly, listed here as theta.  Precalculate the integral of that because you need it later.
  2. Next, we want to work in fixed (or at least known) time steps.  Therefore we need to know theta as a function of t.  The t here is actually (t_abs - t_perihelion) / period, in order to normalize it sanely.  Next, you do a bunch of calculus, and use the fact that the full area is swept out over one period to calibrate things.  This gives a mess that's effectively a constant for a given t, called kappa here.  This has to be solved numerically, but over the range [-pi:pi], there's only one non-trivial solution, so you can do a quick Newton's method solution to find it (a sketch of the equivalent calculation appears after this list).
  3. Use the anomaly and orbital radius to write the vector U, which contains the (u,v,w) position in orbit coordinates.
  4. Use R_omega to orient the orbit in real space using omega, the argument of perihelion.  This is just a simple spin.  Next, use R_Omega to handle the inclination, with Omega the ascending node and i the inclination angle.  This yields X, which is the (x,y,z) in a universal coordinate system.
  5. Deproject (x,y,z) to (x',y'), because there is no decent 3d visualization code.

I don't have plots because I couldn't find a good arbitrary plotting library that was easy enough to use to do so.

Sunday 28 September 2014

The Simpsons cast vs aging.

I came home, watched a Simpsons episode that was sufficiently bland that I sat down and reimplemented the thing I did for the Supreme Court for a generic data file.  Feeding the main voice actor cast of the Simpsons into it (and quantizing on years, because working out partial years would have taken more time than I wanted to spend) gives the plots below.

Individually.
Summed.
Based on this, The Simpsons is unlikely to lose any of the main cast until 2020 at the earliest.  Assuming they keep going after that, they can continue for about another twenty years before they lose half of the cast.  However, Harry Shearer is that first line, and he voices about a third of all characters on the show.  That would probably be hard to recover from.
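A hedged sketch of the kind of calculation this implies, assuming per-age one-year survival probabilities from a standard life table (p_survive_one_year is a placeholder here, not the table the actual script used; the year quantization matches the post).

    def prob_all_alive(ages, years_ahead, p_survive_one_year):
        """Probability that every cast member is still alive years_ahead from now,
        quantizing on whole years."""
        p = 1.0
        for age in ages:
            for y in range(years_ahead):
                p *= p_survive_one_year(age + y)   # survive year y, given alive at its start
        return p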

Tuesday 29 July 2014

"Hey, I could probably automate that in perl, and make it a game."

cat kitten sword robot space pink white bumblebee bee helmet

Code under the cut.  Rules for the game:  player one chooses an image, and the other players need to come up with descriptive words for what's on it.  Then, you run the command on the best set of words (maybe you don't want to use all of them, because you end up with legos.  Maybe you get points if you only use like 1 word, and if you use 9 or more but get legos, you lose points).  If the image itself, or the character, or an obviously "these images are related somehow" match happens, then the seekers get the point.  If not, then the player who chose the original image gets a point.  So under these rules, the Bee and Puppycat image gets a point, and I lose two points because I didn't find B&PC, and I ended up with legos when I used 10 words.
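The actual command is under the cut and not reproduced here; as a stand-in, a few lines of Python that just build a Google Images query out of the guessed words and open it in a browser (the post's version was perl).

    import webbrowser
    from urllib.parse import quote_plus

    def image_search(words):
        # open a Google Images search for the guessed words
        query = quote_plus(" ".join(words))
        webbrowser.open("https://www.google.com/search?tbm=isch&q=" + query)

    image_search(["cat", "kitten", "sword", "robot", "space",
                  "pink", "white", "bumblebee", "bee", "helmet"])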


Monday 16 June 2014

Because some questions deserve answers.

I saw this episode of Bob's Burgers last night.  Basically, Bob is nearing his 100,000th burger, and some jerk shows up with a cow to protest animal cruelty and asks "how many cows had to die for your burgers?"

First, let's take the National Cattlemen's Beef Association beef research guide, and accept their calculations for a standard 1300-pound steer.  Summing their numbers and then noting after-the-fact that they do it themselves on page 7, this results in a cow containing 638.2 pounds of meat, of which 265.1 pounds are ground beef.  This is where you have to decide if Bob can offset future ground beef with the other cuts contained in the cow or not (is he charged 0.41 cows or 1 cow per cow).  Assuming he's making standard quarter-pound burgers, this results in either 39.173 cows (assuming he's allowed to offset) or 94.304 cows (if every cow that contributes counts in full).
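The arithmetic, as a quick Python check (100,000 burgers at a quarter pound each; the 638.2 and 265.1 pound figures are from the NCBA guide cited above):

    burgers = 100_000
    beef_needed = burgers * 0.25      # quarter-pound patties, in pounds
    print(beef_needed / 638.2)        # offset against the whole cow's meat: ~39.17 cows
    print(beef_needed / 265.1)        # only the ground beef counts: ~94.30 cows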

That's way lower than I would have guessed beforehand.

Saturday 12 April 2014

I was going to talk about statistics, but instead I'm going to talk about sports.

No, not really.  Mostly it's statistics, just stapled onto a sports frame.

Firstly,

Sports.

I don't generally care about sports.

However, there was that Warren Buffett billion dollar thing, so it became slightly less annoying than it usually would be.  The important thing to remember was to not actually watch any of the games, because they really don't matter.

"Huh?"

Yeah.  Here's the secret: sports teams are just random number generators, and if they come up with a bigger number than their opponent, they win.

"But teams are made up of people..."

More RNGs. (Theorem left unproved due to obviousness: The output of the sum of multiple RNGs is itself an RNG.)

"that have different capabilities that influence how the game ends."

So you're telling me they're biased RNGs.

"And you have to incorporate details about how they're doing, and think about how pumped up they are for a given game, and all sorts of other things that influence the outcome!"

Got it.  Teams are biased RNGs, and if they come up with a bigger number, then they win.  Some RNGs have a built in bias, signifying that they're "a better team."

A Model.

The nice thing is that this is kind of already a solved problem.  Latent ability models.  Given a dataset representing the outcomes of many games, you can assign a score that represents how good a team is.  This "ability" is "latent" because you don't know what value it has, or even what determines the value in terms of real-world qualities.  But you can look at the data you have, and use it to decide if in a future match up, team A or team B would win.  Here's a plot:

Once you've assigned abilities, then you expect the probability of a victory to be related to the difference of the abilities for the two teams.  You use this logistic function to model that probability, which ensures that you don't get absolute (exactly zero or one) probabilities in most situations.
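The post's LAM code isn't shown, so here's a minimal Python sketch of that kind of model: each team i gets a scalar ability s_i, P(i beats j) = 1 / (1 + exp(-(s_i - s_j))), and the abilities are fit by gradient ascent on the log-likelihood of the observed game outcomes.  Everything here (function names, learning rate, iteration count) is illustrative.

    import numpy as np

    def fit_abilities(games, n_teams, iters=2000, lr=0.01):
        """games: list of (winner_id, loser_id) pairs."""
        s = np.zeros(n_teams)
        for _ in range(iters):
            grad = np.zeros(n_teams)
            for w, l in games:
                p = 1.0 / (1.0 + np.exp(-(s[w] - s[l])))   # modeled P(winner beats loser)
                grad[w] += 1.0 - p                         # push the winner up by the surprise
                grad[l] -= 1.0 - p                         # and the loser down by the same amount
            s += lr * grad
            s -= s.mean()          # abilities are only defined up to an additive constant
        return s

    def p_win(s, a, b):
        """Probability that team a beats team b, given the fitted abilities."""
        return 1.0 / (1.0 + np.exp(-(s[a] - s[b])))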

"But what about teams that don't play each other?"

The ability of team A is based on the differences in ability with all of the teams they did play.  This means you don't need to know the full matrix of team A vs everyone else; you can use the teams A did play, and compare the scores.  In other words, if A is better than B, and B is better than C, odds are good that A is better than C, even if you've never seen A and C compete.

There is some concern that the connectivity of the dataset can cause problems.  If the dataset is poorly connected, you can get scores where one subset is poorly scaled to the other subsets, as you don't have enough overlap to determine a good solution.  So, what does the connectivity of my dataset look like?
The logistic shape is largely coincidental.  Mathematically motivated, but not important now.  This is also a realization for a single team (team 790), because I didn't care to trace out the full connectivity statistics.
The teams team A plays are a tiny fraction (about 1-2%) of the full set of teams.  However, the teams those teams play connect to about 25% of the set.  Three steps yield 95%.  Four doesn't quite get to 100%, but I stopped caring at that point.
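That check is easy to reproduce; here's an illustrative Python version that just walks the "played a game with" graph outward from one team and reports the cumulative fraction of the league reached after each step (the 790 in the caption above is the post's starting team; the data structure here is assumed, not the post's).

    from collections import defaultdict

    def reachable_fraction(games, start, max_steps):
        """games: list of (team_a, team_b) pairs; returns fraction reached per step."""
        neighbors = defaultdict(set)
        teams = set()
        for a, b in games:
            neighbors[a].add(b)
            neighbors[b].add(a)
            teams.update((a, b))
        frontier, seen = {start}, {start}
        fractions = []
        for _ in range(max_steps):
            frontier = {n for t in frontier for n in neighbors[t]} - seen
            seen |= frontier
            fractions.append(len(seen) / len(teams))   # cumulative fraction of teams reached
        return fractions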

So since the data is reasonably well connected, you'll probably get a decent solution to the latent ability scores.  The final check (that I totally did when I was developing things, thus proving validity, and not at all just ten minutes ago when I realized I should include this) was to compare the ability scores that I came up with to the official "seeds," which is a dumb name for "ranking priors."
Noisy but fairly linear.


My Results.

That's all nice and all, but how does it work?  First, the caveats:  I wrote the LAM code and pushed some tests through in January, and then completely forgot about this whole thing until the day before the tournament started.  This means I basically had to accept the simple LAM solutions, and go with it.  I did a full year solution, and also broke the games down by month, using those solutions to look for trends (whether the abilities were improving or declining).  Finally, I used the scientific method of "which of these is bigger?" to choose who would win.

A side note unrelated to the statistics:  dancing around team name strings was by far the most annoying part of this project.  Everyone should use clear and concise integer team ids, like I did.  It's a superior solution to inconsistent abbreviations.

Here's who I chose.
Not bad, as I even picked up a few of the upsets.  Things fell apart a bit with the later games, but 38/63 isn't bad, right?

Donking Sports.

At least I beat Julie, which was really what my goal was, if I couldn't win a billion dollars.
One of the things that I probably should have researched is the fact that people who do sports all the time are dumb.  Therefore, later games magically get more points than early games, so you aren't just trying to have more correct picks than the other people.  I suspect this is due to someone getting pissed off that they chose the winner correctly, but donked up everything else, and decided that they should win instead of someone who got most of the first round correct.  I can't come up with any other logic for it (if you get round 1 picks wrong, you automatically lose out on any subsequent round outcomes that are based on that incorrect pick.  Why add extra penalties?).

In any case, you can see that I really wasn't too far back in terms of number correct.  You'll also note that I am actually #12, not #13, unless there's some other dumb tie-breaker rule that says that even though 12 and I have the same number of points, and I have more correct picks, I don't win because sports.

I also checked earlier in the week, and I didn't beat Barack Obama, regardless of whether you do number correct or this stupid point system.  I don't remember the values I got.  Like 40 correct and 68 points, maybe?  Whatever, I'm losing interest in this post anyway.

How Can This Be Improved?

As I mentioned above, I kind of rushed at the last minute, and spent more time dancing strings around than I wanted.  This means the theory and development time was more rushed than I would have wanted.  In any case, here are the obvious ways to improve things.

  1. Wichita State.  After determining everything, I later heard on NPR that WS was undefeated in the regular season.  "Great?"  Nope.  Because they never lost, the LAM had no reason not to rank them higher than everyone else (an undefeated team's ability can grow without bound, since raising it only ever makes the observed data more likely).  Basically, no matter how high a score some other team got, WS would always rank higher.
  2. Time evolution.  My month-by-month breakout is way too chunky, and I'm pretty sure it didn't equally distribute the data.  Equal game count bins would be better, and that can be tuned to yield the minimum size that still provides a connected data set.
  3. Game quality.  The dataset I have contains scores for each game.  It's easy to imagine a LAM extension in which the distribution of score differences by ability differences tells you something about the distribution of that team's ability.  The other option would be to use that score difference to amplify a given win (a toy version of this is sketched after this list).  If the victory was by 1000 points, you can probably assume that the winner is "more better" than the loser than if they'd only won by 1 point.
  4. Field advantage.  It also contains home/away status.  This might have a similar effect, where you don't weight home victories as much as away or something.
  5. Inter-year convolution.  I have lots of years.  Teams are comprised of players, and players are not new each year.  Therefore, you could convolve previous and subsequent year abilities with this year's, and see what that gives you.
  6. Data.  I have lots of years.  I should really have been applying the algorithm to all the years, and checking the results against what really happened.  Beyond just this one tournament, the women's tournament data is structured identically, so that's another set of years with data and results that can be used to train the algorithm.
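For item 3, one illustrative way to fold score margins into the toy fit_abilities sketch above (purely an assumption about how it could work, not what the post did): weight each game's gradient contribution by its margin, damped with a log so blowouts don't dominate.

    import numpy as np

    def fit_abilities_weighted(games, n_teams, iters=2000, lr=0.01):
        """games: list of (winner_id, loser_id, margin) triples."""
        s = np.zeros(n_teams)
        for _ in range(iters):
            grad = np.zeros(n_teams)
            for w, l, margin in games:
                weight = np.log1p(margin)                    # a 30-point win counts more than a 1-point win
                p = 1.0 / (1.0 + np.exp(-(s[w] - s[l])))
                grad[w] += weight * (1.0 - p)
                grad[l] -= weight * (1.0 - p)
            s += lr * grad
            s -= s.mean()
        return s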

Wednesday 26 March 2014

I accuse Mr. Green of committing the crime in the Ballroom with the Lead Pipe!

Or, more to the actual topic of this post, I accuse old people of opposing progress with their not dying.

Confused yet?

So I came across something today that pointed me to this survey of gay marriage polling tracks over time.  I then noticed that this page also includes historical trends for interracial marriage.

My obvious question to myself was, "how much of that trend can you explain simply by assuming things are improving because old people die?"

That led to a search for detailed cohort data, because I needed to know when people are dying.  Google google google, and I found this page, which claims the image is from the Census people, although they've since rearranged their webpage so the link doesn't work anymore.  I'm accepting that, because who lies about cohort data?  Crazy statisticians, I guess.

Here's the cohort image, copied in case that link disappears too.

I digitized that image, and then wrote a perl script to calculate the fraction of the population on a given date that was born before a certain year.  This number subtracted from 100 gives the fraction born after that year.  What does that look like?

Due to the five-year blocking of the cohorts, there's a stair-step effect.  Interpolate a best fit line through that data, basically through the midpoints of each step.
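The calculation itself is simple; here's an illustrative Python version of what that perl script presumably does, assuming the digitized cohorts live in a dict keyed by (birth-block start year, census year) with population counts as values (that data structure is mine, not the post's).

    def fraction_born_after(cohorts, census_year, cutoff_year):
        """Percent of the population at census_year born after cutoff_year."""
        total = sum(pop for (block, yr), pop in cohorts.items() if yr == census_year)
        after = sum(pop for (block, yr), pop in cohorts.items()
                    if yr == census_year and block >= cutoff_year)
        return 100.0 * after / total

    # e.g. fraction_born_after(cohorts, 2010, 1980) for the gay marriage curve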

Huh.  That worked far better than I expected.  Basically, ignoring all other factors, you can pretty much explain the entire trend if you just assume that everyone born after 1960 (who would have been kids during the Civil Rights Movement) is cool with interracial marriage, and that everyone born after 1980 (I'm going to use the Simpsons as my cultural shift here, because I don't want to go with "people my age") is cool with gay marriage.


Wednesday 5 February 2014

Topology?

Topology is not my best math.  Sure, I solved the bridges of Königsberg problem in high school as you do, and I can feign a laugh at the doughnut/coffee cup trick that totally isn't decades old (seriously, come up with a new joke already), and I realized what the shape of the Space War universe was while playing the game on an Atari 2600 with no sun and no bouncing (also: old video games always had kick-ass covers).

Why are we talking about topology?  In E17, I like to have a number of virtual desktops, and I like them to be all connected to each other so I can flip between them by pushing on the edge of the screen.  FVWM did this years ago, and I always liked the idea.  E17 also supports wrapping of virtual desktops, so you end up with something like this:

My laptop uses a three by three grid, but it's the same thing.
There's a two-by-two grid of desktops, each connected to the others along wrapped edges.  This is obviously identical to the Space War universe, as you can imagine this as just a really crappy TV with only four pixels.  Therefore, my standard desktop arrangement is homeomorphic to a torus.  Easy peasy.

Now, at work, I have a dual monitor setup, and I assumed this would just work the same, but with fat desktops.  However, I discovered upon getting everything configured that it's actually this far more complicated thing:
Green arrows are bidirectional, as they were above.  Orange are unidirectional.
The two monitors have their own sets of virtual desktops, but, since going across the monitor boundary must put you on the other monitor, they're not connected together.  If you're on the left monitor, there is no right edge you can cross to get to the left side of the adjacent desktop (imagine the center red screen to pink).  You can get there by going to the unopposed left edge of this desktop to arrive on the right edge of that desktop, but you then can't go back, as you jump over to the other monitor.

I don't have proof (again, not my best math), but I believe this defines a very oddly connected hollow torus.  I don't know how you construct a unidirectional connection in topology, but assuming that's valid, you basically have some sort of ratcheted torus for each monitor: you can travel poloidally unobstructed (corresponding to vertical shifts), but can only travel in one direction toroidally.  Switching between monitors is then functionally equivalent to passing through the ratchet onto the other side of the torus.  There's some ambiguity, since you can then shift each monitor separately, so where you come out on the other side of the torus isn't a priori obvious, but I think the bulk of the thing is sorted.

Anyway, I thought it was cool, and now my already-confusing-to-other-people desktop is going to get more confusing.

Wednesday 15 January 2014

I am irrationally concerned with good statistics

k, statistics again.  The problem is that I saw this article today, which basically complains that "no one really means to use standard deviation, as people intrinsically want to use the mean absolute deviation" which is, of course, completely dumb.

First, no one would ever do mean absolute deviation in their head.  Here are some numbers: {-1 2 3 -5 1 400}.  If you had to guess another number that would belong to this set, you're going to guess like "dunno, zero maybe?"  You know that 400 is probably wrong, so you cut it out.  People don't do real means when they filter data.  It's some combination of a mode and median.  Choose a number that doesn't seem crazy.

Second, this mean absolute deviation tells you about where the 50% point falls.  Why that point?  The standard deviation is more inclusive, as it tells you that most samples (68.27% within one sigma) are closer to the central value.

Third, all that obvious stuff about moments analysis.

Anyway, time for plots.  These are the same idea as the ones from the previous post, just remade with more samples and different stats.  The horizontal lines are the true uncontaminated distribution sigma and the true fully contaminated sigma (sigma_uniform = sqrt((b - a)^2 / 12), because math).  First thing to note:  the actual sigma cleanly switches between the two extremes, as it really should.  Gaussian fits are best, but IQD (interquartile distance) and MAD (median absolute deviation) are comparable up to the 50% contamination point.  MeanAD (mean absolute deviation) doesn't seem particularly good.  The full contamination end is biased, as I'm using a parametric model (that it's a Gaussian distribution).
Biased samples.  This nicely shows that IQD fails before MAD, and that Gaussian fits are reasonable up to 60% contamination.  MeanAD is again off kind of doing its own thing.  Median >>> mean for outlier rejection.
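For reference, an illustrative Python version of the estimators being compared, applied to one Gaussian sample contaminated by a uniform background (the 1.4826 and 1.349 factors convert MAD and IQD to sigma for a Gaussian; the contamination fraction and ranges here are arbitrary, not the ones from the plots).

    import numpy as np

    rng = np.random.default_rng(42)
    n, frac = 10000, 0.3                        # 30% contamination
    a, b = -10.0, 10.0                          # uniform contamination range
    clean = rng.normal(0.0, 1.0, int(n * (1 - frac)))
    junk = rng.uniform(a, b, int(n * frac))
    x = np.concatenate([clean, junk])

    std = x.std()                                                      # plain standard deviation
    mad_sigma = 1.4826 * np.median(np.abs(x - np.median(x)))           # MAD scaled to sigma
    iqd_sigma = (np.percentile(x, 75) - np.percentile(x, 25)) / 1.349  # IQD scaled to sigma
    meanad = np.mean(np.abs(x - np.mean(x)))                           # mean absolute deviation (unscaled)
    print(std, mad_sigma, iqd_sigma, meanad)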

Friday 10 January 2014

Fish sandwich is about this early, and twitter works right on tv because I see how the lazy weekend is bizarre conniving prostitutes prostitutes

-Or-
So, let's talk about Markov Chains, I guess.

This news story popped up in the RSS today, which reminded me of the long ago time in which Forum 2000 was a thing (note: Forum 2000 never worked like that.  It was all basically a mechanical turk).

I then remembered that you can download all your tweets from twitter, so blammo, I have the text corpus necessary to hack up something to do this.

So, those Markov Chains.  Here's the simple way to think of it:  starting from initial point A, there's a probability p_B that you'll move to point B, p_C that you'll move to C, etc.  However, every one of those points has its own set of transition probabilities, so you can chain things together.  Maybe go look at the wikipedia figure.  This isn't turning out to be that simple of an explanation.

In any case, if you let each "point" be a word in a tweet, then given a large sample of tweets from someone, you can construct fake tweets once you know the probability that a given word follows another.  That's what I did in perl, and that's where the title came from.  I did a simple chain, where w_{i+1} is drawn directly from the probability distribution of words that follow word w_i.  There's a reset condition where, if the set of words that follow w_i is empty, I restart the chain from the set of "words that start a tweet".  I've also forced the length of the generated string to be 140 characters, the standard tweet length (and forced everything lowercase, and removed all punctuation, etc, etc, etc).
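A minimal Python sketch of that chain (the real thing was perl, and this assumes tweets is already the lowercased, punctuation-stripped list from the twitter archive; duplicates in the follow lists are what encode the transition probabilities).

    import random
    from collections import defaultdict

    def build_chain(tweets):
        starts, follows = [], defaultdict(list)
        for tweet in tweets:
            words = tweet.split()
            if not words:
                continue
            starts.append(words[0])                 # "words that start a tweet"
            for w, nxt in zip(words, words[1:]):
                follows[w].append(nxt)              # repeated entries encode P(nxt | w)
        return starts, follows

    def fake_tweet(starts, follows, max_len=140):
        out = random.choice(starts)
        while len(out) < max_len:
            prev = out.split()[-1]
            # reset condition: if nothing ever followed this word, restart the chain
            nxt = random.choice(follows[prev]) if follows[prev] else random.choice(starts)
            out += " " + nxt
        return out[:max_len]                        # force the 140-character tweet length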

Here are some more examples:
  • i was far the whole dinner but that @wholefoods sushi was aborted because its weird being even fucking clue what hes a cheater :-p theres no
  • hey verizon network that too hungry or more not sure that flasks were afraid the sun orbit simulator its now its someplace else too soggy as
  • happy birthday @jkru isnt helping anybody know and hes never a 12 16 garbage bags tied directly to work - british accent for being wasted its
  • rt @vgc_scott man that quote and cranky tonights dinner oh man its still preparing the oil is actively looking up a workaround to be going to
  • first time you dont want a walk an impenetrable island which is like it looks close eyes will always just woke up dry before i not having intermotrons
  • @jkru its all the music #fuckedupdream peter potomus cartoons are you your app its official word for liberty tsa approved for my implementation
  • dear this is what is terrible monsters outside who like fog and colby isnt even left there to the 100th day yay deposed monarchy ok so hopefully
Those are nearly coherent enough to make you think they're not computer generated.  The crazy thing about MCs is that they do a wonderful job of constructing sentences that are nearly grammatically correct.  This could probably be improved a lot, such as by picking words that have some influence from the second previous word, or by having it clearly mark where it's restarted the chain with a period or something.  Still, for like ten minutes of work, this isn't too bad.

Finally, an interesting sampling from printing out the chain transition probabilities (the columns are: current word, next word, transition probability, count for that pair, and total count for the current word):
fuck you 0.294392523364486 63 214
fuck it 0.0794392523364486 17 214
fuck fuck 0.0700934579439252 15 214
fuck yeah 0.0700934579439252 15 214
fuck up 0.0327102803738318 7 214
[...]
fuck #sandwichtweets 0.00467289719626168 1 214
fuck shitshitshitshitshitshitshitshitshitshitshitfuckshitshitfuck 0.00467289719626168 1 214
[...]
fucking ghost 0.00454545454545455 1 220
[...]
fuckity fuck 1 3 3