Sunday, 28 September 2014

The Simpsons cast vs aging.

I came home and watched a Simpsons episode that was sufficiently bland that I sat down and reimplemented the thing I did for the Supreme Court for a generic data file.  I then fed the main voice actor cast of The Simpsons into it (quantizing on years, because working out partial years would have taken more time than I wanted to spend).

Individually.
Summed.
Based on this, The Simpsons is unlikely to lose any of the main cast until 2020 at the earliest.  Assuming they keep going after that, they can continue for about another twenty years before they lose half of the cast.  However, Harry Shearer is that first line, and he voices about a third of all characters on the show.  That would probably be hard to recover from.
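The calculation itself is simple enough to sketch.  Here's a toy version in Python rather than the original code: the ages are round-number guesses at the cast's 2014 ages, and the Gompertz mortality parameters are purely illustrative, not fitted actuarial values.

```python
import math

def survival_prob(age_now, years_ahead, a=0.0001, b=0.085):
    """P(surviving `years_ahead` more years) under a Gompertz hazard.
    a and b are illustrative parameters, not fitted actuarial values."""
    def cum_hazard(age):
        return (a / b) * (math.exp(b * age) - 1.0)
    return math.exp(-(cum_hazard(age_now + years_ahead) - cum_hazard(age_now)))

# Approximate 2014 ages of the main cast (round-number guesses).
cast = {"Castellaneta": 56, "Kavner": 63, "Cartwright": 57,
        "Smith": 50, "Azaria": 50, "Shearer": 70}

def all_alive(cast, years):
    """Probability that the entire cast survives `years` more years."""
    p = 1.0
    for age in cast.values():
        p *= survival_prob(age, years)
    return p
```

The quantity plotted in the summed curve above is essentially `all_alive` as a function of the year.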

Tuesday, 29 July 2014

"Hey, I could probably automate that in perl, and make it a game."

cat kitten sword robot space pink white bumblebee bee helmet

Code under the cut.  Rules for the game: player one chooses an image, and the other players need to come up with descriptive words for what's in it.  Then, you run the command on the best set of words (maybe you don't want to use all of them, because you end up with legos; maybe you get bonus points if you only use one word, but lose points if you use nine or more and still get legos).  If the image itself, or the character, or an obvious "these images are related somehow" match comes up, then the seekers get the point.  If not, then the player who chose the original image gets a point.  So under these rules, the Bee and Puppycat image gets a point, and I lose two points, because I didn't find B&PC and I ended up with legos when I used 10 words.
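For concreteness, the scoring could look something like this.  The exact point values are my reading of the rules above (sketched in Python rather than the perl), not a spec:

```python
def score_round(num_words, found_match, got_legos):
    """One round of the image-search game, scored per the rules above.
    The point values are my own reading of the informal rules.
    Returns (seeker_points, chooser_points)."""
    seekers, chooser = 0, 0
    if found_match:
        seekers += 1
        if num_words == 1:            # bonus for nailing it in one word
            seekers += 1
    else:
        seekers -= 1                  # didn't find the source image
        chooser += 1                  # chooser's point for stumping them
    if got_legos and num_words >= 9:  # nine-plus words and still legos
        seekers -= 1
    return seekers, chooser
```

Under this scheme the Bee and Puppycat round above scores `score_round(10, False, True)`, which comes out to minus two for me and one for the chooser, matching the tally in the post.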


Monday, 16 June 2014

Because some questions deserve answers.

I saw this episode of Bob's Burgers last night.  Basically, Bob is nearing his 100,000th burger, and some jerk shows up with a cow to protest animal cruelty and asks, "how many cows had to die for your burgers?"

First, let's take the National Cattlemen's Beef Association beef research guide, and accept their calculations for a standard 1300-pound steer.  Summing their numbers (and then noting after the fact that they do it themselves on page 7), this results in a cow containing 638.2 pounds of meat, of which 265.1 pounds are ground beef.  This is where you have to decide if Bob can offset future ground beef with the other cuts contained in the cow or not (is he charged 0.41 cows or 1 cow per cow?).  Assuming he's making standard quarter-pound burgers, this results in either 39.173 cows (if he's allowed to offset) or 94.304 (if all contributing cows count fully).
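Redoing the arithmetic explicitly, with the two accounting choices side by side:

```python
# 100,000 quarter-pound burgers, using the NCBA numbers quoted above.
burgers = 100_000
ground_beef_needed = burgers * 0.25      # pounds of ground beef

# Offset: the whole 638.2-pound carcass counts against the burgers.
cows_with_offset = ground_beef_needed / 638.2
# Strict: only the 265.1 pounds of ground beef per cow count.
cows_strict = ground_beef_needed / 265.1

print(round(cows_with_offset, 3), round(cows_strict, 3))  # → 39.173 94.304
```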

That's way lower than I would have guessed beforehand.

Saturday, 12 April 2014

I was going to talk about statistics, but instead I'm going to talk about sports.

No, not really.  Mostly it's statistics, just stapled onto a sports frame.

Firstly,

Sports.

I don't generally care about sports.

However, there was that Warren Buffett billion dollar thing, so it became slightly less annoying than it usually would be.  The important thing to remember was to not actually watch any of the games, because they really don't matter.

"Huh?"

Yeah.  Here's the secret: sports teams are just random number generators, and if they come up with a bigger number than their opponent, they win.

"But teams are made up of people..."

More RNGs. (Theorem left unproved due to obviousness: The output of the sum of multiple RNGs is itself an RNG.)

"that have different capabilities that influence how the game ends."

So you're telling me they're biased RNGs.

"And you have to incorporate details about how they're doing, and think about how pumped up they are for a given game, and all sorts of other things that influence the outcome!"

Got it.  Teams are biased RNGs, and if they come up with a bigger number, then they win.  Some RNGs have a built in bias, signifying that they're "a better team."

A Model.

The nice thing is that this is kind of already a solved problem: latent ability models.  Given a dataset representing the outcomes of many games, you can assign a score that represents how good a team is.  This "ability" is "latent" because you don't know what value it has, or even what determines the value in terms of real-world qualities.  But you can look at the data you have, and use it to decide whether, in a future matchup, team A or team B would win.  Here's a plot:

Once you've assigned abilities, you expect the probability of a victory to be related to the difference in abilities between the two teams.  You use a logistic function to model that probability, which also ensures that you never assign absolute (zero or one) probabilities to a matchup.
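A minimal sketch of that model in Python (a toy gradient-ascent fit of my own, not the code I actually used): abilities start at zero, and each observed game nudges the winner up and the loser down until the logistic win probabilities match the record.

```python
import math

def win_prob(a, b):
    """P(team with ability a beats team with ability b): logistic in the
    ability difference, so you never get a hard 0 or 1."""
    return 1.0 / (1.0 + math.exp(-(a - b)))

def fit_abilities(games, n_teams, steps=2000, lr=0.1):
    """Maximum-likelihood latent abilities by plain gradient ascent.
    `games` is a list of (winner_id, loser_id) pairs."""
    ability = [0.0] * n_teams
    for _ in range(steps):
        grad = [0.0] * n_teams
        for w, l in games:
            p = win_prob(ability[w], ability[l])
            grad[w] += 1.0 - p       # winner pulled up by surprise factor
            grad[l] -= 1.0 - p       # loser pulled down by the same amount
        for i in range(n_teams):
            ability[i] += lr * grad[i]
    return ability
```

Feeding in a toy season where team 0 goes 3-1, team 2 goes 1-1, and team 1 goes 1-3 orders the abilities the way you'd expect.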

"But what about teams that don't play each other?"

The ability of team A is based on the differences in ability with all of the teams they did play.  This means you don't need to know the full matrix of team A vs everyone else; you can use the teams A did play, and compare the scores.  In other words, if A is better than B, and B is better than C, odds are good that A is better than C, even if you've never seen A and C compete.

There is some concern about the connectivity of the dataset that can cause problems.  If the dataset is poorly connected, you can get scores where one subset is poorly scaled to the other subsets, as you don't have enough overlap to determine a good solution.  So, what's the connectivity of my dataset look like?
The logistic shape is largely coincidental.  Mathematically motivated, but not important now.  This is also a realization for a single team (team 790), because I didn't care to trace out the full connectivity statistics.
The teams team A plays are a tiny fraction (about 1-2% of the full set of teams).  However, the teams those teams play connects about 25% of the set.  Three steps yield 95%.  Four doesn't quite get to 100%, but I stopped caring at that point.
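The k-step connectivity check is just a breadth-first walk over the "who played whom" graph.  A toy version (the schedule here is made up, not the real dataset):

```python
from collections import deque

def reachable_fraction(schedule, start, max_steps):
    """Fraction of all teams reachable from `start` within k opponent
    hops, for k = 1..max_steps.  `schedule` maps team -> set of opponents."""
    teams = set(schedule)
    seen = {start}
    frontier = deque([start])
    fractions = []
    for _ in range(max_steps):
        next_frontier = deque()
        while frontier:
            t = frontier.popleft()
            for opp in schedule[t]:
                if opp not in seen:
                    seen.add(opp)
                    next_frontier.append(opp)
        frontier = next_frontier
        fractions.append(len(seen) / len(teams))
    return fractions

# A four-team chain: 0-1-2-3.
chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
```

On the real dataset this is the curve described above: a couple percent at one hop, about a quarter at two, 95% at three.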

So, since the data is reasonably well connected, you'll probably get a decent solution for the latent ability scores.  The final check (that I totally did when I was developing things, thus proving validity, and not at all just ten minutes ago when I realized I should include this) was to compare the ability scores I came up with against the official "seeds," which is a dumb name for "ranking priors."
Noisy but fairly linear.


My Results.

That's all nice and all, but how does it work?  First, the caveats: I wrote the LAM code and pushed some tests through in January, and then completely forgot about this whole thing until the day before the tournament started.  This means I basically had to accept the simple LAM solutions and go with them.  I did a full-year solution, and also broke the games down by month, using those solutions to look for trends (whether the abilities were improving or decreasing).  Finally, I used the scientific method of "which of these is bigger?" to choose who would win.

A side note unrelated to the statistics: dancing around team name strings was by far the most annoying part of this project.  Everyone should use clear and concise integer team ids, like I did.  It's a superior solution to inconsistent abbreviations.

Here's who I chose.
Not bad, as I even picked up a few of the upsets.  Things fell apart a bit with the later games, but 38/63 isn't bad, right?

Donking Sports.

At least I beat Julie, which was really what my goal was, if I couldn't win a billion dollars.
One of the things that I probably should have researched is the fact that people who do sports all the time are dumb.  Therefore, later games magically get more points than early games, so you aren't just trying to have more correct picks than the other people.  I suspect this is due to someone getting pissed off that they chose the winner correctly, but donked up everything else, and decided that they should win instead of someone who got most of the first round correct.  I can't come up with any other logic for it (if you get round 1 picks wrong, you automatically lose out on any subsequent round outcomes that are based on that incorrect pick.  Why add extra penalties?).
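For reference, a common pool convention (an assumption on my part; I don't know the exact values my pool used) doubles the per-game value each round, so the six rounds of a 63-game bracket are worth 1, 2, 4, 8, 16, and 32 points per correct pick:

```python
def bracket_score(correct_by_round, points_per_round=(1, 2, 4, 8, 16, 32)):
    """Score a 63-game bracket where later rounds are worth more.
    The doubling point values are a common pool convention, assumed here."""
    return sum(n * pts for n, pts in zip(correct_by_round, points_per_round))

# A perfect bracket: 32, 16, 8, 4, 2, 1 correct picks per round.
perfect = bracket_score([32, 16, 8, 4, 2, 1])
```

Under this scheme, picking only the champion correctly (32 points) beats picking 31 of the 32 first-round games (31 points), which is exactly the complaint above.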

In any case, you can see that I really wasn't too far back in terms of number correct.  You'll also note that I am actually #12, not #13, unless there's some other dumb tie-breaker rule that says that even though 12 and I have the same number of points, and I have more correct picks, I don't win because sports.

I also checked earlier in the week, and I didn't beat Barack Obama, regardless of whether you do number correct or this stupid point system.  I don't remember the values I got.  Like 40 correct and 68 points, maybe?  Whatever, I'm losing interest in this post anyway.

How Can This Be Improved?

As I mentioned above, I kind of rushed at the last minute, and spent more time dancing strings around than I wanted.  This means the theory and development were more rushed than I would have liked.  In any case, here are the obvious ways to improve things.

  1. Wichita State.  After determining everything, I later heard on NPR that WS was undefeated in the regular season.  "Great?"  Nope.  Because they never lost, the LAM had no reason not to rank them higher than everyone else.  Basically, no matter how high a score some other team got, WS would always rank higher.
  2. Time evolution.  My month-by-month breakout is way too chunky, and I'm pretty sure it didn't equally distribute the data.  Equal game count bins would be better, and that can be tuned to yield the minimum size that still provides a connected data set.
  3. Game quality.  The dataset I have contains scores for each game.  It's easy to imagine a LAM extension in which the distribution of score differences versus ability differences tells you something about the distribution of that team's ability.  The other option would be to use the score difference to amplify a given win.  If the victory was by 1000 points, you can probably assume that the winner is "more better" than the loser than if they'd only won by 1 point.
  4. Field advantage.  It also contains home/away status.  This might have a similar effect, where you don't weight home victories as much as away or something.
  5. Inter-year convolution.  I have lots of years.  Teams are comprised of players, and players are not new each year.  Therefore, you could convolve previous and subsequent year abilities with this year's, and see what that gives you.
  6. Data.  I have lots of years.  I should really have been applying the algorithm to all the years, and checking the results against what really happened.  Beyond just this one tournament, women's sports are identical, so that's another set of years with data and results that can be used to train the algorithm.  
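One concrete fix for point 1 is to put a Gaussian prior on the abilities, which keeps an undefeated team from running off to infinite ability.  A sketch (again a toy fit of my own, with a made-up regularization strength):

```python
import math

def win_prob(a, b):
    return 1.0 / (1.0 + math.exp(-(a - b)))

def fit_abilities_ridge(games, n_teams, lam=0.5, steps=2000, lr=0.1):
    """Latent-ability fit with a Gaussian prior on the abilities.
    The prior keeps an undefeated team (the Wichita State problem in
    point 1) bounded.  `lam` is a made-up regularization strength."""
    ability = [0.0] * n_teams
    for _ in range(steps):
        grad = [0.0] * n_teams
        for w, l in games:
            p = win_prob(ability[w], ability[l])
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        for i in range(n_teams):
            grad[i] -= lam * ability[i]   # prior term: pull toward zero
            ability[i] += lr * grad[i]
    return ability
```

With team 0 undefeated in a three-team toy season, the unregularized likelihood would push its ability toward infinity; the prior pins it at a finite value while preserving the ordering.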

Wednesday, 26 March 2014

I accuse Mr. Green of committing the crime in the Ballroom with the Lead Pipe!

Or, more to the actual topic of this post, I accuse old people of opposing progress with their not dying.

Confused yet?

So I came across something today that pointed me to this survey of gay marriage polling tracks over time.  I then noticed that this page also includes historical trends for interracial marriage as well.

My obvious question to myself was, "how much of that trend can you explain simply by assuming things are improving because old people die?"

That led to a search for detailed cohort data, because I need to know when people are dying.  Google google google, and I find this page, which claims the image is from the Census people, although they've since rearranged their webpage so the link doesn't work anymore.  I'm accepting that, because who lies about cohort data?  Crazy statisticians, I guess.

Here's the cohort image, copied in case that link disappears too.

I digitized that image, and then wrote a perl script to calculate the percentage of the population on a given date that was born before a certain year.  That number subtracted from 100 gives the percentage born after that year.  What does that look like?

Due to the five-year blocking of the cohorts, there's a stair-step effect.  I interpolated a best-fit line through that data, basically through the midpoints of each step.
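The percentage-born-after computation is a one-liner once the cohorts are digitized.  A sketch (in Python rather than the original perl) with made-up cohort numbers rather than the actual Census data:

```python
def percent_born_after(cohorts, cutoff_year):
    """Percent of the population born in or after `cutoff_year`, from
    five-year cohort bins.  `cohorts` maps bin start year -> population
    (the numbers below are illustrative, not the Census data)."""
    total = sum(cohorts.values())
    after = sum(pop for start, pop in cohorts.items() if start >= cutoff_year)
    return 100.0 * after / total

# Toy cohorts: bin start year -> population in that five-year bin.
toy = {1940: 10, 1945: 10, 1950: 20, 1955: 20, 1960: 20, 1965: 20}
```

Sliding the cutoff year (1960 for interracial marriage, 1980 for gay marriage) is what generates the curves above.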

Huh.  That worked far better than I expected.  Basically, ignoring all other factors, you can pretty much explain the entire trend if you just assume that everyone born after 1960 (who would have been kids during the Civil Rights Movement) is cool with interracial marriage, and that everyone born after 1980 (I'm going to use the Simpsons as my cultural shift here, because I don't want to go with "people my age") is cool with gay marriage.


Wednesday, 5 February 2014

Topology?

Topology is not my best math.  Sure, I solved the bridges of Königsberg problem in high school, as you do, and I can feign a laugh at the doughnut/coffee-cup trick that isn't decades old at all (seriously, come up with a new joke already), and I realized what the shape of the Space War universe was while playing the game on an Atari 2600 with no sun and no bouncing (also: old video games always had kick-ass covers).

Why are we talking about topology?  In E17, I like to have a number of virtual desktops, and I like them to be all connected to each other so I can flip between them by pushing on the edge of the screen.  FVWM did this years ago, and I always liked the idea.  E17 also supports wrapping of virtual desktops, so you end up with something like this:

My laptop uses a three by three grid, but it's the same thing.
Here there is a two-by-two grid of desktops, each connected to the others along wrapped edges.  This is obviously identical to the Space War universe, as you can imagine this as just a really crappy TV with only four pixels.  Therefore, my standard desktop arrangement is homeomorphic to a torus.  Easy peasy.
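The wrapped-grid navigation is just modular arithmetic on the desktop coordinates, which is exactly why the identification with a torus works.  A sketch (the coordinates and direction names are my own labels, not E17's):

```python
def move(pos, direction, rows=2, cols=2):
    """Next desktop when pushing off an edge of a wrapped rows x cols
    grid.  Wrapping both axes is what makes the layout a torus."""
    r, c = pos
    dr, dc = {"up": (-1, 0), "down": (1, 0),
              "left": (0, -1), "right": (0, 1)}[direction]
    return ((r + dr) % rows, (c + dc) % cols)
```

Pushing off the left edge of (0, 0) lands on (0, 1), and two rights in a row bring you back home.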

Now, at work, I have a dual monitor setup, and I assumed this would just work the same, but with fat desktops.  However, I discovered upon getting everything configured that it's actually this far more complicated thing:
Green arrows are bidirectional, as they were above.  Orange are unidirectional.
The two monitors have their own sets of virtual desktops, but, since going across the monitor boundary must put you on the other monitor, they're not connected together.  If you're on the left monitor, there is no right edge you can cross to get to the left side of the adjacent desktop (imagine the center red screen to pink).  You can get there by going to the unopposed left edge of this desktop to arrive on the right edge of that desktop, but you then can't go back, as you'd jump over to the other monitor.

I don't have proof (again, see "not my best math"), but I believe this defines a very oddly connected hollow torus.  I don't know how you construct a unidirectional connection in topology, but assuming that's valid, you basically have some sort of ratcheted torus for each monitor: you can travel poloidally unobstructed (corresponding to vertical shifts), but can only travel in one direction toroidally.  Switching between monitors is then functionally equivalent to passing through the ratchet onto the other side of the torus.  There's some ambiguity, since you can then shift each monitor separately, so where you come out on the other side of the torus isn't a priori obvious, but I think the bulk of the thing is sorted.

Anyway, I thought it was cool, and now my already-confusing-to-other-people desktop is going to get more confusing.