Saturday, 12 April 2014

I was going to talk about statistics, but instead I'm going to talk about sports.

No, not really.  Mostly it's statistics, just stapled onto a sports frame.

Firstly,

Sports.

I don't generally care about sports.

However, there was that Warren Buffett billion dollar thing, so it became slightly less annoying than it usually would be.  The important thing to remember was to not actually watch any of the games, because they really don't matter.

"Huh?"

Yeah.  Here's the secret: sports teams are just random number generators, and if they come up with a bigger number than their opponent, they win.

"But teams are made up of people..."

More RNGs. (Theorem left unproved due to obviousness: The output of the sum of multiple RNGs is itself an RNG.)

"that have different capabilities that influence how the game ends."

So you're telling me they're biased RNGs.

"And you have to incorporate details about how they're doing, and think about how pumped up they are for a given game, and all sorts of other things that influence the outcome!"

Got it.  Teams are biased RNGs, and if they come up with a bigger number, then they win.  Some RNGs have a built in bias, signifying that they're "a better team."

A Model.

The nice thing is that this is kind of already a solved problem.  Latent ability models.  Given a dataset representing the outcomes of many games, you can assign each team a score that represents how good it is.  This "ability" is "latent" because you don't know what value it has, or even what determines the value in terms of real-world qualities.  But you can look at the data you have, and use it to predict whether team A or team B would win a future matchup.  Here's a plot:

Once you've assigned abilities, you expect the probability of a victory to depend on the difference between the two teams' abilities.  You use this logistic function to model that probability, which keeps you from predicting absolute certainty (a probability of exactly 0 or 1) for any finite difference in ability.
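As a minimal sketch of that relationship: the logistic function turns an ability difference into a win probability.  The name `win_probability` and the unit scale of the abilities are assumptions for illustration, not something taken from the LAM code.

```python
import math

def win_probability(ability_a: float, ability_b: float) -> float:
    """Probability that team A beats team B under a logistic model.

    Only the difference in abilities matters; equal abilities give 0.5.
    """
    return 1.0 / (1.0 + math.exp(-(ability_a - ability_b)))

# Toy values, purely illustrative.
even_match = win_probability(0.0, 0.0)
favorite = win_probability(1.5, 0.0)
underdog = win_probability(0.0, 1.5)
```

The two sides of a matchup always sum to one, so `favorite + underdog` is 1 no matter which abilities you plug in.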

"But what about teams that don't play each other?"

The ability of team A is based on the differences in ability with all of the teams they did play, and those teams' abilities are in turn based on the teams they played.  This means you don't need to know the full matrix of team A vs everyone else; you can use the teams A did play, and compare the scores.  In other words, if A is better than B, and B is better than C, odds are good that A is better than C, even if you've never seen A and C compete.
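To make the A-B-C argument concrete, here is a rough sketch of fitting abilities from win/loss records by gradient ascent on the logistic likelihood.  This is one standard way to fit such a model, not necessarily how the LAM code does it; `fit_abilities`, the step size, and the toy games are all hypothetical.

```python
import math

def fit_abilities(games, n_iter=2000, step=0.1):
    """Fit one ability per team from (winner, loser) pairs.

    Plain gradient ascent on the logistic log-likelihood.  Abilities are
    re-centered to mean zero, since only differences are identifiable.
    """
    teams = sorted({team for game in games for team in game})
    ability = {team: 0.0 for team in teams}
    for _ in range(n_iter):
        grad = {team: 0.0 for team in teams}
        for winner, loser in games:
            p_win = 1.0 / (1.0 + math.exp(ability[loser] - ability[winner]))
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        for team in teams:
            ability[team] += step * grad[team]
        mean = sum(ability.values()) / len(teams)
        for team in teams:
            ability[team] -= mean
    return ability

# Toy data: A and C never meet, but both have played B.
games = [("A", "B")] * 3 + [("B", "A")] + [("B", "C")] * 3 + [("C", "B")]
ability = fit_abilities(games)
p_a_beats_c = 1.0 / (1.0 + math.exp(ability["C"] - ability["A"]))
```

With this toy data A ends up above B, and B above C, so the model gives A a better-than-even chance against C (about 0.9 here) without ever having seen them play.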

There is some concern about the connectivity of the dataset that can cause problems.  If the dataset is poorly connected, you can get scores where one subset is poorly scaled to the other subsets, as you don't have enough overlap to determine a good solution.  So, what does the connectivity of my dataset look like?

The logistic shape is largely coincidental.  Mathematically motivated, but not important now.  This is also a realization for a single team (team 790), because I didn't care to trace out the full connectivity statistics.

The teams team A plays are a tiny fraction of the full set of teams (about 1-2%).  However, the teams those teams play connect about 25% of the set.  Three steps yield 95%.  Four don't quite get to 100%, but I stopped caring at that point.
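That step-by-step reach can be measured with a breadth-first walk over the "who played whom" graph.  A minimal sketch, assuming games are stored as pairs of integer team ids; `reach_by_step` is a hypothetical helper, not the code behind the numbers above.

```python
from collections import defaultdict

def reach_by_step(games, start, max_steps):
    """Fraction of the other teams reachable from `start` within 1..max_steps games."""
    neighbors = defaultdict(set)
    for team_a, team_b in games:
        neighbors[team_a].add(team_b)
        neighbors[team_b].add(team_a)
    n_others = len(neighbors) - 1  # every team except the starting one
    seen = {start}
    frontier = {start}
    fractions = []
    for _ in range(max_steps):
        # Opponents of the current frontier that have not been reached yet.
        frontier = {opp for team in frontier for opp in neighbors[team]} - seen
        seen |= frontier
        fractions.append((len(seen) - 1) / n_others)
    return fractions

# Toy schedule: a chain 1-2-3-4-5, so each step reaches one more team.
chain = [(1, 2), (2, 3), (3, 4), (4, 5)]
fractions = reach_by_step(chain, start=1, max_steps=4)
```

If the fraction stalls below 1 no matter how many steps you allow, the schedule has disconnected pockets of teams, which is exactly the scaling problem described above.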

So since the data is reasonably well connected, you'll probably get a decent solution to the latent ability scores.  The final check (that I totally did when I was developing things, thus proving validity, and not at all just ten minutes ago when I realized I should include this) was to compare the ability scores that I came up with against the official "seeds," which is a dumb name for "ranking priors."
Noisy but fairly linear.


My Results.

That's all nice and all, but how does it work?  First, the caveats:  I wrote the LAM code and pushed some tests through in January, and then completely forgot about this whole thing until the day before the tournament started.  This means I basically had to accept the simple LAM solutions, and go with them.  I did a full year solution, and also broke the games down by month, using those solutions to look for trends (whether the abilities were improving or declining).  Finally, I used the scientific method of "which of these is bigger?" to choose who would win.
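The "which is bigger" rule amounts to walking the bracket and advancing the team with the higher ability in every game.  A rough sketch, assuming the bracket is a list of integer team ids in bracket order and the abilities come from the full-year solution; `fill_bracket`, the toy values, and the tie rule are hypothetical.  It leaves out the month-by-month trends, since how those were folded into the picks isn't spelled out here.

```python
def fill_bracket(bracket, ability):
    """Return the predicted winners of each round, first round first.

    Adjacent entries in `bracket` play each other; the team with the larger
    ability advances.  Ties go to the first-listed team (an arbitrary choice).
    """
    rounds = []
    current = list(bracket)
    while len(current) > 1:
        winners = [a if ability[a] >= ability[b] else b
                   for a, b in zip(current[0::2], current[1::2])]
        rounds.append(winners)
        current = winners
    return rounds

# Toy four-team bracket with made-up abilities.
toy_ability = {1: 0.5, 2: 1.0, 3: -0.2, 4: 0.1}
picks = fill_bracket([1, 2, 3, 4], toy_ability)
```

Here team 2 beats team 1, team 4 beats team 3, and team 2 then takes the final, so `picks` is `[[2, 4], [2]]`.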

A side note unrelated to the statistics:  dancing around team name strings was by far the most annoying part of this project.  Everyone should use clear and concise integer team ids, like I did.  It's a superior solution to inconsistent abbreviations.

Here's who I chose.
Not bad, as I even picked up a few of the upsets.  Things fell apart a bit with the later games, but 38/63 isn't bad, right?

Donking Sports.

At least I beat Julie, which was really what my goal was, if I couldn't win a billion dollars.
One of the things that I probably should have researched is the fact that people who do sports all the time are dumb.  Therefore, later games magically get more points than early games, so you aren't just trying to have more correct picks than the other people.  I suspect this is due to someone getting pissed off that they chose the winner correctly, but donked up everything else, and decided that they should win instead of someone who got most of the first round correct.  I can't come up with any other logic for it (if you get round 1 picks wrong, you automatically lose out on any subsequent round outcomes that are based on that incorrect pick.  Why add extra penalties?).
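The gap between "number correct" and "points" is easy to see in a small sketch.  The per-round values below (1, 2, 4) are hypothetical doubling weights chosen for illustration; the pool's actual point values aren't given here, and `score_bracket` is not any site's real scoring code.

```python
def score_bracket(picks, results, round_points):
    """Return (number of correct picks, weighted points) for one bracket.

    `picks` and `results` list the winners of each round, first round first.
    A pick counts if that team really did win a game in that round.
    """
    correct = 0
    points = 0
    for round_picks, round_results, value in zip(picks, results, round_points):
        hits = len(set(round_picks) & set(round_results))
        correct += hits
        points += hits * value
    return correct, points

# Toy eight-team tournament with made-up team ids.
results = [[1, 3, 5, 7], [1, 5], [1]]
round_points = [1, 2, 4]  # hypothetical: each round is worth double the last
early_expert = score_bracket([[1, 3, 5, 7], [3, 7], [3]], results, round_points)
champion_only = score_bracket([[1, 4, 6, 8], [1, 6], [1]], results, round_points)
```

The first bracket nails the whole first round and gets 4 picks right for 4 points; the second gets only 3 right but follows the eventual champion and collects 7 points.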

In any case, you can see that I really wasn't too far back in terms of number correct.  You'll also note that I am actually #12, not #13, unless there's some other dumb tie-breaker rule that says that even though 12 and I have the same number of points, and I have more correct picks, I don't win because sports.

I also checked earlier in the week, and I didn't beat Barack Obama, regardless of whether you do number correct or this stupid point system.  I don't remember the values I got.  Like 40 correct and 68 points, maybe?  Whatever, I'm losing interest in this post anyway.

How Can This Be Improved?

As I mentioned above, I kind of rushed at the last minute, and spent more time dancing strings around than I wanted.  This means the theory and development time was shorter than I would have liked.  In any case, here are the obvious ways to improve things.

  1. Wichita State.  After determining everything, I later heard on NPR that WS was undefeated in the regular season.  "Great?"  Nope.  Because they never lost, the LAM had no reason not to rank them higher than everyone else.  Basically, no matter how high a score some other team got, WS would always rank higher.
  2. Time evolution.  My month-by-month breakout is way too chunky, and I'm pretty sure it didn't equally distribute the data.  Equal game count bins would be better, and that can be tuned to yield the minimum size that still provides a connected data set.
  3. Game quality.  The dataset I have contains scores for each game.  It's easy to imagine a LAM extension in which the distribution of score differences by ability differences tells you something about the distribution of that team's ability.  The other option would be to use that score difference to amplify a given win.  If the victory was by 1000 points, you can probably assume that the winner is "more better" than the loser than they would be if they'd only won by 1 point.
  4. Field advantage.  The dataset also contains home/away status.  This might have a similar effect, where you don't weight home victories as much as away victories, or something.
  5. Inter-year convolution.  I have lots of years.  Teams are composed of players, and players are not new each year.  Therefore, you could convolve previous and subsequent year abilities with this year's, and see what that gives you.
  6. Data.  I have lots of years.  I should really have been applying the algorithm to all the years, and checking the results against what really happened.  Beyond just this one tournament, women's sports are identical, so that's another set of years with data and results that can be used to train the algorithm.
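The Wichita State problem in item 1 can be reproduced on toy data: with an undefeated team, the plain maximum-likelihood fit never settles, because pushing that team's ability higher always improves the fit a little.  One common remedy, which is a suggestion here and not something the LAM code does, is a small penalty that pulls abilities toward zero (equivalent to a Gaussian prior).  The function name, penalty value, and games below are all hypothetical.

```python
import math

def fit_abilities_ridge(games, penalty=0.0, n_iter=1000, step=0.05):
    """Gradient-ascent logistic fit with an optional L2 penalty on abilities."""
    teams = sorted({team for game in games for team in game})
    ability = {team: 0.0 for team in teams}
    for _ in range(n_iter):
        # The penalty term pulls every ability back toward zero.
        grad = {team: -penalty * ability[team] for team in teams}
        for winner, loser in games:
            p_win = 1.0 / (1.0 + math.exp(ability[loser] - ability[winner]))
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        for team in teams:
            ability[team] += step * grad[team]
    return ability

# Toy data: team 1 is undefeated; teams 2 and 3 split their games.
games = [(1, 2)] * 3 + [(2, 3), (3, 2)]

def gap(ability):
    """Ability difference between the undefeated team and its opponent."""
    return ability[1] - ability[2]

plain_short = gap(fit_abilities_ridge(games, n_iter=1000))
plain_long = gap(fit_abilities_ridge(games, n_iter=10000))
ridge_short = gap(fit_abilities_ridge(games, penalty=0.5, n_iter=1000))
ridge_long = gap(fit_abilities_ridge(games, penalty=0.5, n_iter=10000))
```

Without the penalty the gap keeps growing the longer you iterate; with it, the short and long runs agree and the undefeated team still ranks first, just by a finite margin.  How strong the penalty should be is a tuning question, and checking it against past tournaments (item 6) would be the natural way to answer it.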