Wednesday, 15 January 2014

I am irrationally concerned with good statistics

k, statistics again.  The problem is that I saw this article today, which basically complains that "no one really means to use standard deviation, as people intrinsically want to use the mean absolute deviation", which is, of course, completely dumb.

First, no one would ever do mean absolute deviation in their head.  Here are some numbers: {-1, 2, 3, -5, 1, 400}.  If you had to guess another number that would belong to this set, you're going to guess something like "dunno, zero maybe?"  You know that 400 is probably wrong, so you cut it out.  (The actual mean of that set is 66.7, which nobody would ever guess.)  People don't take real means when they filter data; it's some combination of a mode and a median: choose a number that doesn't seem crazy.

Second, the mean absolute deviation tells you roughly where the 50% point falls (for a Gaussian it actually covers about 57.5% of the samples, so it's not even a tidy quantile).  Why stop there?  The standard deviation is more inclusive, as it tells you that most samples (68.27% for a Gaussian) are within one sigma of the central value.
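If you don't trust those coverage numbers, they're a two-minute check.  A quick perl sketch (Box-Muller for the Gaussian deviates; this is just a demo, not anything from the actual analysis code):

#!/usr/bin/perl
use strict;
use warnings;

# Draw standard-normal samples via Box-Muller and measure what
# fraction of them land within one standard deviation vs within
# one mean absolute deviation of the center.
my $N  = 100_000;
my $pi = 3.141592653589793;
my @x;
push @x, sqrt(-2 * log(1 - rand())) * cos(2 * $pi * rand()) for 1 .. $N;

my ($sum_sq, $sum_abs) = (0, 0);
for my $v (@x) {
    $sum_sq  += $v * $v;
    $sum_abs += abs $v;
}
my $sigma  = sqrt($sum_sq / $N);   # true mean is 0, so skip centering
my $meanad = $sum_abs / $N;

my $in_sigma  = grep { abs($_) < $sigma }  @x;
my $in_meanad = grep { abs($_) < $meanad } @x;
printf "sigma  = %.3f, covers %.1f%% of the sample\n", $sigma,  100 * $in_sigma  / $N;
printf "meanad = %.3f, covers %.1f%% of the sample\n", $meanad, 100 * $in_meanad / $N;

That comes out around sigma = 1.0 covering ~68% of the sample and meanad = 0.80 covering ~57%.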

Third, there's all the obvious stuff about moment analysis: the variance is the second central moment, so the standard deviation is what naturally plugs into everything downstream (error propagation, fitting, and so on).

Anyway, time for plots.  These are the same idea as the ones from the previous post, just remade with more samples and different stats.  The horizontal lines are the true uncontaminated distribution sigma and the true fully contaminated sigma (sigma_uniform = sqrt((b - a)^2 / 12), because math).  First thing to note: the actual sigma cleanly switches between the two extremes, as it really should.  Gaussian fits are best, but IQD and MAD are comparable up to the 50% contamination point.  MeanAD doesn't seem particularly good.  The fully contaminated end is biased because I'm fitting a parametric model (namely, that the distribution is Gaussian) to data that isn't.
Biased samples:  This one nicely shows that IQD fails before MAD, and that Gaussian fits are reasonable up to 60% contamination.  MeanAD is again off doing its own thing.  Median >>> mean for outlier rejection.
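For reference, the simple estimators behind those curves can be sketched in perl like this.  This is a from-scratch sketch, not the plotting code, and I'm assuming MeanAD is taken around the mean and MAD around the median; the constants (1.4826, sqrt(pi/2), 1.349) are the usual factors that make each estimator equal sigma for a pure Gaussian:

#!/usr/bin/perl
use strict;
use warnings;

sub median {
    my @s = sort { $a <=> $b } @_;
    my $n = @s;
    return $n % 2 ? $s[$n / 2] : ($s[$n / 2 - 1] + $s[$n / 2]) / 2;
}

sub quantile {    # crude quantile, good enough for a demo
    my ($q, @d) = @_;
    my @s = sort { $a <=> $b } @d;
    return $s[ int($q * $#s) ];
}

my @x = @ARGV ? @ARGV : (-1, 2, 3, -5, 1, 400);

my $mean = 0;
$mean += $_ / @x for @x;
my $med = median(@x);

my $var = 0;
$var += ($_ - $mean) ** 2 / (@x - 1) for @x;
my $sigma = sqrt($var);                                # standard deviation

my $mad = 1.4826 * median(map { abs($_ - $med) } @x);  # median absolute deviation

my $meanad = 0;
$meanad += abs($_ - $mean) / @x for @x;
$meanad *= sqrt(3.141592653589793 / 2);                # mean absolute deviation

my $iqd = (quantile(0.75, @x) - quantile(0.25, @x)) / 1.349;  # interquartile distance

printf "sigma=%.2f  MAD=%.2f  MeanAD=%.2f  IQD=%.2f\n", $sigma, $mad, $meanad, $iqd;

On the toy set from above, MAD (~3) and IQD (~2) stay put while sigma (~163) and MeanAD (~139) get dragged off by the 400, which matches the pecking order in the plots.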

Friday, 10 January 2014

Fish sandwich is about this early, and twitter works right on tv because I see how the lazy weekend is bizarre conniving prostitutes prostitutes

-Or-
So, let's talk about Markov Chains, I guess.

This news story popped up in the RSS today, which reminded me of the long-ago time when Forum 2000 was a thing (note: Forum 2000 never worked like that; it was all basically a mechanical turk).

I then remembered that you can download all your tweets from twitter, so blammo, I have the text corpus necessary to hack up something to do this.

So, those Markov Chains.  Here's the simple way to think of it: starting from an initial point A, there's a probability p_B that you'll move to point B, a probability p_C that you'll move to C, and so on.  Each of those destination points has its own set of transition probabilities too, so you can chain steps together.  Maybe go look at the wikipedia figure.  This isn't turning out to be that simple of an explanation, so here's a toy version in code instead.
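A hypothetical three-state chain and a weighted random walk over it (the probabilities are made up, purely for illustration):

#!/usr/bin/perl
use strict;
use warnings;

# A toy three-state chain: from each state the outgoing
# probabilities sum to one.
my %chain = (
    A => { B => 0.7, C => 0.3 },
    B => { A => 0.5, C => 0.5 },
    C => { A => 1.0 },
);

# Take one weighted random step away from $state.
sub step {
    my ($state) = @_;
    my $r = rand();
    for my $next (keys %{ $chain{$state} }) {
        $r -= $chain{$state}{$next};
        return $next if $r <= 0;
    }
}

# Chain ten steps together, starting from A.
my $state = 'A';
for (1 .. 10) {
    print $state;
    $state = step($state);
}
print "\n";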

In any case, if you let each "point" be a word in a tweet, and you have a large sample of tweets from someone, you can construct fake tweets once you know the probability that a given word follows another.  That's what I did in perl, and that's where the title came from.  I did a simple chain, where w_{i+1} is drawn directly from the probability distribution of words that follow w_i.  There's a reset condition: if the set of words that follow w_i is empty, I restart the chain from the set of "words that start a tweet".  I also forced the length of the generated string to 140 characters, the standard tweet length (and forced everything lowercase, and removed all punctuation, etc, etc, etc).  A rough sketch of that script is below.
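I never posted the real script, so this is a reconstruction from the description above, not the original; tweets.txt (one tweet per line) and the exact cleanup regex are assumptions:

#!/usr/bin/perl
use strict;
use warnings;

# Build the transition table from the tweet archive: %follow{$word}
# holds every word ever seen following $word, duplicates intact, so
# a uniform pick from the list is automatically frequency-weighted.
my (%follow, @starts);
open my $fh, '<', 'tweets.txt' or die "tweets.txt: $!";
while (my $tweet = <$fh>) {
    chomp $tweet;
    $tweet = lc $tweet;
    $tweet =~ s/[^a-z0-9@#_\s-]//g;    # guess at "removed all punctuation"
    my @words = split ' ', $tweet;
    next unless @words;
    push @starts, $words[0];
    push @{ $follow{ $words[$_] } }, $words[$_ + 1] for 0 .. $#words - 1;
}
close $fh;
die "no tweets found\n" unless @starts;

# Walk the chain out to 140 characters.  If the current word has no
# known followers, reset from the "words that start a tweet" pool.
my $out = $starts[ rand @starts ];
while (length($out) < 140) {
    my ($last) = $out =~ /(\S+)$/;
    my $pool   = $follow{$last};
    my $next   = ($pool && @$pool) ? $pool->[ rand @$pool ]
                                   : $starts[ rand @starts ];
    $out .= " $next";
}
print substr($out, 0, 140), "\n";

Keeping duplicates in the follower lists is the trick: a uniform pick from the list reproduces the empirical transition probabilities without ever computing them explicitly.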

Here are some more examples:
  • i was far the whole dinner but that @wholefoods sushi was aborted because its weird being even fucking clue what hes a cheater :-p theres no
  • hey verizon network that too hungry or more not sure that flasks were afraid the sun orbit simulator its now its someplace else too soggy as
  • happy birthday @jkru isnt helping anybody know and hes never a 12 16 garbage bags tied directly to work - british accent for being wasted its
  • rt @vgc_scott man that quote and cranky tonights dinner oh man its still preparing the oil is actively looking up a workaround to be going to
  • first time you dont want a walk an impenetrable island which is like it looks close eyes will always just woke up dry before i not having intermotrons
  • @jkru its all the music #fuckedupdream peter potomus cartoons are you your app its official word for liberty tsa approved for my implementation
  • dear this is what is terrible monsters outside who like fog and colby isnt even left there to the 100th day yay deposed monarchy ok so hopefully
Those are just close enough to non-gibberish that you could almost believe they weren't computer generated.  The crazy thing about MCs is that they do a wonderful job of constructing sentences that are nearly grammatically correct.  This could probably be improved a lot, say by picking words with some influence from the second-previous word, or by clearly marking where the chain restarted with a period or something (a sketch of the two-word idea is below).  Still, for like ten minutes of work, this isn't too bad.
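For the second-previous-word idea, the only real change is the table key.  A hypothetical fragment, building on the sketch above:

# Second-order variant (hypothetical): key the table on word pairs,
# so w_{i+1} is drawn from the words seen following (w_{i-1}, w_i).
my %follow2;
my @words = qw(stand-in words from one tweet of the corpus);
for my $i (0 .. $#words - 2) {
    push @{ $follow2{"$words[$i] $words[$i+1]"} }, $words[$i + 2];
}
# At generation time, look up the last two words of the string so
# far, and fall back to the one-word table when the pair is unseen.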

Finally, an interesting sampling from printing out the chain transition probabilities.  The columns are the word pair, the transition probability, the number of times the pair appears, and the total count for the first word:
fuck you 0.294392523364486 63 214
fuck it 0.0794392523364486 17 214
fuck fuck 0.0700934579439252 15 214
fuck yeah 0.0700934579439252 15 214
fuck up 0.0327102803738318 7 214
[...]
fuck #sandwichtweets 0.00467289719626168 1 214
fuck shitshitshitshitshitshitshitshitshitshitshitfuckshitshitfuck 0.00467289719626168 1 214
[...]
fucking ghost 0.00454545454545455 1 220
[...]
fuckity fuck 1 3 3
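That listing is one loop over the transition table; something like this would produce it (again assuming the %follow table from the generator sketch above):

# Dump the transitions for one word, most probable first, in the
# same format as the listing.
my $word = 'fuck';
my %count;
$count{$_}++ for @{ $follow{$word} };
my $total = @{ $follow{$word} };
for my $next (sort { $count{$b} <=> $count{$a} } keys %count) {
    print "$word $next ", $count{$next} / $total, " $count{$next} $total\n";
}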