Friday, 10 January 2014

Fish sandwich is about this early, and twitter works right on tv because I see how the lazy weekend is bizarre conniving prostitutes prostitutes

-Or-
So, let's talk about Markov Chains, I guess.

This news story popped up in the RSS today, which reminded me of the long-ago time when Forum 2000 was a thing (note: Forum 2000 never actually worked like that; it was all basically a mechanical turk).

I then remembered that you can download all your tweets from twitter, so blammo, I have the text corpus necessary to hack up something to do this.

So, those Markov Chains.  Here's the simple way to think of it:  starting from an initial point A, there's a probability p_B that you'll move to point B, p_C that you'll move to C, and so on.  Each of those points has its own set of transition probabilities, so you can chain steps together.  Maybe go look at the wikipedia figure.  This isn't turning out to be that simple of an explanation.
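
To maybe make that more concrete than the wikipedia figure, here's a toy chain written out as perl instead.  The states and probabilities are invented purely for illustration:

  use strict;
  use warnings;

  # toy transition table: from A you go to B with probability 0.7, to C with
  # 0.3, and so on; each state's outgoing probabilities sum to 1
  my %p = (
      A => { B => 0.7, C => 0.3 },
      B => { A => 0.5, C => 0.5 },
      C => { A => 1.0 },
  );

  # take one step of the chain: draw the next state from $p{$current}
  sub step_chain {
      my ($current) = @_;
      my $r = rand();
      my $next;
      for my $candidate (sort keys %{ $p{$current} }) {
          $next = $candidate;
          last if ($r -= $p{$current}{$candidate}) <= 0;
      }
      return $next;
  }

  # chaining things together is just taking step after step
  my $state = 'A';
  $state = step_chain($state) for 1 .. 5;
  print "$state\n";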

In any case, if you let each "point" be a word in a tweet, and you have a large sample of tweets from someone, then you can construct fake tweets once you know the probability that a given word follows another.  That's what I did in perl, and that's where the title came from.  I did a simple (first-order) chain, where w_{i+1} is drawn directly from the probability distribution of words that follow word w_i.  There's a reset condition: if the set of words that follow w_i is empty, I restart the chain from the set of "words that start a tweet".  I've also forced the length of the generated string to be 140 characters, the standard tweet length (and forced everything to lowercase, stripped all the punctuation, etc., etc., etc.).
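
Here's roughly the shape of the thing.  To be clear, this is a minimal sketch of the approach and not the actual script; the tweets.txt filename, the cleanup regex, and the variable names are all made up for illustration:

  #!/usr/bin/perl
  use strict;
  use warnings;

  my %next;    # $next{$word} = list of words observed following $word
  my @starts;  # words that start a tweet, used for the reset condition

  # assumes one tweet per line in tweets.txt (made-up filename)
  open my $fh, '<', 'tweets.txt' or die "can't open tweets.txt: $!";
  while (my $line = <$fh>) {
      chomp $line;
      $line = lc $line;                    # force everything lowercase
      $line =~ s/[^a-z0-9@#_:\-\s]//g;     # crude punctuation stripping
      my @words = grep { length } split /\s+/, $line;
      next unless @words;
      push @starts, $words[0];
      for my $i (0 .. $#words - 1) {
          push @{ $next{ $words[$i] } }, $words[ $i + 1 ];
      }
  }
  close $fh;

  # build one fake tweet: w_{i+1} is drawn uniformly from the words observed
  # after w_i, which is the same as drawing from their probability distribution
  my $word  = $starts[ rand @starts ];
  my $tweet = $word;
  while (length $tweet < 140) {
      my $followers = $next{$word};
      if ($followers && @$followers) {
          $word = $followers->[ rand @$followers ];
      }
      else {
          $word = $starts[ rand @starts ];   # reset condition
      }
      $tweet .= " $word";
  }
  print substr($tweet, 0, 140), "\n";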

Here are some more examples:
  • i was far the whole dinner but that @wholefoods sushi was aborted because its weird being even fucking clue what hes a cheater :-p theres no
  • hey verizon network that too hungry or more not sure that flasks were afraid the sun orbit simulator its now its someplace else too soggy as
  • happy birthday @jkru isnt helping anybody know and hes never a 12 16 garbage bags tied directly to work - british accent for being wasted its
  • rt @vgc_scott man that quote and cranky tonights dinner oh man its still preparing the oil is actively looking up a workaround to be going to
  • first time you dont want a walk an impenetrable island which is like it looks close eyes will always just woke up dry before i not having intermotrons
  • @jkru its all the music #fuckedupdream peter potomus cartoons are you your app its official word for liberty tsa approved for my implementation
  • dear this is what is terrible monsters outside who like fog and colby isnt even left there to the 100th day yay deposed monarchy ok so hopefully
Those are just close enough to not being gibberish that you could almost believe they aren't computer generated.  The crazy thing about MCs is that they do a wonderful job of constructing sentences that are nearly grammatically correct.  This could probably be improved a lot, say by picking the next word based on the previous two words (a second-order chain; see the sketch below), or by clearly marking where the chain restarts with a period or something.  Still, for like ten minutes of work, this isn't too bad.
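
For what it's worth, the previous-two-words idea just means keying the follower table on word pairs instead of single words.  A rough sketch of the change, again with made-up names and not the actual script:

  use strict;
  use warnings;

  # second-order variant: the follower table is keyed on the previous two
  # words; @words here stands in for one tokenized tweet
  my @words = qw(this is just a stand-in example tweet);
  my %next2;
  for my $i (0 .. $#words - 2) {
      push @{ $next2{"$words[$i] $words[$i+1]"} }, $words[ $i + 2 ];
  }

  # when generating, you'd look up followers with the last two words, e.g.
  # $next2{"$prev $word"}, and fall back to the reset condition when that
  # key is missing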

Finally, an interesting sampling from printing out the chain transition probabilities (the columns are current word, next word, probability, count of that pair, and total count of the current word):
fuck you 0.294392523364486 63 214
fuck it 0.0794392523364486 17 214
fuck fuck 0.0700934579439252 15 214
fuck yeah 0.0700934579439252 15 214
fuck up 0.0327102803738318 7 214
[...]
fuck #sandwichtweets 0.00467289719626168 1 214
fuck shitshitshitshitshitshitshitshitshitshitshitfuckshitshitfuck 0.00467289719626168 1 214
[...]
fucking ghost 0.00454545454545455 1 220
[...]
fuckity fuck 1 3 3
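
That table falls straight out of the %next hash from the sketch above: for each word, the probability is just (times the pair was seen) / (times the word was seen with any follower).  Roughly like this, though again it's a guess at the shape and not the actual code:

  # continues the sketch above and reuses %next; sort each word's followers
  # by how often they appear and print pair, probability, count, total
  for my $word (sort keys %next) {
      my $total = scalar @{ $next{$word} };
      my %count;
      $count{$_}++ for @{ $next{$word} };
      for my $follower (sort { $count{$b} <=> $count{$a} } keys %count) {
          printf "%s %s %s %d %d\n",
              $word, $follower, $count{$follower} / $total, $count{$follower}, $total;
      }
  }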
