Dragoon’s Markov Chain

Dragoon is a simple Markov-chain based post generator for App.net. At a high level, it loads recent posts by the current user from App.net until there’s at least 201 posts loaded, or no more to load. It then builds up a chain and generates a string from it that’s no more than 247 characters if ‘Include #Dragoon in the posts?’ is selected or 256 characters otherwise.

Depending on how the process was instigated, it is then either displayed to the user or directly posted to App.net. Then hilarity happens.

There are two main parts to the post generation: creating the markov chain and creating the actual post. I’m going to provide code and details on most of those parts. The glue code is left as an exercise for the reader.

Chain Generation

For those that don’t recognise the language I’m using, it’s Hack.

protected function buildChain(Traversable<string> $posts): Chain {
    $chain = Map {};
    foreach ($posts as $post) {
        $words = preg_split('/s+/m', trim($post));
        $words = array_filter($words, $word ==> $word[0] != '@' && $word[0] != '#');
        $previous = '';
        foreach ($words as $word) {
            if (!$chain->contains($previous)) $chain[$previous] = Map {};
            if (!$chain[$previous]->contains($word)) {
                $chain[$previous][$word] = 1;
            } else {
                $chain[$previous][$word]++;
            }
            $previous = $word;
        }
        if (!$chain->contains($previous)) $chain[$previous] = Map {};
        if (!$chain[$previous]->contains('')) {
            $chain[$previous][''] = 1;
        } else {
            $chain[$previous]['']++;
        }
    }
    return $chain;
}

Chain is a typedef for Map<string, Map<string, int>>.

The logic of the function is contained inside the two foreach loops. Starting with the outer one:

$words = preg_split('/s+/m', trim($post));
$words = array_filter($words, $word ==> $word[0] != '@' && $word[0] != '#');
$previous = '';

This takes the post, splits it into words and then removes hashtags and mentions. The use of a regex here for splitting is just to stop empty words from showing up. It then sets the previous word to the empty string. This value is used to denote the beginning (or end) of a string.

The code then enters the inner foreach loop:

if (!$chain->contains($previous)) $chain[$previous] = Map {};
if (!$chain[$previous]->contains($word)) {
    $chain[$previous][$word] = 1;
} else {
    $chain[$previous][$word]++;
}
$previous = $word;

This simply increments the count of the previous-current word pair, then sets the previous word to the current one. Finally, we reach return to the outer loop:

if (!$chain->contains($previous)) $chain[$previous] = Map {};
if (!$chain[$previous]->contains('')) {
    $chain[$previous][''] = 1;
} else {
    $chain[$previous]['']++;
}

This increments the count of the final word-end of string pair. This completes the parsing of the posts. Once all posts have been parsed, we then return the generated Markov chain.

Post generation

There are two methods to this part. Staring with the outer one.

protected function buildText(Chain $chain): string {
    $text = '';
    $word = '';
    do {
        $word = $this->nextWord($word, $chain);
        $text .= $word . ' ';
    } while($word != '');
    return trim($text);
}

This keeps calling the nextWord method, passing in the previously returned word, until it reaches the end of string marker.

protected function nextWord(string $previous, Chain $chain): string {
    if (!$chain->contains($previous)) {
        return '';
    }
    $block = $chain[$previous];
    $total = array_sum($block);
    $num = mt_rand(1, $total);
    $w = '';
    foreach($block as $w => $v) {
        $num -= $v;
        if ($num <= 0) {
            return $w;
        }
    }
    return $w;
}

This is the main logic of the post generator. It gets passed the previous word and the entire chain and returns the next word. Breaking it down further:

if (!$chain->contains($previous)) {
    return '';
}

If passed in a word that isn’t in the chain, immediately exit with the end of string marker.

$block = $chain[$previous];
$total = array_sum($block);
$num = mt_rand(1, $total);

Here we get the calculated data about words following the previous word (which is a map of next word to the frequency of the word), then sum up frequencies. A random number between 1 and that total (inclusive) is then picked to be used to decide what word to pick next.

$w = '';
foreach($block as $w => $v) {
    $num -= $v;
    if ($num <= 0) {
        return $w;
    }
}
return $w;

Finally, the next word is picked. This is done by going through the calculated data until we’ve seen the random number amount of words. This word is then return.

There you have it. That is how the Markov chain in Dragoon works.

PayPal: simon@simon.geek.nz

 

Leave a Reply