Let us use a mathematical structure called a Markov chain to model the
statistical likelihood of a word in a title being followed by some other word
in a title.
Then, I could use that statistical information to generate new titles by
choosing the first word (at random) and then choosing subsequent words with a
frequency proportional to how those words are arranged in the original text.
This will give me a string of text that may not be in the original text, but
will share stylistic properties of the original text. While my first approach to
writing this generation code did explicitly compute the statistical properties
as described above, I later decided to avoid those explicit computations,
trading a little extra memory usage for much cleaner code. The code and the
compromise are explained below.
As an illustration of how this Markov chain model might look for this list of
titles, consider this small subset of titles:
The Christmas Wish
The Client List
The Cold Heart of a Killer
The Color of Courage
The Color of Love: Jacey's Story
The Color Purple
We begin by constructing a list of unique words that appear in these titles:
"The", "Christmas", "Wish", "Client", "List", "Clique", "Cold", "Heart", "of",
"a", "Killer", "Colony", "Color", "Courage", "Love:", "Jacey's", "Story",
Next, for each word in this list, we count which words follow that word and with
which frequency. For example, in this subset the word "The" is followed by
"Christmas", "Client", or "Cold" each one sixth of the time, and by "Color"
half of the time. In any text generated from this corpus, we
should expect the same statistical properties to hold, if we want to maintain a
similar style and feel as the original sample.
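As a concrete illustration of the counting step (a sketch in Python, not the original code), the follower frequencies for this subset can be tallied in a few lines:

```python
from collections import Counter

titles = [
    "The Christmas Wish",
    "The Client List",
    "The Cold Heart of a Killer",
    "The Color of Courage",
    "The Color of Love: Jacey's Story",
    "The Color Purple",
]

# For each word, count how often each other word immediately follows it.
followers = {}
for title in titles:
    words = title.split()
    for current, nxt in zip(words, words[1:]):
        followers.setdefault(current, Counter())[nxt] += 1

# "Color" follows "The" three times; "Christmas", "Client",
# and "Cold" each follow it once.
print(followers["The"])
```

Dividing each count by the total number of followers for that word gives exactly the probabilities described above.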
Here is the full graph of transition probabilities for the sample text above.
Each vertex represents a word that appears in the corpus, and whenever one
word can be followed by another, the probability of that transition is
represented by a labeled directed edge between the two vertices.
Using this picture, we can try generating a few words by choosing to start with
the word "The" and for each new word desired, we'll roll a die and decide which
next node to move to accordingly. Some possible generated strings include each
title from the original list, but also the new titles "The Color of a Killer" or
"The Cold Heart of Love: Jacey's Story". With more data, a more complicated
graph can be constructed, and more interesting and varied strings can be
generated as a result. You should use as much data in your corpus as you can.
That's roughly how the text generator works in general. However, I avoided
computing the probabilities for each word by instead storing a list of every
word following each word (including duplicates). Choosing from this list at
random has the same effect as rolling a die against a stored table of
probabilities, but this way I don't have to go through the trouble of
computing those numbers or pay the cost of a pile of floating-point
comparisons for each word.
Here is the code. Note that titles is an array of strings already defined
with the corpus list of titles:
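The original code was not captured here, but a minimal Python sketch of the approach just described might look like the following. It assumes `titles` is a list of strings, and it caps the output length so that a cycle in the chain can't loop forever (the helper names `build_chain` and `generate_title` are my own, not the author's):

```python
import random

def build_chain(titles):
    # Map each word to a list of every word that follows it,
    # duplicates included, so random.choice() picks followers
    # with the correct frequencies -- no probabilities needed.
    chain = {}
    starts = []
    for title in titles:
        words = title.split()
        starts.append(words[0])
        for current, nxt in zip(words, words[1:]):
            chain.setdefault(current, []).append(nxt)
    return chain, starts

def generate_title(chain, starts, max_words=10):
    word = random.choice(starts)
    result = [word]
    # Walk the chain until we reach a word that never appears
    # mid-title (so it has no followers) or hit the length cap.
    while word in chain and len(result) < max_words:
        word = random.choice(chain[word])
        result.append(word)
    return " ".join(result)

titles = [
    "The Christmas Wish",
    "The Client List",
    "The Cold Heart of a Killer",
    "The Color of Courage",
    "The Color of Love: Jacey's Story",
    "The Color Purple",
]
chain, starts = build_chain(titles)
print(generate_title(chain, starts))
```

Because the follower lists keep duplicates, a single uniform `random.choice` reproduces the weighted behavior of the probability graph, which is exactly the compromise described above.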