# Markov Chain in JavaScript

N-gram:

## What is Markov Chain?

Let us  use a mathematical structure called a Markov chain to model the statistical likelihood of a word in a title being followed by some other word in a title. Then, I could use that statistical information to generate new titles by choosing the first word (at random) and then choosing subsequent words with a frequency proportional to how those words are arranged in the original text. This will give me a string of text that may not be in the original text, but will share stylistic properties of the original text. While my first approach to writing this generation code did explicitly compute the statistical properties as described above, I later decided to avoid these explicit computations with the trade off of a little bit more memory usage but much cleaner code.  The code and the compromise will be explained below.

As an illustration of how this Markov chain model might look for this list of titles, consider this small subset of titles:

• The Christmas Wish
• The Client
• The Client List
• The Clique
• The Cold Heart of a Killer
• The Colony
• The Color of Courage
• The Color of Love: Jacey's Story
• The Color Purple
• The Con

We begin by constructing a list of unique words that appear in these titles: "The", "Christmas", "Wish", "Client", "List", "Clique", "Cold", "Heart", "of", "a", "Killer", "Colony", "Color", "Courage", "Love:", "Jacey's", "Story", "Purple", "Con".

Next, for each word in this list, we count which words follow that word and with which frequency. For example, the word "The" is followed by "Christmas", "Clique", "Cold", "Colony" or "Con" 10% of the time, followed by "Client" 20% of the time, or "Color" 30% of the time. In any text generated from this corpus, we should expect the same statistical properties to hold, if we want to maintain a similar style and feel as the original sample.

Here is the full graph of probabilities for the above sample text.  Here, each vertex represents a word that appears in the corpus.  For each word, when that word is followed by another word, the probability that this occurs is represented by a labeled directed edge between those vertices.

Using this picture, we can try generating a few words by choosing to start with the word "The" and for each new word desired, we'll roll a die and decide which next node to move to accordingly. Some possible generated strings include each title from the original list, but also the new titles "The Color of a Killer" or "The Cold Heart of Love: Jacey's Story". With more data, a more complicated graph can be constructed, and more interesting and varied strings can be generated as a result. You should use as much data in your corpus as you can find!

That's roughly how the text generator works in general. However, I avoided computing the probabilities for each word by instead storing a list of every word following each word (including duplicates). Choosing from this list at random will have the same consequence as rolling a die and choosing the next word from a stored list of probabilities, but this way I don't have to go through the trouble of computing those numbers or the cost of making a bunch of floating point comparisons for each word.

Here is the code. Note that titles is an array of strings already defined with the corpus list of titles: