Programming from A to Z


Word Counting and Text Analysis

Associative Arrays in JavaScript?

You know that thing we call an array? Yes, that’s right, an ordered list of data. Each element of an array is numbered and accessed by its numeric index.

var nameList = ['Jane', 'Sue', 'Bob'];
// Is your name Sue?
console.log('Is your name ' + nameList[1] + '?');

What if, however, instead of numbering the elements of an array we could name them? This element is named “Sue”, this one “Bob”, this one “Jane”, and so on and so forth. In programming, this kind of data structure is often referred to as an “associative array”, “map”, “hash” or “dictionary.” It’s a collection of key/value pairs. The key is Sue, the value is 24. It’s just like having a dictionary of words and when you look up, say, Sue the definition is 24.

Associative arrays can be incredibly convenient for various applications. For example, you could keep a list of student IDs (student name/id) or a list of prices (product name/price) in a dictionary. The fundamental building block of just about every text analysis application is a concordance, a list of all words in a document along with how many times each word occurred. A dictionary is the perfect data structure to hold this information. Each element of the dictionary consists of a String paired with a number.

Most programming languages and environments have specific classes or objects for a variety of data structures (a dictionary is just one example). JavaScript, however, does not. But all is not lost. Remember that thing called a JavaScript object?

var obj = {
  Sue: 24,
  Jane: 991,
  Bob: 12
};

That’s right. A JavaScript object is a collection of name-value pairs. And so while it might be more convenient to have a custom-tailored dictionary object, we’re going to be able to get all the functionality we need out of just a plain old object itself.

To start writing a concordance all we need is an empty object.

var concordance = {};

A value (in this case a count) can be paired with a word by naming the key as a String.

concordance['the'] = 100;
concordance['a'] = 10;
concordance['go'] = 50;

The above is just another way of writing:

var concordance = {
  the: 100,
  a: 10,
  go: 50
};

We’ll need this bracket syntax since we’ll be pulling the keys for the object as Strings from a source text.
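For example (a quick illustration, not part of the concordance itself), a variable holding a String can go right inside the brackets:

// 'word' might come from splitting up a source text
var word = 'the';
concordance[word]++;   // same as concordance['the']++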

Text Concordance

In the case of our examples, we’re going to take a text document, split it into an array of Strings, and increase the value associated with a particular key (i.e. word) each time we encounter the same String. Let’s assume we have some text in a variable named data. First, we’ll split it into word “tokens”.

// Using any non letter or number character as delimiter
var tokens = data.split(/\W+/);
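For instance, if data held the (made up) string below, the result would be a plain array of word Strings:

var data = 'the birds and the bees';
var tokens = data.split(/\W+/);
// tokens is now ['the', 'birds', 'and', 'the', 'bees']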

Then we’ll go through the tokens one at a time.

for (var i = 0; i < tokens.length; i++) {

The tricky thing here is we have to determine if each token (each element of the resulting array) is a new word or one we’ve already encountered. If it’s new, we need to set its initial count at 1. If it’s not, we need to increase its count by one.

for (var i = 0; i < tokens.length; i++) {
  var word = tokens[i];
  if (concordance[word] === undefined) {
    // It's a new word!
    concordance[word] = 1;
  } else {
    // We've seen this word before!
    concordance[word]++;
  }
}

There, we now have a concordance object that stores all the words and their counts! The question, however, remains: what do we do with this thing?

The first thing you might want to do is simply examine the results. For example, let’s say we wanted to display the most frequent words (in sorted order). Unfortunately, the fields of a JavaScript object have no order to them and cannot be easily sorted. One solution to this problem is to keep a separate array of all the keys. This array can be sorted and used to iterate over all the name/value pairs in the concordance object.

// Add another array to track keys
var keys = [];
for (var i = 0; i < tokens.length; i++) {
  var word = tokens[i];
  if (concordance[word] === undefined) {
    concordance[word] = 1;
    // When we have a new word, let's add to our keys array!
    keys.push(word);
  } else {
    concordance[word]++;
  }
}

While we could write our own sorting algorithm in JavaScript to sort the array, we might as well make use of the sort() function available as part of the Array prototype. The tricky thing here is that sort() expects a function itself as its argument!

keys.sort(function(a, b) {
  // what goes here??
});

This is pretty typical of JavaScript and functional programming. Here we have an anonymous function that we pass into the sort() function itself. This function takes two arguments, a and b, and acts as a comparator: it should return a positive number if element b should appear before a in the sorted result, a negative number if a should come first, and zero if their order doesn’t matter.

keys.sort(function(a, b) {
  if (concordance[b] > concordance[a]) {
    return 1;
  } else if (concordance[a] > concordance[b]) {
    return -1;
  } else {
    return 0;
  }
});

This can be condensed by simply returning the difference between the two counts.

keys.sort(function(a, b) {
  return (concordance[b] - concordance[a]);
});

Now that we have sorted keys, we can iterate over the concordance.

for (var i = 0; i < keys.length; i++) {
  console.log(keys[i] + ': ' + concordance[keys[i]]);
}
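For instance, if the concordance (and the keys array) ended up holding just the three hand-built entries from earlier, the sorted output would be:

the: 100
go: 50
a: 10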

Here is a text concordance example and its source code.

TF-IDF

One common application of a text concordance is TF-IDF, or term frequency–inverse document frequency. Let’s consider a corpus of Wikipedia articles. Is there a way we could automatically generate keywords or tags for an article based on its word counts?

TF-IDF has two components. Term frequency is one that we are already quite familiar with: how frequent is a given term in a document? This is exactly what we calculated in the concordance. We could stop here and say that keyword generation is: “The words that appear most frequently are the most important words in a document.” While there is some merit to this idea, what we’ll see is that the most frequent words are just the words that appear frequently in all text: junk words like ‘to’, ‘a’, ‘and’, ‘you’, ‘me’, etc. Ironically, these junk words may hold the key to unlocking a world of information about a particular text. Nevertheless, as keywords they are clearly unrelated to a document’s subject matter.

TF-IDF takes a different approach. Yes, a word that appears frequently in a document (TF) is one key indicator. But adding in another indicator, inverse document frequency (is it a word that rarely appears in other documents?), takes the junk words out of the equation. Let’s consider a Wikipedia article about rainbows. Here are some of the counts:

the:      16
and:      6
rainbow:  5
droplets: 3

Using term frequency alone as a keyword score is not enough, since the top-scoring word would be ‘the’. Now let’s say we looked at five other Wikipedia articles, for a total of six documents, and counted how many of those documents each of these words appears in at least once.

the:      6
and:      6
rainbow:  1
droplets: 1

This is a somewhat obvious result: ‘the’ and ‘and’ appear in all six articles, while ‘rainbow’ and ‘droplets’ appear in only one (the rainbow article itself). We could therefore compute a score for each word as its term frequency multiplied by the inverse document frequency (total documents divided by documents containing the word):

rainbow:   5 * (6/1) = 30
droplets:  3 * (6/1) = 18
the:      16 * (6/6) = 16
and:       6 * (6/6) = 6

Now we’re getting somewhere!

TF-IDF is meant to be run on a much larger corpus, and in order to dampen the effect of the IDF value, a common solution is to use the logarithm of the IDF.

rainbow:   5 * log(6/1) ≈ 3.89
droplets:  3 * log(6/1) ≈ 2.33
the:      16 * log(6/6) = 0.0
and:       6 * log(6/6) = 0.0

If the logarithmic scale is new to you, this Khan Academy video may help. (Note how, if a term appears in every single document, its TF-IDF score is always zero.)

We can improve this one more step by using not just the raw count of how many times a term (such as “rainbow”) appears in a document, but the ratio of its count to the total number of words in the document. This normalizes the score by document length. So if the total number of words in the article is 100, the scores would now be:

rainbow:  (5/100) * log(6/1) ≈ 0.0389
droplets: (3/100) * log(6/1) ≈ 0.0233
the:     (16/100) * log(6/6) = 0.0
and:      (6/100) * log(6/6) = 0.0

Within a single document this makes no difference to the ranking, but if we were comparing the score for “rainbow” across multiple documents, leaving out this step would bias the score towards longer documents.
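Putting the pieces together, here is a minimal sketch of the full calculation in JavaScript. The function and parameter names are hypothetical; it assumes you have built one concordance object per document in the corpus (with the counting loop from earlier) and that the document being scored is among them.

// concordance:  word counts for the document being scored
// totalWords:   total number of tokens in that document
// concordances: an array with one concordance object per document in the corpus
//               (the scored document's concordance included)
function tfidf(concordance, totalWords, concordances) {
  var scores = {};
  for (var word in concordance) {
    // Term frequency, normalized by document length
    var tf = concordance[word] / totalWords;
    // In how many documents does this word appear at least once?
    var docsWithWord = 0;
    for (var i = 0; i < concordances.length; i++) {
      if (concordances[i][word] !== undefined) {
        docsWithWord++;
      }
    }
    // Inverse document frequency (base 10 log, as in the tables above)
    var idf = Math.log(concordances.length / docsWithWord) / Math.LN10;
    scores[word] = tf * idf;
  }
  return scores;
}

Sorting the resulting scores (with the same keys-array trick used for the concordance) gives you candidate keywords for the document.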

For a wonderful example of TF-IDF out in the world, take a look at Nicholas Felton’s 2013 Annual Report.

Naive Bayesian Text Classification

Bayes’ Theorem:

p(A|B) = (p(B|A) * p(A)) / (p(B|A) * p(A) + p(B|~A) * p(~A) )

Consider the following scenario:

1% of students have a disease called ITPosis. There is a test for it, the TID: 90% of students who have ITPosis test positive, and 5% of students who don’t have it also test positive (a false positive). You have received a positive TID; what is the likelihood you have ITPosis?

As you might expect, there is a very precise answer to this question but it’s probably not what you initially guess. Bayesian reasoning is counter-intuitive and takes quite a bit of getting used to. In fact, when given a similar question related to breast cancer and mammograms, only 15% of doctors get the answer correct.

The answer, approximately 15.4%, is calculated via Bayes’ Theorem.

This video illustrates the problem quite nicely.

The problem our brains run into is those rascally 90% and 95% numbers. 90% of students who have the disease test positive, and 95% of students who don’t have it test negative, so if I test positive, I should probably have it, right?!! The important thing to remember is that only 1% of students actually have the disease. Sure, testing positive increases the likelihood, but because 5% of students without the disease receive a false positive, it only increases the chances to about 15%. All of this is explained in incredibly thorough and wonderful detail in Eliezer Yudkowsky’s article An Intuitive Explanation of Bayesian Reasoning. My explanation is simply adapted from his.

By the way, we could have calculated it as follows:

  P (ITPosis | Positive TID) = (90% * 1%) / (90% * 1% + 5% * 99%)

This reads as: the probability that a positive TID means you have ITPosis equals the probability of a true positive (90% of the 1% of students who have the disease) divided by the probability of any positive result at all (those same true positives plus the 5% of false positives among the 99% of students who don’t have the disease).
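As a quick sanity check, the same arithmetic in JavaScript (the variable names are just for readability):

var pPositiveGivenSick = 0.90;     // 90% of students with ITPosis test positive
var pSick = 0.01;                  // 1% of students have ITPosis
var pPositiveGivenHealthy = 0.05;  // 5% of healthy students get a false positive
var pHealthy = 0.99;               // 99% of students don't have ITPosis

var pSickGivenPositive =
  (pPositiveGivenSick * pSick) /
  (pPositiveGivenSick * pSick + pPositiveGivenHealthy * pHealthy);

console.log(pSickGivenPositive);   // approximately 0.1538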

So why do we care? This type of reasoning can be applied quite nicely to text analysis. A common example is spam filtering. If we know the probability that a spam e-mail contains specific words, we can calculate the likelihood that an e-mail is spam based on its concordance.

A wonderful resource for this approach is Paul Graham’s A Plan for Spam as well as Better Bayesian Filtering.

The example code that follows is not a perfect text classifier by any means. It’s a simple implementation of the idea that outlines the basic steps one might take to apply Bayesian Filtering to text.

A Word Object

The first thing we need to do is expand on the concordance example that stores a single number associated with each word. For classification, we’ll need to know things like how many times that word appears in spam e-mails versus good (aka ‘ham’) e-mails. And then we’ll need to use these values to calculate the probability that each word would appear in a spam or ham e-mail.

var word = {};
word.countA = 0;   // category A count
word.countB = 0;   // category B count
word.probA = 0;    // probability word appears in a category A doc (calculated later)
word.probB = 0;    // probability word appears in a category B doc (calculated later)
// etc. etc.

Instead of storing a single number like dictionary['the'] = 16; we now need to associate an object holding multiple data points with each key. The process of running the filter works as follows:

  1. Train the filter with known category A (for example: spam) e-mails and known category B (ham) e-mails.
  2. For every word, check if it’s new. If it is, add it; if not, simply increase the counter for “A” or “B” (depending on whether the training document belongs to A or B).

Here’s how this might look:

for (var i = 0; i < tokens.length; i++) {
  var token = tokens[i].toLowerCase();
  // It's a new word!
  if (dictionary[token] === undefined) {
    dictionary[token] = {};
    dictionary[token].countA = 0;
    dictionary[token].countB = 0;
    dictionary[token].word = token;
  }
  // Which category are we training for?
  if (category === 'A') {
    dictionary[token].countA++;
    tokenCountA++;
  } else if (category === 'B') {
    dictionary[token].countB++;
    tokenCountB++;
  }
}

The above steps are repeated over and over again for all training documents. Once all the “training” files are read, the probabilities can be calculated for every word.

Once we’ve gone through the process of counting the occurrences in each category (‘A’ or ‘B’, spam or ham, etc.), we can then calculate the probabilities according to Bayes rule.

// Ok, assuming we have an array of keys
for (var i = 0; i < keys.length; i++) {
  var key = keys[i];
  var word = dictionary[key];

  // Average frequency per document
  // (this assumes we've counted total documents)
  word.freqA = word.countA / docCountA;
  word.freqB = word.countB / docCountB;

  // Probability via Bayes rule
  word.probA = word.freqA / (word.freqA + word.freqB);
  word.probB = 1 - word.probA;
}

The above formula might look a little bit simpler to you than the original Bayes rule. This is because I am leaving out the “prior probability” and assuming that any document has a 50% chance of being category A or B.
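If you did want to fold a prior back in (say, the fraction of training documents seen in each category), it would be a small change to the probability lines inside the loop above. This is just a sketch; priorA and priorB are hypothetical values derived from the document counts already being tracked.

// Hypothetical priors: the fraction of training documents in each category
var priorA = docCountA / (docCountA + docCountB);
var priorB = 1 - priorA;

// Bayes rule with the prior included
word.probA = (word.freqA * priorA) / (word.freqA * priorA + word.freqB * priorB);
word.probB = 1 - word.probA;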

Now, all that is left to do is take a new document and compute the total probability for that document, combining the individual word probabilities as outlined in Graham’s essay. For more about combined probability, here’s another resource.

// Combined probabilities
// http://www.paulgraham.com/naivebayes.html
// 'words' holds the word objects (from the dictionary)
// for the tokens found in the new document
var productA = 1;
var productB = 1;

// Multiply probabilities together
for (var i = 0; i < words.length; i++) {
  var word = words[i];
  productA *= word.probA;
  productB *= word.probB;
}

// Apply formula
var pA = productA / (productA + productB);

Now we know the probability the document is in category A!

One important aspect of this analysis that I’ve left out is the “interesting-ness” of any given word. An interesting-ness rating is defined as how far a word’s probability is from 0.5 (50/50 is as boring as it gets), i.e. the absolute value of probA - 0.5. Graham’s spam filter, for example, only uses the probabilities of the 15 most interesting words. If you are looking for an exercise, you might try adding this feature to the Bayesian classifier example; a possible starting point is sketched below.
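One possible sketch (not part of the example code) is to sort the new document’s word objects by their distance from 0.5 and keep only the top 15 before multiplying the probabilities together:

// Sort by "interesting-ness": how far probA is from a coin flip
words.sort(function(a, b) {
  return Math.abs(b.probA - 0.5) - Math.abs(a.probA - 0.5);
});

// Keep only the 15 most interesting words (or all of them, if there are fewer)
var interesting = words.slice(0, 15);

The combined probability loop would then run over interesting instead of words.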