Programming from A to Z


about syllabus All example source code

Context-Free Grammar

Examples

CFG projects

Exercise ideas

Grammars

From Wikipedia: “Grammar is the study of the rules governing the use of a given natural language, and, as such, is a field of linguistics. Traditionally, grammar included morphology and syntax; in modern linguistics these subfields are complemented by phonetics, phonology, semantics, and pragmatics.”

A Context-Free Grammar is a set of rules that define the syntax of a language — what is and is not a valid sequence of “tokens”. Programming languages, for example, are context-free grammars — a compiler reads your code to make sure it conforms to specific rules and informs you of any errors. (These rules aren’t limited to the production of text, and can be used for music, visual designs, etc.) A context-free grammar, however, isn’t robust enough to describe the complexities of natural language. After all, they have no context! Natural languages can be described using Context-sensitive grammars, a concept introduced by Chomsky in the 50s.

Here, I will demontrate how to use libraries to generate text with a Context Free Grammar as well as program your own from scratch. want to learn how to define our own context free grammars and generate text from them. I’m going to give a short explanations, followed by code examples. For more detailed discussions, I would recommend Daniel Howe’s (creator of RiTa) explanation and Context-Free Grammars by Allison Parrish.

All “production rules” in a context-free grammar are of the form:

V -> w

where V is a single “non-terminal” and w is a sequence of “terminals” and “non-terminals”

A non-terminal symbol is simply another symbol which needs to be defined in another production rule. A terminal symbol does not need to be defined because that’s it, you’ve reached the end, what is here is what should be in the sentence itself. Here’s an incredibly simple CFG.

s -> a b
a -> 'the'
b -> c 'cat'
c -> 'happy'
c -> 'sad'

Anything in single quotes is a “terminal” element, meaning this is the end and no more substitutions are necessary. In the above “cat,” “happy,” and “sad” are all terminals. Anything that is not in quotes is non-terminal, or a “symbol.” The abve example has 4 symbols — s, a, b, c. The symbol “s” is commonly used to indicate “start” or “sentence.”

Notice how in the above set of rules the symbol c can be “happy” or “sad.” This is often express with an OR (pipe character) to combine these two rules:

c -> 'happy' | 'sad'

Again, this grammar is incredibly basic, and is just for demonstrating how the elements work. The only two valid “sentences” that conform to this grammar are:

the happy cat
the sad cat

The process of generating the above two sentences goes something like:

s becomes a b which becomes the c cat which then becomes the happy cat or the sad cat. Here, either “happy” or “sad” is picked randomly (with a 50% chance for each one.)

Tracery

A great way to get started experimenting with CFG text generation is to use a library. One of my favorites is Tracery by Kate Compton. With Tracery, all you need to do is setup your grammar as a JSON object. Here’s a very simple example grammar:

var story = {
  "start": ['Once upon a time, there was a #adj## #hero#'.],
  "hero": ['fairy', 'unicorn', 'dragon', 'bear'],
  "adj": ['smart', 'pretty', 'smelly', 'funny', 'angry']
}

Note how #adj# is surrounded by the # symbol. This indicates that it is a “non-terminal” element and should be replaced with a random element from the adj property of story. These types of replacement rules can be nested and result in complex outcomes. To generate the story, a grammar object is created from the above story data. Any given outcome can be “expanded” from a starting element (like #start#) and then “flattened” (meaning we just want the final result, not the full expansion tree.)

var grammar = tracery.createGrammar(story);
var expansion = grammar.flatten('#start#');

CFG with RiTa.js

The RiTa.js library also includes Context Free Grammars as a feature with the RiGrammar object. You can create a grammar in code by creating a grammar object and adding rules..

var grammar = new RiGrammar();
grammar.addRule('<start>', 'Once upon a time, there was a <adj> <hero>.');
grammar.addRule('<hero>', 'fairy | unicorn | dragon | bear');
grammar.addRule('<adj>', 'smart | pretty | smelly | funny | angry');

RiTa looks for a rule that begins with <start> to generate the text with expand().

var story = grammar.expand();

RiTa also allows you pass a third argument (weight) to each rule to increase the probability of a particular option being selected. And finally, RiTa allows you to load grammars from a file (formatted with JSON or YAML) with loadFrom().

Coding your own CFG

How do we build a data structure to track a context-free grammar if you want to program your own from scratch? Again, I’m going to use a JavaScript object just like with the concordance and N-gram analysis examples. Here we’ll take a production rule and map non-terminal characters to keys and the possible outcomes as an array of values.

var cfg = {
  's': [['a', 'b']],
  'a': [['the']],
  'b': [['c', 'cat']],
  'c': [['happy'], ['sad']]
};

The above may look a little strange to you. Notice how the value for each key is an array of arrays. Each element of the larger array is one possible outcome which is an array itself: the list of elements in that outcome. I should also note that the is no distinction between a “symbol” and a “terminal.” Everything is just a String, if there is a rule associated with that String then it is a symbol, if not, it’s terminal.

A generated sentence from the CFG is commonly referred to as an “expansion” and is a bit tricky to implement. We could try to just use a loop, iterating over a sentence and executing the production rules. The nested nature of the rules, however, introduces a great deal of complexity. An elegant strategy, however, is to call a function recursively expanding the same sentence over and over again until there are no non-terminal elements left. Let’s assume we have an empty array and some seed (often called an “axiom”).

var expansion = [];
var seed = 's';

We can write a function that receives both the array and the seed and calls itself recursively if a rule that matches that seed String is present.

function expand(start, expansion) {
  if (cfg.hasOwnProperty(start)) {
    // Grab one possible expansion
    var possible = rules[start];
    var random_expansion = choice(possible);
    // Call this method again with the current element
    for (var i = 0; i < random_expansion.length; i++) {
      expand(random_expansion[i], expansion);
    }
  // if the rule wasn't found, then it's a terminal:
  // append the string to the expansion
  } else {
    expansion.push(start);
  }
}

The full implementation of the CFG object can be found here. In addition, here are some sample grammars: creating a grammar in code, a grammar file, a grammar files as JSON, haiku grammar file as JSON. These grammars come from Allison Parrish and Daniel Howe.

Finally, a grammar file is something you can generate yourself based on some algorithm. For example, using this haiku grammar file from Daniel Howe, you could read in a source text, analyze the syllable counts of all the words, and rewrite the grammar file to reflect these new words. Here is a version of this grammar file using words from an Obama speech: generated_grammar.json.

Here is an example that uses a concordance to get al the words from a source file and the RiTa library to count syllables.

More on Recursion

If the concept of recursion in the expand() function above is confusing to you, this video about recursion in algorithmic drawing from The Nature of Code may help.

L-Systems

Finally, another example of a Context-Free Grammar is an L-System, a grammar used to generated growth patterns found in nature.

Exercise ideas