
Noam Chomsky on Languages. Source: Shameem. 2018.
Any language is a structured medium of communication whether it is a spoken or written natural language, sign or coded language, or a formal programming language. Languages are characterised by two basic elements – syntax (grammatical rules) and semantics (meaning). In some languages, the meaning might vary depending upon a third factor called context of usage.
Depending on the restrictions and complexity present in its grammar, a language finds its place in the hierarchy of formal languages. Noam Chomsky, the celebrated American linguist and cognitive scientist, defined this hierarchy in 1956, and hence it's called the Chomsky hierarchy.
Although the concept is quite old, there's renewed interest in it because of its relevance to Natural Language Processing. The Chomsky hierarchy helps us answer questions like “Can a natural language like English be described (‘parsed’, ‘compiled’) with the same methods as used for formal/artificial (programming) languages in computer science?”
What are the different levels in the Chomsky hierarchy?

Chomsky Hierarchy Levels. Source: Fitch. 2014.
There are four levels: Type-3 (regular), Type-2 (context-free), Type-1 (context-sensitive) and Type-0 (recursively enumerable). Moving from Type-3 to Type-0, the grammar rules become less restrictive but harder to automate. Each level is also a proper subset of the next: every regular language is context-free, every context-free language is context-sensitive, and so on.
What are the common terms and definitions used while studying Chomsky Hierarchy?
What are the corresponding language characteristics in each level?

Languages, Automaton, Grammar, Recognition. Source: Hauser and Watumull 2016, fig. 1.
Type-3 (regular) grammars have production rules too restrictive to describe entire languages. However, constructs describable by regular expressions come under this type.
For instance, the rule for naming an identifier in a programming language is a regular expression: any combination of case-insensitive letters, digits and some special characters, but it must start with a letter.
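As a sketch, such an identifier rule can be written with Python's `re` module (the exact character set varies by language; this pattern is only illustrative):

```python
import re

# Illustrative identifier rule: starts with a letter, then any mix of
# letters, digits or underscores. A regular (Type-3) construct.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def is_identifier(s: str) -> bool:
    """Return True if the whole string matches the identifier pattern."""
    return IDENTIFIER.fullmatch(s) is not None
```

For example, `is_identifier("count_1")` accepts, while `is_identifier("1count")` rejects because the string starts with a digit.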
Context-free languages classified as Type-2 are capable of handling an important language construct called nested dependencies. English example – Recursive presence of “If <phrase> then <phrase>” – “If it rains today and if I don’t carry an umbrella, then I'd get drenched”. For programming languages, the matching parentheses of functions or loops get covered by this grammar.
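No regular expression can check arbitrarily deep nesting, but a pushdown automaton, the recogniser for Type-2 languages, can. A minimal sketch of a matching-parentheses check using an explicit stack:

```python
def parens_balanced(text: str) -> bool:
    """Recognise well-nested parentheses with a stack: the work a
    pushdown automaton does for a context-free (Type-2) language."""
    stack = []
    for ch in text:
        if ch == "(":
            stack.append(ch)
        elif ch == ")":
            if not stack:      # closing paren with nothing to match
                return False
            stack.pop()
    return not stack           # every opening paren must be closed
```

The stack depth grows with the nesting depth, which is exactly what a finite-state (Type-3) machine cannot track.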
Type-1 languages become context-sensitive by restricting productions α → β of a phrase structure grammar so that β is at least as long as α. They permit replacement of α by β only in a ‘context’: [context] α [context] → [context] β [context].
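The textbook example of a context-sensitive (but not context-free) language is {aⁿbⁿcⁿ | n ≥ 1}: three equal-length runs that no CFG can generate. A membership check is still easy to sketch:

```python
def in_anbncn(s: str) -> bool:
    """Check membership in {a^n b^n c^n : n >= 1}, a classic
    context-sensitive language that is not context-free."""
    n, rem = divmod(len(s), 3)
    if n == 0 or rem != 0:
        return False
    return s == "a" * n + "b" * n + "c" * n
```

The check itself is computationally trivial; the point is that no context-free grammar can generate exactly these strings.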
Finally, Type-0 languages place no restrictions on their grammar. They are recognised by Turing machines, but a membership test may loop forever: there is no algorithm that is guaranteed to halt and decide whether a given string belongs to the language.
What type of automaton recognises the grammar at each level?
Can you give a quick example for each type of grammar/language?
In which level of the hierarchy do formal programming languages fall?
Reading a text file containing a high-level language program and compiling it according to its syntax is done in two steps.
Finite state models associated with Type-3 grammar are used for performing the first step of lexical analysis. Raw text is aggregated into keywords, strings, numerical constants and identifiers in this step.
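A minimal sketch of such a lexer, using one regular (Type-3) pattern per token class. The token names and the toy language here are assumptions for illustration, not any particular compiler's:

```python
import re

# Toy token classes, each described by a regular (Type-3) pattern.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),                  # numerical constants
    ("IDENT",  r"[A-Za-z_][A-Za-z0-9_]*"),  # identifiers and keywords
    ("STRING", r'"[^"]*"'),              # string literals
    ("OP",     r"[+\-*/=]"),             # operators
    ("SKIP",   r"\s+"),                  # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(text):
    """Aggregate raw text into (kind, value) tokens."""
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(text)
            if m.lastgroup != "SKIP"]
```

For example, `lex('x = 42')` yields `[("IDENT", "x"), ("OP", "="), ("NUMBER", "42")]`.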
In the second step, to parse the program constructs of any high level language according to its syntax, a Context-Free Grammar is required. Usually these grammars are specified in Backus-Naur Form (BNF).
For example, to build a grammar for an IF statement, the grammar would begin with a non-terminal start symbol S. The rules would be of the form:
S → IF-STATEMENT
IF-STATEMENT → if CONDITION then BLOCK endif
BLOCK → STATEMENT | STATEMENT ; BLOCK
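A grammar of this shape maps naturally onto a recursive-descent recogniser. A minimal sketch, with the simplifying assumption that CONDITION and STATEMENT are each a single word:

```python
def parse_if(tokens):
    """Recognise: if CONDITION then BLOCK endif, where BLOCK is one or
    more STATEMENT words separated by ';'. CONDITION and STATEMENT are
    reduced to single words here, a simplification for illustration."""
    def expect(i, word):
        if i < len(tokens) and tokens[i] == word:
            return i + 1
        raise SyntaxError(f"expected {word!r} at position {i}")

    i = expect(0, "if")
    i += 1                                  # CONDITION: any one word
    i = expect(i, "then")
    i += 1                                  # first STATEMENT
    while i < len(tokens) and tokens[i] == ";":
        i += 2                              # ';' plus the next STATEMENT
    i = expect(i, "endif")
    return i == len(tokens)
```

For example, `parse_if(["if", "raining", "then", "stay", ";", "read", "endif"])` returns `True`, while ill-formed input raises `SyntaxError`.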
Conventionally, all high-level programming languages can be covered under the Type-2 grammar in Chomsky’s hierarchy.
The Python language has a distinctive feature of being whitespace-sensitive. To make this feature fit into a conventional CFG, Python's tokeniser emits two additional tokens, INDENT and DEDENT, to represent changes in line indentation.
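These tokens are visible through the standard library's `tokenize` module, which emits them explicitly:

```python
import io
import tokenize

src = "if x:\n    y = 1\nz = 2\n"

# Tokenise the source; indentation changes appear as explicit
# INDENT/DEDENT tokens so a conventional CFG can parse the result.
names = [tokenize.tok_name[tok.type]
         for tok in tokenize.generate_tokens(io.StringIO(src).readline)]

# names now contains 'INDENT' (entering the if-body) and 'DEDENT'
# (leaving it), alongside NAME, OP and NUMBER tokens.
```

With indentation turned into tokens, the rest of Python's syntax can be handled by an ordinary context-free parser.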
However, just syntactic analysis does not guarantee that the language will be entirely ‘understood’. Semantics need to match too.
Where can we place natural languages in the hierarchy?
Natural languages are an infinite set of sentences constructed out of a finite set of characters. The number of words in a sentence has no defined upper limit either. When natural languages are reverse engineered into their component parts, they break down into four components: syntax, semantics, morphology and phonology.
Tokenising words and identifying nested dependencies work as explained in the previous section.
Part-of-Speech Tagging is a challenge. In “He runs 20 miles every day” and “The batsman scored 150 runs in one day”, the same word ‘runs’ is a verb in the first sentence and a noun in the second. Finite state grammars can be used for resolving such lexical ambiguity.
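A toy illustration of finite-state disambiguation, choosing the tag of ‘runs’ purely from the preceding word, as a finite automaton would. The rules here are invented for the example and are not a real tagger:

```python
# Toy finite-state disambiguation: the tag of 'runs' depends only on
# the class of the previous word (pronoun -> verb, number -> noun).
PRONOUNS = {"he", "she", "it", "i", "we", "they", "you"}

def tag_runs(prev_word: str) -> str:
    """Tag the word 'runs' from one word of left context."""
    if prev_word.lower() in PRONOUNS:
        return "VERB"    # "He runs 20 miles every day"
    if prev_word.isdigit():
        return "NOUN"    # "The batsman scored 150 runs"
    return "UNKNOWN"
```

Real taggers use probabilistic finite-state models (such as hidden Markov models) over much richer contexts, but the state-based principle is the same.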
Identifying cases (subjective – I, possessive – mine, objective – me, etc.) for nouns varies across languages: Old English (5), Modern English (3), Sanskrit and Tamil (8). Each case also has interrogative forms. A clear definition of cases enables free word order. The CFGs defined for these languages take care of this.
Natural languages are believed to be at least context-free. However, Dutch and Swiss German contain grammatical constructions with cross-serial dependencies which make them context sensitive.
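Cross-serial dependencies pattern like the copy language {ww | w ∈ Σ*}, a standard example of a language no context-free grammar can generate: position i in the first half must match position i in the second half, so the dependencies cross rather than nest. A membership sketch:

```python
def in_copy_language(s: str) -> bool:
    """Check membership in {ww}, the copy language. Its crossing
    dependencies (1st-half position i matches 2nd-half position i)
    mirror cross-serial constructions in Swiss German."""
    half, rem = divmod(len(s), 2)
    return rem == 0 and s[:half] == s[half:]
```

Again, checking membership is easy; the point is that generating exactly these strings requires more than context-free power.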
Languages with a single, clearly documented source of grammar are easier to classify.
Are there any exceptional cases in natural languages that make its classification ambiguous?
NLP practitioners have successfully managed to assign a majority of natural language aspects to the regular and CFG category. However, some aspects don't easily conform to a particular grammar and require special handling.
What are the important extensions to Chomsky hierarchy that find relevance in NLP?

Mildly Context Sensitive Languages. Source: Jäger and Rogers. 2012.
There are two extensions to the traditional Chomsky hierarchy that have proved useful in linguistics and cognitive science:
1928
Avram Noam Chomsky is born on December 7, 1928. Decades later, Chomsky is credited with the creation of the theory of generative grammar, considered to be one of the most significant contributions to the field of linguistics made in the 20th century.
1936
Turing machines, first described by Alan Turing in 1936–7, are simple abstract computational devices intended to help investigate the extent and limitations of what can be computed.