Chomsky normal form
In formal language theory, a context-free grammar G is said to be in Chomsky normal form (first described by Noam Chomsky)^{[1]} if all of its production rules are of the form:^{[2]}^{:92–93,106}
 A → BC, or
 A → a, or
 S → ε,
where A, B, and C are nonterminal symbols, a is a terminal symbol (a symbol that represents a constant value), S is the start symbol, and ε denotes the empty string. Also, neither B nor C may be the start symbol, and the third production rule can only appear if ε is in L(G), namely, the language produced by the context-free grammar G.
Every grammar in Chomsky normal form is context-free, and conversely, every context-free grammar can be transformed into an equivalent one which is in Chomsky normal form and has a size no larger than the square of the original grammar's size.
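The three rule shapes can be checked mechanically. The following is a minimal sketch, assuming a grammar is given as a list of (left-hand side, right-hand side) pairs with tuples of symbol names as right-hand sides; the function name is_cnf and this representation are illustrative choices, not part of the definition.

```python
def is_cnf(rules, nonterminals, start):
    """Return True if every rule has one of the three Chomsky normal form shapes.

    rules: iterable of (lhs, rhs) pairs, rhs being a tuple of symbols.
    Any symbol not in `nonterminals` is treated as a terminal (an assumption).
    """
    for lhs, rhs in rules:
        if len(rhs) == 0:
            # S -> ε is allowed only for the start symbol
            if lhs != start:
                return False
        elif len(rhs) == 1:
            # A -> a: the single symbol must be a terminal
            if rhs[0] in nonterminals:
                return False
        elif len(rhs) == 2:
            # A -> B C: two nonterminals, neither of them the start symbol
            if any(x not in nonterminals or x == start for x in rhs):
                return False
        else:
            return False
    return True


good = [("S", ("A", "B")), ("A", ("a",)), ("B", ("b",)), ("S", ())]
bad = [("S", ("A", "S")), ("A", ("a",))]   # start symbol on a right-hand side
print(is_cnf(good, {"S", "A", "B"}, "S"))  # True
print(is_cnf(bad, {"S", "A"}, "S"))        # False
```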
Converting a grammar to Chomsky normal form
To convert a grammar to Chomsky normal form, a sequence of simple transformations is applied in a certain order; this is described in most textbooks on automata theory.^{[2]}^{:87–94}^{[3]}^{[4]}^{[5]} The presentation here follows Hopcroft, Ullman (1979), but is adapted to use the transformation names from Lange, Leiß (2009).^{[6]}^{[note 1]} Each of the following transformations establishes one of the properties required for Chomsky normal form.
START: Eliminate the start symbol from right-hand sides
Introduce a new start symbol S_{0}, and a new rule
 S_{0} → S,
where S is the previous start symbol. This doesn't change the grammar's produced language, and S_{0} won't occur on any rule's right-hand side.
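As a small sketch of START, assuming rules are stored as (left-hand side, right-hand side) pairs and that the name "S0" is free (both are assumptions made for illustration):

```python
def start_transform(rules, start, new_start="S0"):
    """Add the rule new_start -> start and return (rules, new_start).

    `new_start` is a hypothetical default name; it must not occur in the grammar.
    """
    used = {lhs for lhs, _ in rules} | {x for _, rhs in rules for x in rhs}
    if new_start in used:
        raise ValueError(f"{new_start!r} is already used in the grammar")
    return [(new_start, (start,))] + list(rules), new_start


rules = [("S", ("a", "S", "b")), ("S", ())]
new_rules, s0 = start_transform(rules, "S")
print(new_rules[0])  # ('S0', ('S',))
```

Because the new symbol is checked to be unused, it cannot occur on any right-hand side afterwards.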
TERM: Eliminate rules with nonsolitary terminals
To eliminate each rule
 A → X_{1} ... a ... X_{n}
with a terminal symbol a that is not the only symbol on the right-hand side, introduce, for every such terminal, a new nonterminal symbol N_{a}, and a new rule
 N_{a} → a.
Change every rule
 A → X_{1} ... a ... X_{n}
to
 A → X_{1} ... N_{a} ... X_{n}.
If several terminal symbols occur on the right-hand side, simultaneously replace each of them by its associated nonterminal symbol. This doesn't change the grammar's produced language.^{[2]}^{:92}
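A possible sketch of TERM, under the assumptions that rules are (left-hand side, right-hand side) pairs and that names of the form N_a do not clash with existing symbols:

```python
def term_transform(rules, nonterminals):
    """Replace nonsolitary terminals a by a nonterminal N_a and add N_a -> a."""
    new_rules = []
    proxies = {}  # terminal a -> its nonterminal N_a (names assumed to be fresh)
    for lhs, rhs in rules:
        if len(rhs) >= 2:
            replaced = []
            for x in rhs:
                if x in nonterminals:
                    replaced.append(x)
                else:
                    proxies.setdefault(x, f"N_{x}")
                    replaced.append(proxies[x])
            rhs = tuple(replaced)
        new_rules.append((lhs, rhs))
    # one rule N_a -> a for every terminal that was replaced
    new_rules.extend((n, (a,)) for a, n in proxies.items())
    return new_rules


rules = [("S", ("a", "S", "b")), ("S", ("c",))]
for rule in term_transform(rules, {"S"}):
    print(rule)
```

The solitary terminal in S → c is left untouched, since that rule already has an admissible shape.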
BIN: Eliminate right-hand sides with more than 2 nonterminals
Replace each rule
 A → X_{1} X_{2} ... X_{n}
with more than 2 nonterminals X_{1},...,X_{n} by rules
 A → X_{1} A_{1},
 A_{1} → X_{2} A_{2},
 ... ,
 A_{n-2} → X_{n-1} X_{n},
where A_{i} are new nonterminal symbols. Again, this doesn't change the grammar's produced language.^{[2]}^{:93}
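The chain of new rules can be generated with a short loop. This is only a sketch: the helper names (left-hand side plus a running counter) are an assumption and are presumed not to clash with existing symbols.

```python
def bin_transform(rules):
    """Split every right-hand side with more than 2 symbols into a chain of binary rules."""
    new_rules = []
    counter = 0
    for lhs, rhs in rules:
        head = lhs
        while len(rhs) > 2:
            counter += 1
            fresh = f"{lhs}_{counter}"  # hypothetical fresh nonterminal A_i
            new_rules.append((head, (rhs[0], fresh)))
            head, rhs = fresh, rhs[1:]
        new_rules.append((head, rhs))
    return new_rules


for rule in bin_transform([("A", ("X1", "X2", "X3", "X4"))]):
    print(rule)
# ('A', ('X1', 'A_1'))
# ('A_1', ('X2', 'A_2'))
# ('A_2', ('X3', 'X4'))
```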
DEL: Eliminate ε-rules
An ε-rule is a rule of the form
 A → ε,
where A is not the grammar's start symbol.
To eliminate all rules of this form, first determine the set of all nonterminals that derive ε. Hopcroft and Ullman (1979) call such nonterminals nullable, and compute them as follows:
 If a rule A → ε exists, then A is nullable.
 If a rule A → X_{1} ... X_{n} exists, and each X_{i} is nullable, then A is nullable, too.
Obtain an intermediate grammar by replacing each rule
 A → X_{1} ... X_{n}
by all versions with some nullable X_{i} omitted. Deleting from this grammar each ε-rule whose left-hand side is not the start symbol yields the transformed grammar.^{[2]}^{:90}
For example, in the following grammar, with start symbol S_{0},
 S_{0} → AbB | C
 B → AA | AC
 C → b | c
 A → a | ε
the nonterminal A, and hence also B, is nullable, while neither C nor S_{0} is. Hence the following intermediate grammar is obtained:^{[note 2]}
 S_{0} → AbB | AbB̶ | A̶bB | A̶bB̶ | C
 B → AA | A̶A | AA̶ | A̶εA̶ | AC | A̶C
 C → b | c
 A → a | ε
In this grammar, all ε-rules have been "inlined at the call site".^{[note 3]} In the next step, they can hence be deleted, yielding the grammar:
 S_{0} → AbB | Ab | bB | b | C
 B → AA | A | AC | C
 C → b | c
 A → a
This grammar produces the same language as the original example grammar, viz. {ab,aba,abaa,abab,abac,abb,abc,b,ba,baa,bab,bac,bb,bc,c}, but has no ε-rules.
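The DEL step, including the computation of the nullable nonterminals, might be sketched as follows. The function names and the representation of rules as (left-hand side, right-hand side) pairs are assumptions for illustration; the input is the example grammar from this section.

```python
from itertools import product


def nullable_set(rules):
    """Fixed-point computation of all nonterminals that derive ε."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            # an empty rhs (A -> ε) satisfies all(...) trivially
            if lhs not in nullable and all(x in nullable for x in rhs):
                nullable.add(lhs)
                changed = True
    return nullable


def del_transform(rules, start):
    """Add all versions with nullable symbols omitted, then drop non-start ε-rules."""
    nullable = nullable_set(rules)
    result = set()
    for lhs, rhs in rules:
        # a nullable symbol may be kept or omitted; every other symbol is kept
        options = [((x,), ()) if x in nullable else ((x,),) for x in rhs]
        for choice in product(*options):
            new_rhs = tuple(s for part in choice for s in part)
            if new_rhs or lhs == start:
                result.add((lhs, new_rhs))
    return result


rules = [
    ("S0", ("A", "b", "B")), ("S0", ("C",)),
    ("B", ("A", "A")), ("B", ("A", "C")),
    ("C", ("b",)), ("C", ("c",)),
    ("A", ("a",)), ("A", ()),
]
print(sorted(nullable_set(rules)))  # ['A', 'B']
for rule in sorted(del_transform(rules, "S0")):
    print(rule)
```

Using a set for the result removes duplicate versions, such as the two ways of obtaining B → A from B → AA.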
UNIT: Eliminate unit rules
A unit rule is a rule of the form
 A → B,
where A, B are nonterminal symbols. To remove it, for each rule
 B → X_{1} ... X_{n},
where X_{1} ... X_{n} is a string of nonterminals and terminals, add rule
 A → X_{1} ... X_{n}
unless this is a unit rule which has already been removed.
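One way to sketch UNIT is a loop that removes one unit rule at a time and remembers the removed ones. The representation (a set of (left-hand side, right-hand side) pairs) and the function name are assumptions for illustration.

```python
def unit_transform(rules, nonterminals):
    """Remove unit rules A -> B by copying the rules of B to A."""
    rules = set(rules)
    removed = set()  # unit rules that have already been removed

    def is_unit(rule):
        return len(rule[1]) == 1 and rule[1][0] in nonterminals

    while True:
        unit = next((r for r in rules if is_unit(r)), None)
        if unit is None:
            return rules
        rules.remove(unit)
        removed.add(unit)
        a, (b,) = unit
        for lhs, rhs in list(rules):
            if lhs == b:
                new = (a, rhs)
                # skip unit rules that were removed earlier, so the loop terminates
                if not (is_unit(new) and new in removed):
                    rules.add(new)


rules = {
    ("S0", ("Expr",)),
    ("Expr", ("Term",)),
    ("Expr", ("Expr", "AddOp_Term")),
    ("Term", ("number",)),
}
result = unit_transform(rules, {"S0", "Expr", "Term", "AddOp_Term"})
for rule in sorted(result):
    print(rule)
```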
Order of transformations
Mutual preservation of transformation results
Transformation X always preserves (✓) or may destroy (✗) the result of Y:

 X \ Y   START  TERM  BIN  DEL    UNIT
 START   -      ✓     ✓    ✓      ✗
 TERM    ✓      -     ✗    ✓      ✓
 BIN     ✓      ✓     -    ✓      ✓
 DEL     ✓      ✓     ✓    -      ✗
 UNIT    ✓      ✓     ✓    (✓)*   -

* UNIT preserves the result of DEL if START was applied before.
When choosing the order in which the above transformations are to be applied, it has to be considered that some transformations may destroy the result achieved by other ones. For example, START will reintroduce a unit rule if it is applied after UNIT. The table shows which orderings are admitted.
Moreover, the worst-case bloat in grammar size^{[note 4]} depends on the transformation order. Using |G| to denote the size of the original grammar G, the size blow-up in the worst case may range from |G|^{2} to 2^{2|G|}, depending on the transformation algorithm used.^{[6]}^{:7} The blow-up in grammar size depends on the order between DEL and BIN. It may be exponential when DEL is done first, but is linear otherwise. UNIT can incur a quadratic blow-up in the size of the grammar.^{[6]}^{:5} The orderings START,TERM,BIN,DEL,UNIT and START,BIN,DEL,UNIT,TERM lead to the least (i.e. quadratic) blow-up.
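The effect of the order between DEL and BIN can be illustrated on a single rule A → X_{1} ... X_{n} in which every X_{i} is assumed to be nullable. The sketch below only counts the resulting non-empty rule versions (not the symbol-measured size), with n = 10 chosen arbitrarily.

```python
from itertools import product


def versions(rhs, nullable):
    """All non-empty right-hand sides obtained by omitting some nullable symbols."""
    options = [((x,), ()) if x in nullable else ((x,),) for x in rhs]
    return {tuple(s for part in choice for s in part)
            for choice in product(*options)} - {()}


n = 10
xs = [f"X{i}" for i in range(1, n + 1)]       # X1 .. Xn, all assumed nullable
helpers = [f"A{i}" for i in range(1, n - 1)]  # A1 .. A_{n-2}, introduced by BIN

# DEL applied directly to the long rule A -> X1 ... Xn
del_first = len(versions(tuple(xs), set(xs)))

# BIN first: A -> X1 A1, A1 -> X2 A2, ..., A_{n-2} -> X_{n-1} X_n, then DEL.
# Every helper derives only nullable symbols, so it is nullable as well.
bin_rhs = [(xs[i], helpers[i]) for i in range(n - 2)] + [(xs[-2], xs[-1])]
bin_first = sum(len(versions(rhs, set(xs) | set(helpers))) for rhs in bin_rhs)

print(del_first, bin_first)  # 1023 27
```

With DEL first, the single rule yields 2^n − 1 versions; with BIN first, each of the n − 1 binary rules yields at most 3.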
Example
The following grammar, with start symbol Expr, describes a simplified version of the set of all syntactically valid arithmetic expressions in programming languages like C or Algol 60. Both number and variable are considered terminal symbols here for simplicity, since in a compiler front end their internal structure is usually not considered by the parser. The terminal symbol "^" denoted exponentiation in Algol 60.

 Expr → Term | Expr AddOp Term | AddOp Term
 Term → Factor | Term MulOp Factor
 Factor → Primary | Factor ^ Primary
 Primary → number | variable | ( Expr )
 AddOp → + | −
 MulOp → * | /
In step "START" of the above conversion algorithm, just a rule S_{0}→Expr is added to the grammar. After step "TERM", the grammar looks like this:

 S_{0} → Expr
 Expr → Term | Expr AddOp Term | AddOp Term
 Term → Factor | Term MulOp Factor
 Factor → Primary | Factor PowOp Primary
 Primary → number | variable | Open Expr Close
 AddOp → + | −
 MulOp → * | /
 PowOp → ^
 Open → (
 Close → )
After step "BIN", the following grammar is obtained:

 S_{0} → Expr
 Expr → Term | Expr AddOp_Term | AddOp Term
 Term → Factor | Term MulOp_Factor
 Factor → Primary | Factor PowOp_Primary
 Primary → number | variable | Open Expr_Close
 AddOp → + | −
 MulOp → * | /
 PowOp → ^
 Open → (
 Close → )
 AddOp_Term → AddOp Term
 MulOp_Factor → MulOp Factor
 PowOp_Primary → PowOp Primary
 Expr_Close → Expr Close
Since there are no ε-rules, step "DEL" doesn't change the grammar. After step "UNIT", the following grammar is obtained, which is in Chomsky normal form:

 S_{0} → number | variable | Open Expr_Close | Factor PowOp_Primary | Term MulOp_Factor | Expr AddOp_Term | AddOp Term
 Expr → number | variable | Open Expr_Close | Factor PowOp_Primary | Term MulOp_Factor | Expr AddOp_Term | AddOp Term
 Term → number | variable | Open Expr_Close | Factor PowOp_Primary | Term MulOp_Factor
 Factor → number | variable | Open Expr_Close | Factor PowOp_Primary
 Primary → number | variable | Open Expr_Close
 AddOp → + | −
 MulOp → * | /
 PowOp → ^
 Open → (
 Close → )
 AddOp_Term → AddOp Term
 MulOp_Factor → MulOp Factor
 PowOp_Primary → PowOp Primary
 Expr_Close → Expr Close
The N_{a} introduced in step "TERM" are PowOp, Open, and Close. The A_{i} introduced in step "BIN" are AddOp_Term, MulOp_Factor, PowOp_Primary, and Expr_Close.
Alternative definition
Chomsky reduced form
Another way^{[2]}^{:92}^{[7]} to define the Chomsky normal form is:
A formal grammar is in Chomsky reduced form if all of its production rules are of the form:
 A → BC, or
 A → a,
where A, B, and C are nonterminal symbols, and a is a terminal symbol. When using this definition, B or C may be the start symbol. Only those context-free grammars which do not generate the empty string can be transformed into Chomsky reduced form.
Floyd normal form
In a paper where he proposed the term Backus-Naur Form (BNF), Donald E. Knuth suggested that a BNF "syntax in which all definitions have such a form may be said to be in 'Floyd Normal Form'",
 ⟨A⟩ ::= ⟨B⟩ | ⟨C⟩, or
 ⟨A⟩ ::= ⟨B⟩⟨C⟩, or
 ⟨A⟩ ::= a,
where ⟨A⟩, ⟨B⟩, and ⟨C⟩ are nonterminal symbols, and a is a terminal symbol, because Robert W. Floyd found in 1961 that any BNF syntax can be converted to the above form.^{[8]} Knuth nonetheless withdrew the term, "since doubtless many people have independently used this simple fact in their own work, and the point is only incidental to the main considerations of Floyd's note."^{[8]}
Application
Besides its theoretical significance, CNF conversion is used as a preprocessing step in some algorithms, e.g., the CYK algorithm, a bottom-up parsing algorithm for context-free grammars, and its variant probabilistic CKY.^{[9]}
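To illustrate why the normal form is convenient for CYK, here is a minimal recognizer sketch. It assumes the grammar is already in Chomsky normal form and is given as a set of (left-hand side, right-hand side) pairs; the example grammar for the language {a^n b^n : n ≥ 1} is made up for this illustration.

```python
def cyk(word, rules, start):
    """Decide whether `word` is generated by a grammar in Chomsky normal form."""
    n = len(word)
    if n == 0:
        return (start, ()) in rules
    # table[i][l] = set of nonterminals deriving the substring word[i:i+l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, a in enumerate(word):
        table[i][1] = {lhs for lhs, rhs in rules if rhs == (a,)}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split]
                right = table[i + split][length - split]
                for lhs, rhs in rules:
                    if len(rhs) == 2 and rhs[0] in left and rhs[1] in right:
                        table[i][length].add(lhs)
    return start in table[0][n]


# Hypothetical CNF grammar for { a^n b^n : n >= 1 } with start symbol S0
rules = {
    ("S0", ("A", "T")), ("S0", ("A", "B")),
    ("S", ("A", "T")), ("S", ("A", "B")),
    ("T", ("S", "B")),
    ("A", ("a",)), ("B", ("b",)),
}
print(cyk("aabb", rules, "S0"))  # True
print(cyk("abab", rules, "S0"))  # False
```

Because every right-hand side has at most two symbols, each table entry only needs to consider a single split point per rule.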
See also
 Backus-Naur form
 CYK algorithm
 Greibach normal form
 Kuroda normal form
 Pumping lemma for context-free languages (its proof relies on the Chomsky normal form)
Notes
 [note 1] For example, Hopcroft, Ullman (1979) merged TERM and BIN into a single transformation.
 [note 2] indicating a kept and omitted nonterminal N by N and N̶, respectively
 [note 3] If the grammar had a rule S_{0} → ε, it could not be "inlined", since it had no "call sites". Therefore it couldn't be deleted in the next step.
 [note 4] i.e. written length, measured in symbols
References
 [1] Chomsky, Noam (1959). "On Certain Formal Properties of Grammars". Information and Control. 2: 137–167. doi:10.1016/S0019-9958(59)90362-6.
 [2] Hopcroft, John E.; Ullman, Jeffrey D. (1979). Introduction to Automata Theory, Languages and Computation. Reading, Massachusetts: Addison-Wesley Publishing. ISBN 020102988X.
 [3] Hopcroft, John E.; Motwani, Rajeev; Ullman, Jeffrey D. (2006). Introduction to Automata Theory, Languages, and Computation (3rd ed.). Addison-Wesley. ISBN 0321455363. Section 7.1.5, p. 272.
 [4] Rich, Elaine (2007). Automata, Computability, and Complexity: Theory and Applications (1st ed.). Prentice-Hall. ISBN 9780132288064.^{[page needed]}
 [5] Wegener, Ingo (1993). Theoretische Informatik - Eine algorithmenorientierte Einführung. Leitfäden und Monographien der Informatik. Stuttgart: B.G. Teubner. ISBN 9783519021230. Section 6.2 "Die Chomsky-Normalform für kontextfreie Grammatiken", pp. 149–152.
 [6] Lange, Martin; Leiß, Hans (2009). "To CNF or not to CNF? An Efficient Yet Presentable Version of the CYK Algorithm". Informatica Didactica. 8.
 [7] Hopcroft et al. (2006)^{[page needed]}
 [8] Knuth, Donald E. (December 1964). "Backus Normal Form vs. Backus Naur Form". Communications of the ACM. 7 (12): 735–736. doi:10.1145/355588.365140.
 [9] Jurafsky, Daniel; Martin, James H. (2008). Speech and Language Processing (2nd ed.). Pearson Prentice Hall. p. 465. ISBN 9780131873216.
Further reading
 Cole, Richard. Converting CFGs to CNF (Chomsky Normal Form), October 17, 2007. (pdf); uses the order TERM, BIN, START, DEL, UNIT.
 John Martin (2003). Introduction to Languages and the Theory of Computation. McGraw Hill. ISBN 0072322004. (Pages 237–240 of section 6.6: simplified forms and normal forms.)
 Michael Sipser (1997). Introduction to the Theory of Computation. PWS Publishing. ISBN 053494728X. (Pages 98–101 of section 2.1: context-free grammars. Page 156.)
 Sipser, Michael. Introduction to the Theory of Computation, 2nd edition.