Context-free grammars¶
[1]:
from tock import *
Creating CFGs¶
You can create a CFG either by reading it from a file (using Grammar.from_file) or from a list of strings (using Grammar.from_lines). The first rule's left-hand side is assumed to be the start symbol.
[2]:
g = Grammar.from_lines(["S -> a T b",
                        "S -> b",
                        "T -> T a",
                        "T -> &"])
g
[2]:
start: S
S → a T b
S → b
T → T a
T → ε
[3]:
g.nonterminals
[3]:
{'S', 'T'}
A Grammar can be any unrestricted (type-0) grammar, but currently the grammar-reading functions can only read CFGs.
[4]:
g.is_contextfree()
[4]:
True
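The check itself is simple to state: a grammar is context-free exactly when every rule's left-hand side is a single nonterminal. A minimal sketch of such a check (the rule representation here is illustrative, not Tock's internal one):

```python
# Sketch: a grammar is context-free when every rule's left-hand side
# is a single nonterminal. Rules are (lhs, rhs) pairs of symbol tuples;
# this representation is illustrative, not Tock's.
def is_contextfree(rules, nonterminals):
    return all(len(lhs) == 1 and lhs[0] in nonterminals
               for lhs, rhs in rules)

rules = [(('S',), ('a', 'T', 'b')),
         (('S',), ('b',)),
         (('T',), ('T', 'a')),
         (('T',), ())]          # T -> &  (empty right-hand side)
print(is_contextfree(rules, {'S', 'T'}))  # True
```

A type-0 rule like `a S -> b` would fail the check because its left-hand side has more than one symbol.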
Like RegularExpression objects, there isn't a lot you can do with Grammar objects other than convert them to automata.
From CFGs to PDAs¶
To convert a CFG to a PDA using a top-down construction, use the from_grammar function:
[5]:
m = from_grammar(g)
to_graph(m)
[6]:
run(m, "a a b", show_stack=4)
This diagram is a little hard to read, but one thing to note is the cycle between loop and 3.1. It is caused by the left-recursive rule T -> T a, which the automaton can apply an unbounded number of times.
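To see why left recursion produces this cycle, here is a minimal sketch (not Tock's construction) of top-down recognition for the same grammar: expanding T by T -> T a puts T right back on top of the stack, so the stack can grow without bound, and a search over configurations has to cap its work:

```python
# Sketch of top-down recognition for the grammar above. A configuration
# is (remaining input, stack); expanding the left-recursive rule T -> T a
# re-pushes T, so stacks grow unboundedly and the breadth-first search
# needs a step limit. Grammar hard-coded; not Tock's API.
from collections import deque

RULES = {'S': [('a', 'T', 'b'), ('b',)],
         'T': [('T', 'a'), ()]}

def accepts(w, start='S', limit=10000):
    seen, queue = set(), deque([(tuple(w), (start,))])
    while queue and limit:
        limit -= 1
        inp, stack = queue.popleft()
        if (inp, stack) in seen:
            continue
        seen.add((inp, stack))
        if not stack:
            if not inp:
                return True
            continue
        top, rest = stack[0], stack[1:]
        if top in RULES:                      # expand a nonterminal
            for rhs in RULES[top]:
                queue.append((inp, rhs + rest))
        elif inp and inp[0] == top:           # match a terminal
            queue.append((inp[1:], rest))
    return False

print(accepts(['a', 'a', 'b']))  # True: S => a T b => a T a b => a a b
```

Without the step limit, the search on a rejected string would chase ever-longer stacks produced by T -> T a and never terminate.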
Shift-reduce parsing¶
There’s also a bottom-up version:
[7]:
m = from_grammar(g, mode="bottomup")
to_graph(m)
[8]:
run(m, "a a b")
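A bottom-up recognizer works in the opposite direction: it either shifts the next input symbol onto the stack or reduces when the top of the stack matches some rule's right-hand side. A sketch of this idea as a breadth-first search over configurations (again illustrative, not Tock's construction):

```python
# Sketch of nondeterministic shift-reduce recognition: either shift the
# next input symbol onto the stack, or reduce a stack suffix that matches
# some rule's right-hand side. The empty rule T -> & makes the state
# space infinite, so the search is capped. Not Tock's construction.
from collections import deque

RULES = [('S', ('a', 'T', 'b')), ('S', ('b',)),
         ('T', ('T', 'a')), ('T', ())]

def recognize(w, start='S', limit=20000):
    seen, queue = set(), deque([((), tuple(w))])
    while queue and limit:
        limit -= 1
        stack, inp = queue.popleft()
        if (stack, inp) in seen:
            continue
        seen.add((stack, inp))
        if stack == (start,) and not inp:
            return True
        if inp:                               # shift
            queue.append((stack + (inp[0],), inp[1:]))
        for lhs, rhs in RULES:                # reduce
            n = len(rhs)
            if len(stack) >= n and stack[len(stack) - n:] == rhs:
                queue.append((stack[:len(stack) - n] + (lhs,), inp))
    return False

print(recognize(['a', 'a', 'b']))  # True
```

On "a a b" one accepting sequence is: shift a; reduce T -> &; shift a; reduce T -> T a; shift b; reduce S -> a T b.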
LL(1) parsing¶
With mode='ll1', from_grammar instead builds a predictive (LL(1)) parser, which is deterministic:
[9]:
g = Grammar.from_lines(["S -> a S c",
                        "S -> T",
                        "T -> b T",
                        "T -> &"])
m = from_grammar(g, mode='ll1')
m
[10]:
m.is_deterministic()
[10]:
True
[11]:
run(m, "a a b c c", show_stack=5).only_path()
[11]:
start | [a] a b c c | ε |
loop | [a] a b c c | [S] $ |
a | [a] b c c | [S] $ |
a | [a] b c c | [a] S c $ |
loop | [a] b c c | [S] c $ |
a | [b] c c | [S] c $ |
a | [b] c c | [a] S c c $ |
loop | [b] c c | [S] c c $ |
b | [c] c | [S] c c $ |
b | [c] c | [T] c c $ |
b | [c] c | [b] T c c $ |
loop | [c] c | [T] c c $ |
c | c | [T] c c $ |
c | c | [c] c $ |
loop | c | [c] $ |
c | ε | [c] $ |
loop | ε | $ |
⊣ | ε | $ |
accept | ε | ε |
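The determinism comes from an LL(1) parsing table: for each pair of a nonterminal and the next input symbol, there is at most one production to predict. Here is a sketch of a table-driven parser for this grammar, with the table written out by hand from the FIRST/FOLLOW sets (the standard textbook construction, not Tock's internals):

```python
# Sketch of a table-driven LL(1) parser for S -> a S c | T, T -> b T | &.
# FIRST(T) = {b, &}, FOLLOW(S) = FOLLOW(T) = {c, $}, which gives this
# table; '$' marks the end of input. Hand-built, not Tock's table.
TABLE = {('S', 'a'): ('a', 'S', 'c'),
         ('S', 'b'): ('T',),
         ('S', 'c'): ('T',),
         ('S', '$'): ('T',),
         ('T', 'b'): ('b', 'T'),
         ('T', 'c'): (),
         ('T', '$'): ()}

def parse(w):
    inp = list(w) + ['$']
    stack = ['S', '$']
    while stack:
        top = stack.pop(0)
        if top == inp[0]:              # match a terminal (or $)
            inp.pop(0)
        elif (top, inp[0]) in TABLE:   # predict a production
            stack = list(TABLE[top, inp[0]]) + stack
        else:
            return False
    return not inp

print(parse('aabcc'))  # True
```

Because every table cell holds at most one production, the parser never has to guess, matching the is_deterministic() result above.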
LR parsing¶
Construction of an LR parser involves building the LR automaton:
[17]:
g = Grammar.from_lines([
'S -> T -|',
'T -> T l T r',
'T -> &'
])
m = from_grammar(g, mode='lr0')
m
[14]:
m.is_deterministic()
[14]:
True
[15]:
run(m, 'l l r l r r -|', show_stack=15).only_path()
[15]:
start' | [l] l r l r r ⊣ | ε |
start | [l] l r l r r ⊣ | 2 |
loop | [l] l r l r r ⊣ | [3] $ 2 |
loop | [l] l r l r r ⊣ | [7] T 3 $ 2 |
loop | [l] r l r r ⊣ | [1] l 7 T 3 $ 2 |
loop | [l] r l r r ⊣ | [4] T 1 l 7 T 3 $ 2 |
loop | [r] l r r ⊣ | [1] l 4 T 1 l 7 T 3 $ 2 |
loop | [r] l r r ⊣ | [4] T 1 l 4 T 1 l 7 T 3 $ 2 |
loop | [l] r r ⊣ | [5] r 4 T 1 l 4 T 1 l 7 T 3 $ 2 |
loop | [l] r r ⊣ | [4] T 1 l 7 T 3 $ 2 |
loop | [r] r ⊣ | [1] l 4 T 1 l 7 T 3 $ 2 |
loop | [r] r ⊣ | [4] T 1 l 4 T 1 l 7 T 3 $ 2 |
loop | [r] ⊣ | [5] r 4 T 1 l 4 T 1 l 7 T 3 $ 2 |
loop | [r] ⊣ | [4] T 1 l 7 T 3 $ 2 |
loop | ⊣ | [5] r 4 T 1 l 7 T 3 $ 2 |
loop | ⊣ | [7] T 3 $ 2 |
loop | ε | [6] ⊣ 7 T 3 $ 2 |
loop | ε | [8] S 3 $ 2 |
accept | ε | 2 |
You can visualize the LR automaton using grammars.lr_automaton, and you can also build an LR(1) parser using mode='lr1'.
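The core step in building the LR automaton is the closure of a set of LR(0) items: whenever an item's dot sits before a nonterminal, all of that nonterminal's rules are pulled in as fresh items. A sketch for the grammar above (the item representation is illustrative, not Tock's):

```python
# Sketch of LR(0) item-set closure for S -> T -|, T -> T l T r, T -> &.
# An item is (lhs, rhs, dot); a dot before a nonterminal pulls in that
# nonterminal's rules with the dot at position 0. Illustrative only.
RULES = [('S', ('T', '-|')), ('T', ('T', 'l', 'T', 'r')), ('T', ())]
NONTERMINALS = {'S', 'T'}

def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, body, dot in list(items):
            if dot < len(body) and body[dot] in NONTERMINALS:
                for l, r in RULES:
                    if l == body[dot] and (l, r, 0) not in items:
                        items.add((l, r, 0))
                        changed = True
    return items

start_state = closure({('S', ('T', '-|'), 0)})
print(len(start_state))  # 3 items
```

Here the dot before T in S -> . T -| pulls in T -> . T l T r and T -> ., giving the automaton's start state; the remaining states come from shifting the dot over each symbol and closing again.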
From PDAs to CFGs¶
The conversion in the reverse direction, from PDA to CFG, is actually related to the algorithm that Tock uses internally to simulate PDAs. We use the remove_useless method to reduce the size of the converted CFG.
[16]:
m = read_csv('examples/sipser-2-14.csv')
g = to_grammar(m).remove_useless()
g
[16]:
start: (start,accept)
(q2,q3) → 0 (q2,q2) 1
(q2,q3) → 0 (q2,q3) 1
(q1,q4) → (q2,q3)
(start,accept) → (q1,q4)
(q1,q1) → (q1,q1) (q1,q1)
(q1,q4) → (q1,q1) (q1,q4)
(q1,q4) → (q1,q4) (q4,q4)
(q4,q4) → (q4,q4) (q4,q4)
(q3,q3) → (q3,q3) (q3,q3)
(q2,q3) → (q2,q3) (q3,q3)
(q2,q3) → (q2,q2) (q2,q3)
(q2,q2) → (q2,q2) (q2,q2)
(q1,q1) → ε
(q4,q4) → ε
(q3,q3) → ε
(q2,q2) → ε
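The idea behind remove_useless can be sketched as a two-pass filter: first keep only nonterminals that generate some terminal string, then keep only rules reachable from the start symbol. An illustrative sketch (not Tock's implementation), using a tiny hypothetical grammar with one non-generating and one unreachable symbol:

```python
# Sketch of useless-symbol removal: pass 1 keeps "generating"
# nonterminals (those deriving some terminal string); pass 2 keeps only
# rules reachable from the start symbol. Illustrative, not Tock's code.
def remove_useless(rules, start):
    nonterms = {lhs for lhs, _ in rules}
    generating = set()
    changed = True
    while changed:                      # pass 1: generating symbols
        changed = False
        for lhs, rhs in rules:
            if lhs not in generating and all(
                    s in generating or s not in nonterms for s in rhs):
                generating.add(lhs)
                changed = True
    rules = [(l, r) for l, r in rules
             if l in generating
             and all(s in generating or s not in nonterms for s in r)]
    reachable, frontier = {start}, [start]
    while frontier:                     # pass 2: reachable symbols
        sym = frontier.pop()
        for lhs, rhs in rules:
            if lhs == sym:
                for s in rhs:
                    if s in nonterms and s not in reachable:
                        reachable.add(s)
                        frontier.append(s)
    return [(l, r) for l, r in rules if l in reachable]

rules = [('S', ('A',)), ('A', ('a',)),
         ('B', ('S', 'b')),   # B is unreachable from S
         ('C', ('C', 'c'))]   # C never derives a terminal string
print(remove_useless(rules, 'S'))  # [('S', ('A',)), ('A', ('a',))]
```

The order matters: removing non-generating symbols first can make further symbols unreachable, so the reachability pass comes second.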