v5.20.0 Update#
Sibling Projects#
There are ports of TatSu to Go and Rust. They are functionally complete, except for features (like synthetic classes) that rely on the dynamic nature of Python.
铁修 TieXiu#
铁修 TieXiu is the port of TatSu to Rust. It features a PyO3 interface, so it’s also a Python library, but the benchmarks show that the pure-Python parsers generated by TatSu are still more performant when hosted from Python. See the TieXiu README for a discussion of the performance limits of PEG parsers.
⻰OGoPEGo#
⻰OGoPEGo is the port of TatSu to Go. The implementation, being the most mature, is beautifully concise, and using the generated parsers has a simplicity closer to the Python style allowed by TatSu.
Internals#
The algorithm for left-recursion analysis went through another round of simplification and optimization. Then the analysis done in pegen, a more efficient and theoretically sound approach, was evaluated. All tests pass with pegen’s SCC (Strongly Connected Components) algorithm, so the old-and-tried algorithm in TatSu was replaced.

Although left-recursion analysis is performed once per Grammar, before any parsing, a simpler implementation makes this core part of TatSu easier to maintain.
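As an illustration of the idea behind SCC-based analysis, here is a minimal sketch using Tarjan's algorithm. It is not the code in TatSu or pegen. It assumes a hypothetical `first_calls` mapping from each rule to the rules it may invoke in leftmost position, and it ignores nullable prefixes, which a real analysis has to account for. A rule is reported as left-recursive when it belongs to a cycle of leftmost calls.

```python
from collections.abc import Iterator


def strongly_connected_components(graph: dict[str, list[str]]) -> Iterator[list[str]]:
    """Yield the strongly connected components of `graph` (Tarjan's algorithm)."""
    index: dict[str, int] = {}    # discovery order of each node
    lowlink: dict[str, int] = {}  # smallest index reachable from the node
    stack: list[str] = []
    on_stack: set[str] = set()

    def visit(node: str) -> Iterator[list[str]]:
        index[node] = lowlink[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for succ in graph.get(node, []):
            if succ not in index:
                yield from visit(succ)
                lowlink[node] = min(lowlink[node], lowlink[succ])
            elif succ in on_stack:
                lowlink[node] = min(lowlink[node], index[succ])
        if lowlink[node] == index[node]:
            # `node` is the root of a component: pop its members off the stack.
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            yield component

    for node in graph:
        if node not in index:
            yield from visit(node)


def left_recursive_rules(first_calls: dict[str, list[str]]) -> set[str]:
    """Return the rules that can reach themselves through leftmost calls."""
    result: set[str] = set()
    for component in strongly_connected_components(first_calls):
        head = component[0]
        # A component is recursive if it has several rules or a self-loop.
        if len(component) > 1 or head in first_calls.get(head, []):
            result.update(component)
    return result


# Hypothetical grammar used only for this sketch:
#   expr:   expr '+' term | term
#   term:   factor
#   factor: '(' expr ')' | NUMBER
first_calls = {
    "expr": ["expr", "term"],
    "term": ["factor"],
    "factor": [],  # starts with a token, so no leftmost rule call
}
print(left_recursive_rules(first_calls))  # {'expr'}
```

Grouping rules by component handles direct and indirect left recursion in the same pass, which is one reason an SCC formulation tends to be simpler to maintain than ad hoc cycle tracking.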
g2e (ANTLR to TatSu)#
The g2e (ANTLR grammar to TatSu) translator has been revived and significantly simplified. A working example (python3.tatsu, 551 lines) is generated from Python 3’s full ANTLR grammar and passes tatsu.compile().

Removed the regex conversion approach for ANTLR token rules. ANTLR lexer patterns (notably \uXXXX escapes) are not viable as Python regex patterns. Non-trivial token rules now emit Fail() instead of Pattern. The classes and methods TokenPattern, _token_expr_to_regex, _token_expr_to_regex_verbose, _decode_antlr_string, and _char_to_regex have been removed.

g2e replaces simple token definitions (like OPEN_PAREN : '(' {opened++;};) with their right-hand side (just '(') for better-looking grammars. For complex token definitions ANTLR uses a special syntax which is not that of Python-compatible (PCRE2) regular expressions, so g2e omits them, leaving it to the user to decide how to handle those tokens. In many cases a single pattern match is enough for the grammar of interest, and a semantic rule may be added to validate additional conditions that the parsed token should meet.

Streamlined generated grammar output by removing unnecessary parenthesization:
Single token references in alternatives are no longer wrapped in extra parens: (NEWLINE) → NEWLINE.
Groups inside [...], {...}, {...}+ are unwrapped: [('as' NAME)] → ['as' NAME], {('.' NAME)} → {'.' NAME}.
Rule deduplication by name handles tokens {} declarations that collide with defined rules (e.g. INDENT/DEDENT).
Token name resolution now uses uppercase names consistently.
The g2e example (examples/g2e) uses the old, LL(1) Python grammar. Since Python moved to a PEG parser, the actual grammar is now a much simpler one. The example is kept as it was to demonstrate g2e’s behavior over a complex grammar.
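The advice above about omitted token rules (one broad pattern, plus a semantic rule for the extra conditions) can be sketched with the standard library alone. Everything named here is hypothetical: `NAME_PATTERN`, `NameSemantics`, `match_name`, and the local `FailedSemantics` class, which is only a stand-in and not TatSu's own exception. The method is named after the rule, in the style TatSu uses for semantic actions, but it is called by hand in this sketch.

```python
import keyword
import re

# Hypothetical replacement for an omitted ANTLR token rule: one broad pattern
# that accepts any identifier-like word.
NAME_PATTERN = re.compile(r"[^\W\d]\w*")


class FailedSemantics(Exception):
    """Stand-in for a semantic failure; defined locally for this sketch."""


class NameSemantics:
    def NAME(self, text: str) -> str:
        # The extra condition the pattern alone cannot express:
        # reserved words are not valid names.
        if keyword.iskeyword(text):
            raise FailedSemantics(f"{text!r} is a reserved word")
        return text


def match_name(source: str, pos: int = 0) -> str:
    """Match the pattern at `pos`, then apply the semantic check."""
    m = NAME_PATTERN.match(source, pos)
    if m is None:
        raise FailedSemantics(f"expecting NAME at position {pos}")
    return NameSemantics().NAME(m.group())


print(match_name("spam = 1"))  # spam
```

Keeping the pattern permissive and moving the validation into a semantic rule avoids translating ANTLR's lexer syntax into a regular expression at all.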
Tools#
A new --recursion-limit (-R1) option was added to the tatsu CLI tool so it can handle large and deeply recursive input grammars. When used as a library, the host program should call sys.setrecursionlimit() when required by the grammar complexity.

Added better rendering to FailedParse.__str__(). Now a code fragment and line numbers are shown, as in many modern tools.

    error: expecting 'world'
     --> example:1:7
      |
    1 | hello missing
      |       ^ expecting 'world' -> start
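For library use, one way to follow the sys.setrecursionlimit() advice above is to raise the limit only around the parsing call and restore it afterwards. This is a sketch under assumptions: `recursion_limit` is a hypothetical helper, not part of TatSu, and `depth` is a stand-in for a deeply recursive parse.

```python
import sys
from collections.abc import Iterator
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit: int) -> Iterator[None]:
    """Temporarily raise the interpreter's recursion limit."""
    previous = sys.getrecursionlimit()
    # Only ever raise the limit; never lower what the host already set.
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        sys.setrecursionlimit(previous)


def depth(nested: list) -> int:
    """Recursive stand-in for a deeply recursive parse."""
    return 1 + depth(nested[0]) if nested else 0


# Build a list nested 2000 levels deep without recursing.
nested: list = []
for _ in range(2000):
    nested = [nested]

with recursion_limit(sys.getrecursionlimit() + 5000):
    print(depth(nested))  # 2000
```

Scoping the change to a context manager keeps the host program's own setting intact once parsing is done.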
JSON#
tatsu.ebnf defines rules for JSON literals, so true, false, and null may be used where previously only True, False, and None were recognized. The Python literals are still honored as before, as is the boolean rule resolving to True for non-falsy values. These literals are only used in grammar directives, as parsing is only interested in the strings that match a Token or Pattern.

Now a Grammar can be imported from the JSON produced by model.asjson(). The roundtrip has been tested and works. New methods Grammar.load(value: Any) -> Grammar and Grammar.loads(json: str) -> Grammar make the functionality available.

    class Grammar:
        @staticmethod
        def load(value: Any) -> Grammar:
            from .json import load_grammar
            return load_grammar(value)

        @staticmethod
        def loads(value: str) -> Grammar:
            from .json import loads_grammar
            return loads_grammar(value)
Grammar Syntax#
The definition of the DEDENT rule in the TatSu grammar is used to support EBNF notations with no rule terminators and grammars with no blank lines between rules. The pattern used in the rule was incorrectly consuming the first non-space character starting the next rule. Fixed.

Now this is a valid EBNF definition:

    grammar = r"""
    @@grammar :: MiniJSON
    @@nameguard :: False
    @@whitespace :: /\s+/
    start: value $
    value: object | array | string | number | 'true' | 'false' | 'null'
    object: '{' members? '}'
    array: '[' elements? ']'
    members: pair (',' pair)*
    elements: value (',' value)*
    pair: string ':' value
    string: '"' CONTENT '"'
    CONTENT: /[^"]*/
    number: /-?\d+(\.\d+)?/
    """