v5.20.0 Update#

Sibling Projects#

There are ports of TatSu to Go and Rust. They are functionally complete, except for features (like synthetic classes) that rely on the dynamic nature of Python.

铁修 TieXiu#

铁修 TieXiu is the port of TatSu to Rust. It features a PyO3 interface, so it’s also a Python library, but benchmarks show that the pure-Python parsers generated by TatSu are still more performant when hosted from Python. See the TieXiu README for a discussion of the performance limits of PEG parsers.

⻰OGoPEGo#

⻰OGoPEGo is the port of TatSu to Go. The implementation, being the most mature, is beautifully concise, and using the generated parsers has a simplicity closer to the Python style allowed by TatSu.

Internals#

  • The algorithm for left-recursion analysis went through another round of simplification and optimization. Then the analysis done in pegen, a more efficient and theoretically sound approach, was evaluated. All tests pass with pegen’s SCC (Strongly Connected Components) algorithm, so the old, tried-and-tested algorithm in TatSu was replaced.

    Although left-recursion analysis is performed once per Grammar, before any parsing, a simpler implementation makes this core part of TatSu easier to maintain.
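
As a rough illustration of the SCC idea mentioned above, the sketch below runs Tarjan's algorithm (one standard way to compute strongly connected components) over a hypothetical graph that maps each rule to the rules that can appear leftmost in its alternatives. It is not the code used by TatSu or pegen: the graph, the function names, and the omission of nullable-prefix handling are simplifying assumptions made for the example.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm over a {rule: set_of_leftmost_rules} mapping."""
    index = {}       # discovery order of each visited rule
    lowlink = {}     # lowest discovery index reachable from the rule
    stack = []       # rules of the components still being built
    on_stack = set()
    components = []

    def visit(node):
        index[node] = lowlink[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for succ in graph.get(node, ()):
            if succ not in index:
                visit(succ)
                lowlink[node] = min(lowlink[node], lowlink[succ])
            elif succ in on_stack:
                lowlink[node] = min(lowlink[node], index[succ])
        if lowlink[node] == index[node]:
            # node is the root of a component: pop its members off the stack
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return components


def left_recursive_rules(graph):
    """Rules in a cycle of leftmost references (or with a self-loop)."""
    result = set()
    for component in strongly_connected_components(graph):
        if len(component) > 1:
            result |= component
        else:
            (rule,) = component
            if rule in graph.get(rule, ()):
                result.add(rule)
    return result


# Hypothetical "leftmost reference" graph for a toy grammar.
first_graph = {
    "expr": {"expr", "term"},    # expr: expr '+' term | term   (direct)
    "term": {"factor"},
    "factor": {"atom", "call"},
    "call": {"factor"},          # call: factor '(' ')'         (indirect)
    "atom": set(),
}

recursive = left_recursive_rules(first_graph)
```

Here `recursive` contains `expr` (direct left recursion) plus `factor` and `call` (indirect, through each other), while `term` and `atom` are left out.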

g2e (ANTLR to TatSu)#

  • The g2e (ANTLR grammar to TatSu) translator has been revived and significantly simplified. A working example (python3.tatsu, 551 lines) is generated from Python 3’s full ANTLR grammar and passes tatsu.compile().

  • Removed the regex conversion approach for ANTLR token rules. ANTLR lexer patterns (notably \uXXXX escapes) are not viable as Python regex patterns. Non-trivial token rules now emit Fail() instead of Pattern. The classes and methods TokenPattern, _token_expr_to_regex, _token_expr_to_regex_verbose, _decode_antlr_string, and _char_to_regex have been removed.

  • g2e replaces simple token definitions (like OPEN_PAREN : '(' {opened++;};) with their right-hand side (just '(') for better-looking grammars. For complex token definitions, ANTLR uses a special syntax that is not that of Python-compatible (PCRE2) regular expressions, so g2e omits them, leaving it to the user to decide how to handle those tokens. In many cases a single pattern match is enough for the grammar of interest, and a semantic rule may be added to validate additional conditions that the parsed token should meet.

  • Streamlined generated grammar output — removed unnecessary parenthesization:

    • Single token references in alternatives are no longer wrapped in extra parens: (NEWLINE) → NEWLINE.

    • Groups inside [...], {...}, {...}+ are unwrapped: [('as' NAME)] → ['as' NAME], {('.' NAME)} → {'.' NAME}.

    • Rule deduplication by name handles tokens {} declarations that collide with defined rules (e.g. INDENT/DEDENT).

  • Token name resolution now uses uppercase names consistently.

  • The g2e example (examples/g2e) uses the old, LL(1) Python grammar. Since Python adopted a PEG parser, its actual grammar is a much simpler one. The example is kept as it was to demonstrate g2e’s behavior over a complex grammar.
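
The removal of unnecessary parentheses described in the list above can be pictured with a small tree rewrite. The following is only a sketch under assumptions: the node classes (`Ref`, `Seq`, `Group`, `Opt`, `Closure`) and the `unwrap` and `render` helpers are hypothetical and are not g2e's actual model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:        # a rule/token reference or a literal such as 'as'
    text: str


@dataclass(frozen=True)
class Seq:        # a sequence of expressions
    items: tuple


@dataclass(frozen=True)
class Group:      # ( exp )
    exp: object


@dataclass(frozen=True)
class Opt:        # [ exp ]
    exp: object


@dataclass(frozen=True)
class Closure:    # { exp }
    exp: object


def unwrap(node):
    """Drop groups that add nothing to the structure of the expression."""
    if isinstance(node, Seq):
        return Seq(tuple(unwrap(item) for item in node.items))
    if isinstance(node, (Opt, Closure)):
        inner = unwrap(node.exp)
        if isinstance(inner, Group):
            inner = inner.exp  # the brackets or braces already delimit it
        return type(node)(inner)
    if isinstance(node, Group):
        inner = unwrap(node.exp)
        # a group around a single reference is redundant
        return inner if isinstance(inner, Ref) else Group(inner)
    return node


def render(node):
    """Print the expression using TatSu-like bracket notation."""
    if isinstance(node, Ref):
        return node.text
    if isinstance(node, Seq):
        return " ".join(render(item) for item in node.items)
    if isinstance(node, Group):
        return "(" + render(node.exp) + ")"
    if isinstance(node, Opt):
        return "[" + render(node.exp) + "]"
    return "{" + render(node.exp) + "}"


optional = Opt(Group(Seq((Ref("'as'"), Ref("NAME")))))
closure = Closure(Group(Seq((Ref("'.'"), Ref("NAME")))))
single = Group(Ref("NEWLINE"))

before = [render(n) for n in (optional, closure, single)]
after = [render(unwrap(n)) for n in (optional, closure, single)]
```

With these inputs, `before` holds the parenthesized forms and `after` holds the same expressions without the redundant groups, matching the three cases listed above.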

Tools#

  • A new --recursion-limit (-R1) option was added to the tatsu CLI tool so it can handle large and deeply recursive input grammars. When used as a library, the host program should call sys.setrecursionlimit() when required by the grammar complexity.

  • Added better rendering to FailedParse.__str__(). Now a code fragment and line numbers are shown, as in many modern tools.

    error: expecting 'world'
      --> example:1:7
       |
     1 | hello missing
       |       ^ expecting 'world'
    
      -> start
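
For the library case mentioned with the --recursion-limit option, one possible way to raise the limit only around the recursive work is a small context manager. This is a sketch, not part of TatSu: `recursion_limit` and `depth` are hypothetical names, and `depth` merely stands in for a deeply recursive compile or parse.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the interpreter recursion limit (never lowers it)."""
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        # restore the original limit once the recursive work is done
        sys.setrecursionlimit(previous)


def depth(n):
    """Stand-in for a deeply recursive parse: recurses n levels."""
    return 0 if n == 0 else 1 + depth(n - 1)


# 3000 nested calls exceed CPython's usual default limit of 1000.
with recursion_limit(5_000):
    reached = depth(3_000)
```

In a host program, the body of the `with` block would hold the call to tatsu.compile() or the parse itself, and the limit would be chosen according to the complexity of the grammar and input.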
    

JSON#

  • tatsu.ebnf defines rules for JSON literals, so true, false, and null may be used where previously only True, False, and None were recognized. The Python literals are still honored as before, as is the boolean rule resolving to True for non-falsy values. These literals are only used in grammar directives, as parsing is only interested in the strings that match a Token or Pattern.

  • Now a Grammar can be imported from the JSON produced by model.asjson(). The round trip has been tested and works. The new methods Grammar.load(value: Any) -> Grammar and Grammar.loads(value: str) -> Grammar make the functionality available.

    class Grammar:
        @staticmethod
        def load(value: Any) -> Grammar:
            from .json import load_grammar
            return load_grammar(value)
    
        @staticmethod
        def loads(value: str) -> Grammar:
            from .json import loads_grammar
    
            return loads_grammar(value)
    

Grammar Syntax#

  • The definition of the DEDENT rule in the TatSu grammar is used to support EBNF notations with no rule terminators and grammars with no blank lines between rules. The pattern used in the rule was incorrectly consuming the first non-space character of the next rule. Fixed.

    Now this is a valid EBNF definition:

    grammar = r"""
        @@grammar :: MiniJSON
        @@nameguard :: False
        @@whitespace :: /\s+/
        start: value $
    
        value: object | array | string | number | 'true' | 'false' | 'null'
    
        object: '{' members? '}'
        array: '[' elements? ']'
        members: pair (',' pair)*
        elements: value (',' value)*
        pair: string ':' value
        string: '"' CONTENT '"'
        CONTENT: /[^"]*/
        number: /-?\d+(\.\d+)?/
    """