# v5.20.0 Update

## Sibling Projects

There are ports of **TatSu** to `Go` and `Rust`. They are functionally complete, except for features (like synthetic classes) that rely on the dynamic nature of Python.

[铁修 TieXiu]: https://github.com/neogeny/tiexiu
[⻰OGoPEGo]: https://github.com/neogeny/ogopego
[PyO3]: https://pyo3.rs/v0.28.3/
[README]: https://github.com/neogeny/tiexiu/blob/main/README.md

### 铁修 TieXiu

[铁修 TieXiu][] is the port of **TatSu** to `Rust`. It features a [PyO3][] interface, so it is also a Python library, but benchmarks show that the pure-Python parsers generated by **TatSu** are still more performant when hosted from Python. See the **TieXiu** [README][] for a discussion of the performance limits of PEG parsers.

### ⻰OGoPEGo

[⻰OGoPEGo][] is the port of **TatSu** to `Go`. The implementation, being the most mature, is beautifully concise, and using the generated parsers has a simplicity closer to the Python style allowed by **TatSu**.

## Internals

* The algorithm for left-recursion analysis went through another round of simplification and optimization. Then the analysis done in [pegen][], a more efficient and theoretically sound approach, was evaluated. All tests pass with [pegen][]'s `SCC` (_Strongly Connected Components_) algorithm, so it replaced the old, well-tried algorithm in **TatSu**. Although left-recursion analysis is performed only once per `Grammar`, before any parsing, a simpler implementation makes this core part of **TatSu** easier to maintain.

[pegen]: https://we-like-parsers.github.io/pegen/grammar.html

## g2e (ANTLR to TatSu)

* The `g2e` (ANTLR grammar to TatSu) translator has been revived and significantly simplified. A working example (`python3.tatsu`, 551 lines) is generated from Python 3's full ANTLR grammar and passes `tatsu.compile()`.

* Removed the regex conversion approach for ANTLR token rules. ANTLR lexer patterns (notably `\uXXXX` escapes) are not viable as Python regex patterns.
  Non-trivial token rules now emit `Fail()` instead of `Pattern`. The classes and methods `TokenPattern`, `_token_expr_to_regex`, `_token_expr_to_regex_verbose`, `_decode_antlr_string`, and `_char_to_regex` have been removed.

* `g2e` replaces simple token definitions (like `OPEN_PAREN : '(' {opened++;};`) with their right-hand side (just `'('`), which makes for better-looking grammars. For complex token definitions, ANTLR uses a special syntax that is not that of Python-compatible (PCRE2) regular expressions, so `g2e` omits them and leaves it to the user to decide how to handle those tokens. In many cases a single pattern match is enough for the grammar of interest, and a semantic rule may be added to validate additional conditions that the parsed token should meet.

* Streamlined the generated grammar output by removing unnecessary parenthesization:
  - Single token references in alternatives are no longer wrapped in extra parentheses: `(NEWLINE)` → `NEWLINE`.
  - Groups inside `[...]`, `{...}`, and `{...}+` are unwrapped: `[('as' NAME)]` → `['as' NAME]`, `{('.' NAME)}` → `{'.' NAME}`.
  - Rule deduplication by name handles `tokens {}` declarations that collide with defined rules (e.g. `INDENT`/`DEDENT`).

* Token name resolution now uses uppercase names consistently.

* The `g2e` example (`examples/g2e`) uses the _old_, LL(1) Python grammar. Since Python adopted a PEG parser, the actual grammar is a much simpler one. The example is kept as it was to demonstrate `g2e`'s behavior over a complex grammar.

## Tools

* A new `--recursion-limit` (`-R1`) option was added to the `tatsu` CLI tool so it can handle large and deeply recursive input grammars. When **TatSu** is used as a library, the host program should call `sys.setrecursionlimit()` when the complexity of the grammar requires it.

* Added better rendering to `FailedParse.__str__()`. A code fragment and line numbers are now shown, as in many modern tools:

```console
error: expecting 'world'
 --> example:1:7
  |
1 | hello missing
  |       ^ expecting 'world'
  -> start
```

## JSON

* `tatsu.ebnf` defines rules for JSON literals, so `true`, `false`, and `null` may be used where previously only `True`, `False`, and `None` were recognized. The Python literals are still honored as before, and the `boolean` rule still resolves to `True` for non-falsy values. These literals are only used in grammar directives, because parsing is only concerned with the strings that match a `Token` or `Pattern`.

* A `Grammar` can now be imported from the JSON produced by `model.asjson()`. The round trip has been tested and works. The new methods `Grammar.load(value: Any) -> Grammar` and `Grammar.loads(json: str) -> Grammar` make the functionality available.

```python
class Grammar:
    @staticmethod
    def load(value: Any) -> Grammar:
        from .json import load_grammar
        return load_grammar(value)

    @staticmethod
    def loads(value: str) -> Grammar:
        from .json import loads_grammar
        return loads_grammar(value)
```

## Grammar Syntax

* The definition of the `DEDENT` rule in the **TatSu** grammar is used to support EBNF notations with no rule terminators and grammars with no blank lines between rules. The pattern used in the rule was incorrectly consuming the first non-space character that starts the next rule. This is fixed, and the following is now a valid EBNF definition:

```python
grammar = r"""
@@grammar :: MiniJSON
@@nameguard :: False
@@whitespace :: /\s+/
start: value $
value: object | array | string | number | 'true' | 'false' | 'null'
object: '{' members? '}'
array: '[' elements? ']'
members: pair (',' pair)*
elements: value (',' value)*
pair: string ':' value
string: '"' CONTENT '"'
CONTENT: /[^"]*/
number: /-?\d+(\.\d+)?/
"""
```
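
The kind of bug described for `DEDENT` can be illustrated with plain Python regular expressions. The two patterns below are hypothetical and are not the ones used in the **TatSu** grammar; they only sketch the difference between a pattern that consumes the first non-space character of the next rule and one that merely looks ahead at it.

```python
import re

# Two rules on consecutive lines, with no terminator and no blank line between them.
SOURCE = "start: value $\nvalue: object | array\n"

# Hypothetical patterns, for illustration only (not TatSu's DEDENT rule).
# Both detect "a newline followed by a non-space character".
CONSUMING = re.compile(r"\n\S")      # also swallows the non-space character
LOOKAHEAD = re.compile(r"\n(?=\S)")  # only peeks at it


def rest_after(pattern: re.Pattern, text: str) -> str:
    """Return the input left over after the first match of `pattern`."""
    match = pattern.search(text)
    if match is None:
        raise ValueError("no dedent found")
    return text[match.end():]


print(repr(rest_after(CONSUMING, SOURCE)))  # 'alue: object | array\n'
print(repr(rest_after(LOOKAHEAD, SOURCE)))  # 'value: object | array\n'
```

With the consuming pattern the next rule would start at `alue`, so its name could no longer be parsed; the lookahead leaves `value` intact.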