<!--
Copyright (c) 2017-2026 Juancarlo Añez (apalala@gmail.com)
SPDX-License-Identifier: BSD-4-Clause
-->

# v5.20.0 Update

## Sibling Projects

There are ports of **TatSu** to `Go` and `Rust`. They are functionally complete except for features (like synthetic classes) that rely on the dynamic nature of Python.

[铁修 TieXiu]: https://github.com/neogeny/tiexiu
[⻰OGoPEGo]: https://github.com/neogeny/ogopego 
[PyO3]: https://pyo3.rs/v0.28.3/
[README]: https://github.com/neogeny/tiexiu/blob/main/README.md

### 铁修 TieXiu 

[铁修 TieXiu][] is the port of **TatSu** to `Rust`. It features a [PyO3][] interface, so it's also a Python library, but the benchmarks show that the pure-Python parsers generated by **TatSu** still perform better when the host is Python. See the **TieXiu** [README][] for a discussion of the performance limits of PEG parsers.

### ⻰OGoPEGo

[⻰OGoPEGo][] is the port of **TatSu** to `Go`. The implementation, being the most mature, is beautifully concise, and using the generated parsers has a simplicity closer to the Python style allowed by **TatSu**.


## Internals

* The algorithm for left-recursion analysis went through another round of simplification and optimization. Then the analysis done in [pegen][], a more efficient and theoretically sound approach, was evaluated. All tests pass with [pegen][]'s `SCC` (_Strongly Connected Components_) algorithm, so the old-and-tried algorithm in **TatSu** was replaced.

  Although left-recursion analysis is performed once per `Grammar`, before any parsing, a simpler implementation makes this core part of **TatSu** easier to maintain.
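
As a rough illustration of the idea (not the code in **TatSu** or [pegen][]), the sketch below runs Tarjan's algorithm over a graph that maps each rule to the rules it can reference in leftmost position. A rule is flagged as left-recursive when it belongs to a component with more than one member or references itself. The `first_refs` graphs and the function names are hypothetical, and the sketch ignores nullable prefixes, which a real analysis has to account for.

```python
from collections.abc import Mapping, Sequence


def strongly_connected_components(
    graph: Mapping[str, Sequence[str]],
) -> list[list[str]]:
    """Tarjan's algorithm: return the SCCs of a directed graph."""
    index: dict[str, int] = {}    # discovery order of each node
    lowlink: dict[str, int] = {}  # smallest index reachable from the node
    stack: list[str] = []
    on_stack: set[str] = set()
    result: list[list[str]] = []

    def visit(node: str) -> None:
        index[node] = lowlink[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for succ in graph.get(node, ()):
            if succ not in index:
                visit(succ)
                lowlink[node] = min(lowlink[node], lowlink[succ])
            elif succ in on_stack:
                lowlink[node] = min(lowlink[node], index[succ])
        if lowlink[node] == index[node]:
            # node is the root of a component: pop its members off the stack
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            result.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return result


def left_recursive_rules(first_refs: Mapping[str, Sequence[str]]) -> set[str]:
    """Rules that can reach themselves through leftmost references."""
    recursive: set[str] = set()
    for component in strongly_connected_components(first_refs):
        head = component[0]
        if len(component) > 1 or head in first_refs.get(head, ()):
            recursive.update(component)
    return recursive


# Hypothetical grammar:
#   expr: expr '+' term | term ; term: factor ; factor: NUMBER | '(' expr ')'
direct = {"expr": ["expr", "term"], "term": ["factor"], "factor": []}
# Hypothetical indirect cycle: a -> b -> a, with c only pointing into it.
indirect = {"a": ["b"], "b": ["a"], "c": ["a"]}

print(sorted(left_recursive_rules(direct)))    # ['expr']
print(sorted(left_recursive_rules(indirect)))  # ['a', 'b']
```

The nested `visit()` is recursive for brevity; an implementation meant for very large grammars would likely use an explicit stack.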

[pegen]: https://we-like-parsers.github.io/pegen/grammar.html

## g2e (ANTLR to TatSu)

* The `g2e` (ANTLR grammar to TatSu) translator has been revived and significantly simplified. A working example (`python3.tatsu`, 551 lines) is generated from Python 3's full ANTLR grammar and passes `tatsu.compile()`.

* Removed the regex conversion approach for ANTLR token rules. ANTLR lexer patterns (notably `\uXXXX` escapes) are not viable as Python regex patterns. Non-trivial token rules now emit `Fail()` instead of `Pattern`. The classes and methods `TokenPattern`, `_token_expr_to_regex`, `_token_expr_to_regex_verbose`, `_decode_antlr_string`, and `_char_to_regex` have been removed.

* `g2e` substitutes simple token definitions (like `OPEN_PAREN : '(' {opened++;};`) with their right-hand side (just `'('`) for better-looking grammars. For complex token definitions ANTLR uses a special syntax that is not the syntax of Python-compatible (PCRE2) regular expressions, so `g2e` omits them, leaving it to the user to decide how to handle those tokens. In many cases a single pattern match is enough for the grammar of interest, and a semantic rule may be added to validate additional conditions that the parsed token should meet.

* Streamlined the generated grammar output by removing unnecessary parenthesization and duplicate rules:
  - Single token references in alternatives no longer wrapped in extra parens: `(NEWLINE)` → `NEWLINE`.
  - Groups inside `[...]`, `{...}`, `{...}+` unwrapped: `[('as' NAME)]` → `['as' NAME]`, `{('.' NAME)}` → `{'.' NAME}`.
  - Rule deduplication by name handles `tokens {}` declarations that collide with defined rules (e.g. `INDENT`/`DEDENT`).

* Token name resolution now uses uppercase names consistently.

* The `g2e` example (`examples/g2e`) uses the _old_, LL(1) Python grammar. Since Python moved to a PEG parser, its actual grammar is a much simpler one. The example is kept as it was to demonstrate `g2e`'s behavior on a complex grammar.
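
For the tokens that `g2e` leaves to the user, the "single pattern plus a semantic rule" approach mentioned above could look like the following sketch. It assumes a hand-written grammar rule named `NAME` that matches a loose pattern such as `/\w+/`. The rule name, the class name, and the identifier constraint are all hypothetical. The sketch raises `ValueError` to stay within the standard library; in a **TatSu** parser, raising `tatsu.exceptions.FailedSemantics` is likely the better choice, since it is meant to turn a semantic check into a parse failure.

```python
import re


class TokenSemantics:
    """Semantic actions: the method name matches the grammar rule name."""

    # Hypothetical constraint: identifiers must not start with a digit.
    _IDENTIFIER = re.compile(r"[^\W\d]\w*")

    def NAME(self, ast):
        # `ast` is the text matched by the (assumed) loose NAME pattern.
        text = str(ast)
        if not self._IDENTIFIER.fullmatch(text):
            raise ValueError(f"not a valid NAME token: {text!r}")
        return text
```

An instance of such a class would typically be passed as the `semantics` argument when parsing.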

## Tools

* A new `--recursion-limit` (`-R1`) option was added to the `tatsu` CLI tool so it can handle large and deeply recursive input grammars. When used as a library, the host program should call `sys.setrecursionlimit()` when required by the grammar complexity.

* Added better rendering to `FailedParse.__str__()`. Now a code fragment and line numbers are shown, as in many modern tools.

    ```console
    error: expecting 'world'
      --> example:1:7
       |
     1 | hello missing
       |       ^ expecting 'world'

      -> start
    ```
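
For the library case mentioned with `--recursion-limit`, one way to keep the change local is a small context manager around `sys.setrecursionlimit()`. This is a minimal sketch; the `recursion_limit` helper is hypothetical and not part of **TatSu**.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit: int):
    """Temporarily raise the interpreter's recursion limit."""
    previous = sys.getrecursionlimit()
    # Never lower the limit below what is already in effect.
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        # Restore the limit that was active on entry.
        sys.setrecursionlimit(previous)
```

The host program would wrap only the deeply recursive call, for example `with recursion_limit(10_000): ...` around the parse, so the rest of the program keeps the default limit.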

## JSON

* `tatsu.ebnf` defines rules for JSON literals, so `true`, `false`, and `null`
  may be used where previously only `True`, `False`, and `None` were recognized.
  The Python literals are still honored as before, and the `boolean` rule still resolves to `True` for non-falsy values. These literals are only used in grammar directives, as parsing is only interested in the strings that match a `Token` or `Pattern`.

* A `Grammar` can now be imported from the JSON produced by `model.asjson()`, and the round trip has been tested. The new methods `Grammar.load(value: Any) -> Grammar` and `Grammar.loads(value: str) -> Grammar` make the functionality available.

    ```python
    class Grammar:
        @staticmethod
        def load(value: Any) -> Grammar:
            from .json import load_grammar

            return load_grammar(value)

        @staticmethod
        def loads(value: str) -> Grammar:
            from .json import loads_grammar

            return loads_grammar(value)
    ```

## Grammar Syntax

* The definition of the `DEDENT` rule in the **TatSu** grammar is used to support EBNF notations with no rule terminators and grammars with no blank lines between rules. The pattern used in the rule was incorrectly consuming the first non-space character of the next rule. Fixed.

  Now this is a valid EBNF definition:

    ```python
    grammar = r"""
        @@grammar :: MiniJSON
        @@nameguard :: False
        @@whitespace :: /\s+/
        start: value $

        value: object | array | string | number | 'true' | 'false' | 'null'

        object: '{' members? '}'
        array: '[' elements? ']'
        members: pair (',' pair)*
        elements: value (',' value)*
        pair: string ':' value
        string: '"' CONTENT '"'
        CONTENT: /[^"]*/
        number: /-?\d+(\.\d+)?/
    """
    ```