[//]: Copyright (c) 2017-2026 Juancarlo Añez (apalala@gmail.com) [//]: SPDX-License-Identifier: BSD-4-Clause
v5.19.0 Refactoring and Modernization
Grammar Syntax
The `$->` (EOL) expression was introduced in the grammar language to match and consume the whitespace up to and including the next line break, using the Python semantics of `os.linesep`. The match interprets whitespace using the Python definition as implemented by `str.isspace()`, so beware when a particular definition of whitespace is part of the language to parse.

The `@nostack` decorator for rules was added to the grammar. The setting hints to the tracer and error handler that the rule should not be part of the call stack. It is useful for avoiding noise in traces when low-level rules (like those for qualified or attributed identifiers) form their own small hierarchy.

The file extension for TatSu grammars is now `.ebnf`. The grammar language is, after all, an extension of the best-known forms of EBNF syntax. Syntax highlighters may recognize the extension better than the previous `.tatsu`.
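The warning about `str.isspace()` can be made concrete with a small sketch. The function `skip_to_eol` below is a hypothetical helper, not TatSu's implementation; it only mimics the described behavior of consuming whitespace up to and including the next line break. Treating end of input without a line break as a failure is an assumption made for the example.

```python
from typing import Optional


def skip_to_eol(text: str, pos: int) -> Optional[int]:
    """Return the position just past the next line break, or None.

    Hypothetical helper (not part of TatSu): it succeeds only when every
    character between `pos` and the line break is whitespace according
    to str.isspace().
    """
    i = pos
    while i < len(text):
        ch = text[i]
        if ch == "\n":
            return i + 1  # consume the line break itself
        if not ch.isspace():
            return None  # non-whitespace found before the line break
        i += 1  # "\r", tabs, form feeds, etc. all count as whitespace
    return None  # assumption: no line break before end of input is a failure


# str.isspace() accepts more than space and tab:
print("\u00a0".isspace(), "\x0b".isspace(), "\x0c".isspace())
```

A language in which a form feed or a no-break space is significant would therefore see those characters silently consumed by a match of this kind.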
Command Line
The CLI tool now has a `--json` option to produce the JSON version of the model for a grammar. Re-importing a JSON model is not yet implemented in TatSu, but TieXiu uses these models successfully as a fast way to import a TatSu grammar model.
Generated Parsers
The benchmark in `tatsu.tool.bench` was run over several large grammars and large input sets to evaluate parser strategies. The result is a 1.3x performance advantage for generating a Python program over using the in-memory model of the parsed TatSu grammar for parsing. In tests with complex projects (Java) the performance difference is not perceptible. The codspeed benchmark that runs with the unit tests on GitHub doesn’t see the performance difference either.

TatSu now bootstraps with a module that loads its own grammar model as the main parser (the one used by `tatsu.compile()`). The previous kind of parser can still be generated with `tatsu.to_python_sourcecode()`, which remains well tested in several unit tests. The new model-based kind of parser can be generated with `tatsu.to_parsermodel_sourcecode()`.

Note that you don’t need to generate any source code for a parser in your own projects. TatSu does generate a module to make it faster to bootstrap a parser from its own grammar. In your projects you can run the usual steps to have a performant parser:
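```python
import tatsu

grammartext = ...  # the text of your grammar
text = ...  # the input to parse

# Compile the grammar into an in-memory model and parse with it.
model = tatsu.compile(grammartext, asmodel=True)
output = model.parse(text)
```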
Generating a module with classes for the type definitions in the grammar is still useful.
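```python
from pathlib import Path

import tatsu

grammartext = ...  # the text of your grammar

# Generate the source for the model classes and save it as a module.
sourcecode = tatsu.to_python_model(grammartext)
Path('./modelclasses.py').write_text(sourcecode)
```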
Optimizations in the parser logic produce parsing speeds comparable to those of TatSu v5.16 with any parsing strategy (model or generated code).
The old parser and model generator modules in `tatsu.codegen` have been deleted. Using pyrefly revealed that both were incorrect and non-working. They became defunct because they lacked unit tests and had gone unused since `tatsu.ngcodegen` was introduced several years ago. The helper modules `codegen.cgbase` and `codegen.rendering` remain in case any old projects use them for their own code generation.
Implementation
A new `@statescope` context manager takes care of handling the state stack in most cases.

Lookaheads are always memoized. The configuration settings for disabling lookahead memoization have been deprecated and disabled.
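As an illustration of the general pattern only, a context manager can keep a state stack balanced by pushing on entry and popping on exit. The classes `State` and `Context` and their fields below are hypothetical and are not taken from TatSu's source; what TatSu's `@statescope` does with the popped state is not specified here.

```python
from contextlib import contextmanager
from dataclasses import dataclass, replace
from typing import Iterator


@dataclass
class State:
    """Hypothetical parsing state; the field names are illustrative."""
    pos: int = 0
    cut: bool = False


class Context:
    """Hypothetical parse context that owns a stack of states."""

    def __init__(self) -> None:
        self.states: list[State] = [State()]

    @property
    def state(self) -> State:
        return self.states[-1]

    @contextmanager
    def statescope(self) -> Iterator[State]:
        # Enter: push a copy of the current state for the scope to mutate.
        self.states.append(replace(self.state))
        try:
            yield self.state
        finally:
            # Exit: pop the scope's state, even if the body raised.
            self.states.pop()


ctx = Context()
with ctx.statescope() as scoped:
    scoped.cut = True  # visible only inside the scope
```

The point of the design is that callers cannot forget the matching pop: the `finally` clause restores the stack depth on both normal exit and failure.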
A new `ParserConfig.perlinememos: float` configuration sets a `(perlinememos * linecount)` bound on the total number of memoization entries allowed on each parse.

Introduced `objectmodel.ctx.CanParse(Protocol)`, which defines the `parse()` method as the entry point to parsing.

An important refactoring got rid of the legacy names “tokenizing” and “tokenizer”, which didn’t abide by the theory and practice of parsing. The names are now `tatsu.input`, `tatsu.input.text`, and `tatsu.input.text.Text`. The old names are still available as legacy for backwards compatibility.

Rule includes (`RuleInclude`) kept an actual copy of the included rule in the model. To preserve consistent semantics, the only mentions of `Rule` in a model are now at the top level, in `Grammar.rules` and `Grammar.rulemap`.

Grammar models that haven’t been compiled from a grammar but instead loaded from the JSON or Python representations need to be analyzed for left recursion and for resolution of `Call` and `RuleInclude` nodes. A new `Grammar.analyzed: bool` attribute was added to quickly check if a grammar model from any source has already been analyzed. The markers for left recursion are persisted to the Python model and JSON representations, but cross-reference resolution of rules must be performed before parsing.
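The `(perlinememos * linecount)` bound is simple arithmetic, sketched below. The helper `memo_bound` and the class `BoundedMemo` are hypothetical; how TatSu counts lines and what it does once the bound is reached are assumptions made for the example (here, new entries are simply not stored).

```python
def memo_bound(text: str, perlinememos: float) -> int:
    """Upper bound on memo entries: perlinememos * linecount."""
    linecount = len(text.splitlines())  # assumption: how lines are counted
    return int(perlinememos * linecount)


class BoundedMemo:
    """Hypothetical memo table that stops accepting entries at the bound."""

    def __init__(self, bound: int) -> None:
        self.bound = bound
        self.entries: dict = {}

    def store(self, key, value) -> bool:
        # Refuse new keys once the budget is spent; updates are still allowed.
        if key not in self.entries and len(self.entries) >= self.bound:
            return False
        self.entries[key] = value
        return True


source = "a\nb\nc\n"
memo = BoundedMemo(memo_bound(source, perlinememos=2.5))  # 3 lines * 2.5
```

Tying the bound to the number of input lines keeps memory use proportional to input size regardless of how many rules the grammar has.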
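A protocol such as `CanParse` lets any object with a suitable `parse()` method serve as a parsing entry point without inheriting from a common base class. The sketch below shows the idea with `typing.Protocol`; the `parse()` signature is an assumption, and the real definition in `objectmodel.ctx` may differ. `UpperParser` and `run` are hypothetical names.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class CanParse(Protocol):
    # Assumed signature; the real protocol in TatSu may differ.
    def parse(self, text: str, **kwargs: Any) -> Any: ...


class UpperParser:
    """A toy 'parser' that satisfies the protocol structurally."""

    def parse(self, text: str, **kwargs: Any) -> Any:
        return text.upper()


def run(parser: CanParse, text: str) -> Any:
    # Accepts anything that provides a parse() method.
    return parser.parse(text)


result = run(UpperParser(), "abc")
```

Note that `isinstance()` checks against a runtime-checkable protocol only verify that the method exists, not that its signature matches.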
Deprecations
Support for `#include` in grammars has been dropped. It was always a bad idea. Text-to-text preprocessing doesn’t belong in the grammar, in part because it doesn’t apply to input sources that are not text, like those of tokenizers or streams. The class `tatsu.input.buffer.Buffer` still has all the infrastructure for supporting C-style or COBOL-style textual includes, and its definition of `BufferCursor` honors it. `Buffer` keeps track of which file was the source of each line of input, something essential for good error reporting. During compilation of grammar text to a `Grammar` object, the grammar text is the parser’s input, so the `Cursor` semantics regarding the parsing still apply.

The `g2e` example in `./examples/g2e` was removed. The example had become irrelevant now that the new PEG parser in Python uses a pegen-style grammar for the language that is fewer than 1000 lines long. The TatSu grammar for ANTLR in `./examples/g2e/antlr.tatsu` can still parse ANTLR grammars, but there’s no test case for it. The semantics in `g2e.semantics.ANTLRSemantics` try to do everything in a single pass (like substituting simple TOKEN rules by their value), when transformation of the parsed input grammar model should be more stable and easier to understand with a simpler approach.

There’s no longer a separate stack for the state of `cut`. The state of `cut` is kept in the general state stack.
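The remark about `Buffer` tracking the source file of each input line can be illustrated with a minimal sketch of textual include expansion. This is not TatSu's `Buffer`; `LineInfo`, `expand_includes`, the in-memory `files` mapping, and the directive syntax are all hypothetical, and the sketch does no cycle detection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineInfo:
    filename: str  # file the line originally came from
    lineno: int    # 1-based line number within that file
    text: str


def expand_includes(
    name: str,
    files: dict[str, str],
    directive: str = "#include",
) -> list[LineInfo]:
    """Flatten `name`, replacing include lines with the included file's lines."""
    lines: list[LineInfo] = []
    for lineno, text in enumerate(files[name].splitlines(), start=1):
        stripped = text.strip()
        if stripped.startswith(directive):
            # Recurse into the included file; its lines keep their own origin.
            included = stripped[len(directive):].strip().strip('"')
            lines.extend(expand_includes(included, files, directive))
        else:
            lines.append(LineInfo(name, lineno, text))
    return lines


files = {
    "main.txt": 'a\n#include "inc.txt"\nb\n',
    "inc.txt": "x\ny\n",
}
flat = expand_includes("main.txt", files)
```

Because each flattened line carries its original file name and line number, an error found at any position in the combined input can be reported against the file the user actually edited.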