In computer science, scannerless parsing (also called lexerless parsing) performs tokenization (breaking a stream of characters into words) and parsing (arranging the words into phrases) in a single step, rather than breaking it up into a pipeline of a lexer followed by a parser, executing concurrently. A language grammar is scannerless if it uses a single formalism to express both the lexical (word level) and phrase level structure of the language.
Dividing processing into a lexer followed by a parser is more modular; scannerless parsing is primarily used when a clear lexer–parser distinction is unneeded or unwanted. Examples of when this is appropriate include TeX, most wiki grammars, s, simple application-specific scripting languages, and Raku.
Only one metalanguage is needed
Non-regular lexical structure is handled easily
"Token classification" is unneeded which removes the need for design accommodations such as "the lexer hack" and language reserved words (such as "while" in C)
Grammars can be compositional (can be merged without human intervention)
Since the lexical scanning and syntactic parsing are combined, the resulting parser tends to be more complicated and thus harder to understand and debug. The same will hold for the associated grammar, if a grammar is used to generate the parser.
The resulting parser tends to be significantly less efficient than a lexer-parser pipeline with regard to both time and memory.
SGLR is a parser for the modular Syntax Definition Formalism (SDF), and is part of the ASF+SDF Meta-Environment and the Stratego/XT program transformation system.
JSGLR, a pure Java implementation of SGLR, also based on SDF.
TXL supports character-level parsing.
dparser generates ANSI C code for scannerless GLR parsers.
Spirit allows for both scannerless and scanner-based parsing.
SBP is a scannerless parser for boolean grammars (a superset of context-free grammars), written in Java.
Laja is a two-phase scannerless parser generator with support for mapping the grammar rules into objects, written in Java.
This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.
In computer science, scannerless parsing (also called lexerless parsing) performs tokenization (breaking a stream of characters into words) and parsing (arranging the words into phrases) in a single step, rather than breaking it up into a pipeline of a lexer followed by a parser, executing concurrently. A language grammar is scannerless if it uses a single formalism to express both the lexical (word level) and phrase level structure of the language.
In computer programming, the lexer hack is a common solution to the problems in parsing ANSI C, due to the reference grammar being context-sensitive. In C, classifying a sequence of characters as a variable name or a type name requires contextual information of the phrase structure, which prevents one from having a context-free lexer.
Lexical tokenization is conversion of a text into (semantically or syntactically) meaningful lexical tokens belonging to categories defined by a "lexer" program. In case of a natural language, those categories include nouns, verbs, adjectives, punctuations etc. In case of a programming language, the categories include identifiers, operators, grouping symbols and data types. Lexical tokenization is not the same process as the probabilistic tokenization, used for large language model's data preprocessing, that encode text into numerical tokens, using byte pair encoding.
This course teaches the basic techniques, methodologies, and practical skills required to draw meaningful insights from a variety of data, with the help of the most acclaimed software tools in the dat
We teach the fundamental aspects of analyzing and interpreting computer languages, including the techniques to build compilers. You will build a working compiler from an elegant functional language in