Concept

Lexical analysis

Summary
Lexical tokenization is conversion of a text into (semantically or syntactically) meaningful lexical tokens belonging to categories defined by a "lexer" program. In case of a natural language, those categories include nouns, verbs, adjectives, punctuations etc. In case of a programming language, the categories include identifiers, operators, grouping symbols and data types. Lexical tokenization is not the same process as the probabilistic tokenization, used for large language model's data preprocessing, that encode text into numerical tokens, using byte pair encoding. Rule-based programs A rule-based program, performing lexical tokenization, is called tokenizer, or scanner, although scanner is also a term for the first stage of a lexer. A lexer forms the first phase of a compiler frontend in processing. Analysis generally occurs in one pass. Lexers and parsers are most often used for compilers, but can be used for other computer language tools, such as prettyprinters or lin
About this result
This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.
Related publications

Loading

Related people

Loading

Related units

Loading

Related concepts

Loading

Related courses

Loading

Related lectures

Loading