In computer science, lexical analysis, lexing, or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, though scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.
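For example, the character sequence "net = gross - 42" can be converted into a sequence of classified tokens. The following minimal sketch illustrates this; the token classes, patterns, and names are illustrative assumptions, not drawn from any particular language definition:

```python
import re

# Illustrative token classes and patterns (assumptions for this sketch).
TOKEN_SPEC = [
    ("NUMBER",     r"\d+"),          # integer literals
    ("IDENTIFIER", r"[A-Za-z_]\w*"), # names
    ("OPERATOR",   r"[+\-*/=]"),     # single-character operators
    ("SKIP",       r"\s+"),          # whitespace, discarded
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Convert a sequence of characters into a sequence of (class, lexeme) tokens."""
    for match in MASTER_RE.finditer(text):
        if match.lastgroup != "SKIP":
            yield (match.lastgroup, match.group())

print(list(tokenize("net = gross - 42")))
# [('IDENTIFIER', 'net'), ('OPERATOR', '='), ('IDENTIFIER', 'gross'),
#  ('OPERATOR', '-'), ('NUMBER', '42')]
```

Each token pairs a class (its identified meaning) with the matched string, which is the sense in which tokens are "strings with an assigned and thus identified meaning".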
Applications
A lexer forms the first phase of a compiler frontend in modern processing. Analysis generally occurs in one pass.
In older languages such as ALGOL, the initial stage was instead line reconstruction, which performed unstropping and removed whitespace and comments (and had scannerless parsers, with no separate lexer). These steps are now done as part of the lexer.
Lexers and parsers are most often used for compilers but can be used for other computer language tools, such as prettyprinters or linters. Lexing can be divided into two stages: scanning, which segments the input string into syntactic units called lexemes and categorizes these into token classes; and evaluating, which converts lexemes into processed values.
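A minimal sketch of these two stages follows; the token classes and the names scan and evaluate are assumptions for illustration, not standard APIs:

```python
import re

# Scanning stage: segment the input into lexemes tagged with token classes.
# The classes and patterns here are illustrative assumptions.
SCANNER = re.compile(r"(?P<NUMBER>\d+)|(?P<IDENTIFIER>[A-Za-z_]\w*)|(?P<SKIP>\s+)")

def scan(text):
    """Scanning: yield (token class, lexeme) pairs, discarding whitespace."""
    for m in SCANNER.finditer(text):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

def evaluate(token_class, lexeme):
    """Evaluating: convert a lexeme into a processed value."""
    return int(lexeme) if token_class == "NUMBER" else lexeme  # e.g. "42" -> 42

tokens = [(cls, evaluate(cls, lexeme)) for cls, lexeme in scan("x 42 y")]
print(tokens)  # [('IDENTIFIER', 'x'), ('NUMBER', 42), ('IDENTIFIER', 'y')]
```

Keeping the stages separate means the scanner only matches character patterns, while the evaluator decides what value each lexeme denotes (for instance, converting a numeric lexeme to an integer while leaving identifiers as plain strings).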