How would you explain to someone who knows a decent amount of C what tokenizing is in a lexer? ELI5 — Explain Like I'm 5 — I'm having trouble grasping the concept because most explanations are really complicated to me.
CodePudding user response:
Tokenizing is breaking up a string into chunks.
The lexical properties of the chunks are defined by the lexer rules. Such a chunk is like a word in a sentence.
Lexer rules are simply regular expressions; I assume you know what those are.
(Advanced feature: A lexer can be put in a certain state, after seeing a certain input. These states can determine which rules are enabled/disabled.)
Each lexer rule gets an identifier called a token, typically an integer.
So the output of the lexer is a stream of tokens (integers) that represent the regular expressions it has seen in the input string.
This output also allows seeing what part of the input string was recognized for each token.
Lexer rules can overlap. Simple precedence logic decides which rule is matched. Internally, a generated lexer (such as one produced by `lex` or `flex`) uses a state machine to keep track of all this.
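Since you know C, a hand-written sketch may make this concrete. The function below plays the role of a lexer with three rules (a run of digits, a run of letters, anything else), returning an integer token code and copying out the matched chunk of text. All names here are made up for illustration; a tool like flex would generate a table-driven version of the same idea rather than code like this.

```c
#include <ctype.h>
#include <stdio.h>

enum { TOK_EOF, TOK_NUMBER, TOK_WORD, TOK_OTHER };   /* token codes */

/* Read the next chunk from fp, copy its text into 'text' (capacity 'cap',
   assumed >= 2), and return which rule it matched. */
int next_token(FILE *fp, char text[], size_t cap)
{
    int c = getc(fp);
    while (c != EOF && isspace(c))     /* skip whitespace between chunks */
        c = getc(fp);
    if (c == EOF) {
        text[0] = '\0';
        return TOK_EOF;
    }

    size_t n = 0;
    if (isdigit(c)) {                  /* rule: [0-9]+    -> TOK_NUMBER */
        do {
            if (n + 1 < cap) text[n++] = (char)c;
            c = getc(fp);
        } while (c != EOF && isdigit(c));
        ungetc(c, fp);                 /* read one too far: push it back */
        text[n] = '\0';
        return TOK_NUMBER;
    }
    if (isalpha(c)) {                  /* rule: [a-zA-Z]+ -> TOK_WORD */
        do {
            if (n + 1 < cap) text[n++] = (char)c;
            c = getc(fp);
        } while (c != EOF && isalpha(c));
        ungetc(c, fp);
        text[n] = '\0';
        return TOK_WORD;
    }
    text[0] = (char)c;                 /* rule: any other single character */
    text[1] = '\0';
    return TOK_OTHER;
}
```

Called in a loop on the input `x = 123;`, it would produce the token stream TOK_WORD("x"), TOK_OTHER("="), TOK_NUMBER("123"), TOK_OTHER(";"), TOK_EOF.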
(A next step would be feeding these tokens to a parser which has grammatical rules that define what order of tokens form valid input. But that's a whole other story.)
CodePudding user response:
A typical lexical analyzer has some enumeration of values that represent tokens. For example:
- 0 means `auto`.
- 1 means `break`.
- 2 means `case`.
- 3 means `char`.
- …
- 37 means identifier.
- 38 means string literal.
- 39 means `int` constant.
- 40 means `double` constant.
- 41 means `(`.
- 42 means `)`.
- …
When the lexical analyzer is reading a file, it examines the incoming characters and recognizes tokens as they appear. When it sees `char`, it stores the value 3 as a token. When it sees an identifier, such as `foo`, it stores the value 37 as a token and also stores the string “foo” with it.
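In C, that enumeration might look roughly like the sketch below; the specific names and numbers just mirror the example values above, and a real compiler would choose its own.

```c
/* Hypothetical token codes, mirroring the numbering in the list above. */
enum token_kind {
    TOK_AUTO   = 0,          /* the keyword "auto"  */
    TOK_BREAK  = 1,          /* the keyword "break" */
    TOK_CASE   = 2,          /* the keyword "case"  */
    TOK_CHAR   = 3,          /* the keyword "char"  */
    /* ... */
    TOK_IDENTIFIER = 37,     /* e.g. foo (the text "foo" is kept alongside) */
    TOK_STRINGLIT  = 38,     /* string literal   */
    TOK_INTCONST   = 39,     /* int constant     */
    TOK_DBLCONST   = 40,     /* double constant  */
    TOK_LPAREN     = 41,     /* (                */
    TOK_RPAREN     = 42      /* )                */
    /* ... */
};
```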
Further, all the rules the language has for what forms a program may take are encoded in terms of these token values. Those rules may have been written in the languages that the `lex` and `yacc` tools accept, and those tools build a program that processes the tokens according to those rules.
In practice, lexical analyzers in compilers are more complicated, because they cannot retain just the token value. To produce useful error messages, they also have to keep information about where the token appeared in the source code, so that the relevant line of source code can be displayed with the error message. And there are other complications.
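Because of that, a token is often passed around as a small structure rather than a bare integer. A plausible shape, with made-up field names, is:

```c
/* One possible token record; the field names are illustrative. */
struct token {
    int         kind;    /* a token code, such as the values above  */
    char       *text;    /* the matched characters, e.g. "foo"      */
    const char *file;    /* source file name, for error messages    */
    int         line;    /* 1-based line number                     */
    int         column;  /* 1-based column number                   */
};
```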
Ultimately, tokenizing is as much a general idea as a compiler-specific practice: it means managing source code as chunks of text rather than as individual characters. Any software that groups the text it reads into chunks can be said to be tokenizing its input.
CodePudding user response:
As anyone who has tried to write a C program to parse some input probably knows, doing it while reading things one character at a time can be tedious and error-prone.
There are of course many ways of reading input. You can read it a character at a time by calling `getc` or `getchar`. You can read it a line at a time by calling `fgets`. If your input consists of whitespace-separated "words", you can read it a word at a time using `scanf("%s")`.
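For example, the `scanf("%s")` approach reads one whitespace-separated word per call (the field width below is only there to keep a long word from overflowing the buffer):

```c
#include <stdio.h>

int main(void)
{
    char word[64];

    /* "%63s" skips leading whitespace, then reads one whitespace-separated
       word, leaving room for the terminating '\0'. */
    while (scanf("%63s", word) == 1)
        printf("word: %s\n", word);

    return 0;
}
```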
It's been discovered that a super good way of divvying up the work when writing a parser is to arrange things so that you can read your input one token at a time. Your parser can say, "give me the next token", and it's the lexical analyzer's job, as a separate "black box" module, to scan the input and partition it into those tokens so that it can pass them to the parser on demand.
This way, if the parser is working on, say, parsing an assignment statement that looks like
<identifier> = <integer-constant> ;
the parser can grab tokens and see if it gets an identifier token, followed by an `=` token, followed by an integer constant token, followed by a semicolon token. But the point is that the parser does not need to get bogged down and distracted by the problem of scanning the individual alphanumeric characters that make up an identifier, or the individual digits that make up an integer constant.
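In code, that might look something like the sketch below, where `next_token()` and the token names are assumed to be supplied by the lexical analyzer:

```c
/* Assumed to be provided by the lexical analyzer (hypothetical definitions). */
enum { IDENTIFIER, EQUALS, INTCONSTANT, SEMICOLON /* ... */ };
int next_token(void);                /* returns the code of the next token */

/* Sketch: recognize "<identifier> = <integer-constant> ;" one token at a time. */
int parse_assignment(void)
{
    if (next_token() != IDENTIFIER)  return 0;   /* expected a name    */
    if (next_token() != EQUALS)      return 0;   /* expected '='       */
    if (next_token() != INTCONSTANT) return 0;   /* expected a number  */
    if (next_token() != SEMICOLON)   return 0;   /* expected ';'       */
    return 1;                                    /* it's an assignment */
}
```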
Tokenizing can also take care of multi-character operators, like C's `==`, `-=`, `--`, etc.
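Recognizing those usually takes one character of lookahead: after reading a `-`, peek at the next character and push it back if it doesn't extend the operator. A rough sketch:

```c
#include <stdio.h>

/* Hypothetical token codes for '-', '--', and '-='. */
enum { TOK_MINUS, TOK_MINUSMINUS, TOK_MINUSEQUALS };

/* Called after a '-' has been read: decide whether it is '-', '--', or '-='. */
int minus_token(FILE *fp)
{
    int c = getc(fp);
    if (c == '-') return TOK_MINUSMINUS;
    if (c == '=') return TOK_MINUSEQUALS;
    ungetc(c, fp);                 /* not part of the operator: put it back */
    return TOK_MINUS;
}
```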
The tokens which a lexical analyzer can return are typically:
- all the language's operators
- all the language's grouping and separator characters, like `[` `]` `{` `}` `,` `;`
- all the language's keywords, like `int`, `for`, `while`
- identifiers
- constants (integer, floating-point, character, string, etc.)
The lexical analyzer typically also takes care of skipping over whitespace and comments.
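That skipping is usually a small loop at the top of the lexer's "get the next token" function. A sketch that handles whitespace and C's `/* ... */` comments (a real lexer would also handle `//` comments) might look like:

```c
#include <ctype.h>
#include <stdio.h>

/* Skip whitespace and block comments; return the first character of the
   next real token (or EOF). */
int skip_blanks_and_comments(FILE *fp)
{
    int c = getc(fp);
    for (;;) {
        while (c != EOF && isspace(c))      /* skip whitespace */
            c = getc(fp);
        if (c != '/')
            return c;                       /* start of a real token, or EOF */
        int d = getc(fp);
        if (d != '*') {                     /* just a '/' operator */
            ungetc(d, fp);
            return '/';
        }
        int prev = 0;                       /* inside a comment: scan for the
                                               closing star-slash sequence */
        while ((c = getc(fp)) != EOF) {
            if (prev == '*' && c == '/')
                break;
            prev = c;
        }
        c = getc(fp);                       /* continue after the comment */
    }
}
```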
When you're parsing input a character at a time using `getc`, sometimes you read too far, and it's convenient to "back up" using `ungetc`. Similarly, depending on how it works, your parser may sometimes read one token too far, so your lexical analyzer may need a way of "ungetting" one or more previously-read tokens.
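If the tokens are just ints, one common way to provide that is a one-slot pushback buffer, directly analogous to `ungetc`. The names below are made up for illustration:

```c
/* A one-token pushback buffer, analogous to ungetc(). */
static int have_pushback = 0;    /* is a token waiting to be re-delivered? */
static int pushback_token;       /* the token that was "ungotten"          */

int read_token(void);            /* assumed: actually scans the input      */

int next_token(void)
{
    if (have_pushback) {
        have_pushback = 0;
        return pushback_token;   /* hand back the saved token first */
    }
    return read_token();
}

void unget_token(int tok)
{
    pushback_token = tok;        /* remember it for the next call */
    have_pushback = 1;
}
```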
Typically, each one of these token types is represented by a `#define` or an `enum` constant. So, in source code, the tokens are typically things like `EQUALS` and `INT` and `IDENTIFIER` and `INTCONSTANT` (for the `=` operator, the keyword `int`, an identifier, or an integer constant, respectively).
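For instance, the `#define` style might look like this (the values are arbitrary, as long as each token type gets a distinct one):

```c
/* Token codes as preprocessor constants; the values are arbitrary. */
#define EQUALS       1    /* the '=' operator           */
#define INT          2    /* the keyword "int"          */
#define IDENTIFIER   3    /* names like main, foo, i    */
#define INTCONSTANT  4    /* integer constants like 123 */
#define SEMICOLON    5    /* ';'                        */
/* ...and so on, one code per token type */
```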
When the token that was just read is an identifier or a constant, there's an additional wrinkle, in that there must be some way of returning extra information from the lexer. That is, if the lexer has just read an `=` operator or an `int` keyword, it can simply return the token value `EQUALS` or `INT`. But if it has just read the identifier `main`, it has to have some way of returning the pair ("tuple") (`IDENTIFIER`, "main"). And if it has just read the integer constant `123`, it has to have some way of returning the pair (`INTCONSTANT`, "123").