Backing up is costly in flex?

This is an excerpt from the generated c file in flex (modified / reformatted a little for readability).

yy_find_action:
        yy_act = yy_current_state[-1].yy_nxt;

        /* YY_DO_BEFORE_ACTION */
        yyg->yytext_ptr = yy_bp;
        yyg->yytext_ptr -= yyg->yy_more_len;
        yyleng = (int) (yy_cp - yyg->yytext_ptr);
        yyg->yy_hold_char = *yy_cp;
        *yy_cp = '\0';
        yyg->yy_c_buf_p = yy_cp;

        // this condition will always be false when yy_act = 0
        if ( yy_act != YY_END_OF_BUFFER && yy_rule_can_match_eol[yy_act] )
            {
            ...
            }

do_action:
        switch ( yy_act )
            {
            case 0: /* must back up */

            /* undo the effects of YY_DO_BEFORE_ACTION */
            *yy_cp = yyg->yy_hold_char;
            yy_cp = yyg->yy_last_accepting_cpos + 1;
            yy_current_state = yyg->yy_last_accepting_state;
            goto yy_find_action;

            ...
            }

In the above code, flex first fetches the action (yy_act), then sets up for executing it (YY_DO_BEFORE_ACTION), then checks whether the action is 'back up' (case 0), and if so undoes and later redoes all of that setup.

If you look at the flex documentation on performance, backing up has the third-highest performance impact on the generated scanner. Wouldn't performance improve if the check for backup were done before the setup? Why does the generated code do it this way? I can't see any benefit.

CodePudding user response:

Flex is written around the principle, "optimise for the common case". That sometimes leads to deliberately pessimising uncommon cases in a tradeoff, because the small gain on uncommon cases far outweighs the cost of a situation which rarely occurs.

Most lexers have one or two states which can require backup, but backup itself is actually pretty rare. For example, C requires backup when the input contains `..` not followed by another `.`, which practically never occurs in a real program. The most common (or least uncommon) situation in which backup is necessary is when the buffer needs to be refilled in the middle of a token, something that happens once every 16K of input; that might be 1000 tokens or so.

So which costs more: a thousand failed tests, or the dozen or so instructions needed to terminate a token and then undo that work? Modern branch prediction probably reduces the cost of the failed tests considerably (much more than was typical when that code was written), but even so it doesn't seem like it would be a win. The alternative of including the token finalization in every other action might be faster, but it significantly increases the size of the code, which also has an execution cost.

Also, saying that "backing up has 3rd highest performance impact on the generated scanner" is a little misleading. The first two items on the list have a large performance impact; the impact of the remaining items is negligible.

The largest cost for scanners which might back up is that they must save the current state and cursor at every character which might be the end of a token; that cost is paid whether or not the current token eventually requires backup. So it has little to do with the code you pasted, which runs only once per token. In other words, the cost is not the backup itself, but rather the bookkeeping needed to make backing up possible.

All the same, you should try to avoid writing rules which can produce long backups (including rules with unbounded trailing context), because that can result in excess scanning. In pathological cases it can result in quadratic scanning time, although that's unlikely to show up in a practical grammar. (An example would be the pair of rules `-*:` and `-`, which will take quadratic time to analyse an input consisting of a large number of `-` characters followed by something other than a colon.)

A more common case of unbounded backup is a rule such as `["]([^"\\\n]|\\(.|\n))*["]`, which matches a quoted string literal. Given an unterminated string literal, this will cause a backup to the initial `"`, which, aside from the possible inefficiency, really is not what you want. Better to add another rule like `["]([^"\\\n]|\\(.|\n))*\\?`, which will match unterminated literals without backup.
