Parsing a single quote-delimited string with two sequential single quotes inside-CodePudding

I am writing an (f)lex/bison/C parser for a language in which strings are enclosed in single quotation marks. A quotation mark within a string is duplicated. E.g., the following is a valid string: 'Hello, ''world''!'. I am not in a position to change the language syntax.

I was able to tokenize strings in (f)lex by using start conditions:

%x STRING
%%
' BEGIN(STRING);
<STRING>' BEGIN(INITIAL);
<STRING>'' { /* return a QUOTE token */ }
<STRING>[^']  { /* return a STRING_PART token */ }
%%

and delegate the final string assembly to the Bison parser. However, this does not look right to me. So, the question:

Is there a way to tokenize a string using (f)lex alone?

CodePudding user response：

I think you can simplify it to:

Char        [^']|\'\'
STRING      \'{Char}*\'

%%

{STRING}    return 280;

%%

Explanation:

Char        [^']|\'\'

A Char is either not a quote [^'] or a double quote \'\'.

STRING      \'{Char}*\'

A STRING is then a 0 or more {Char} inside quotes.

%{
#include <stdio.h>
%}
%option noyywrap

Char        [^']|\'\'
STRING      \'{Char}*\'
Space       [ \t]

%%

{STRING}    return 280;
{Space}     /* ignore */
.           return 290;

%%

int main()
{
    int l;
    while ((l = yylex()) != 0) {
        fprintf(stdout, "C: %i\n", l);
        if (l == 280) {
            int loop;
            fprintf(stdout, "L: >");
            for (loop = 0; loop < yyleng;   loop) {
                fprintf(stdout, "%c", yytext[loop]);
            }
            fprintf(stdout, "<\n");
        }
    }

    return 0;
}

Checking:

> flex string.l;gcc *.c;./a.out
'plop'   'this is a test'  'Stuff in the middle '' '
C: 280
L: >'plop'<
C: 280
L: >'this is a test'<
C: 280
L: >'Stuff in the middle '' '<

CodePudding user response：

You can certainly reduce this to a simple flex pattern, as indicated in your comment to another answer:

'([^']|'')*'  { yylval.str = strndup(yytext 1, yyleng-2); return QUOTE; }

But that doesn't transform the internal sequential quotes into single quotes, which probably means that you're going to have to rescan that string. That's a bit of a waste of cycles, particularly if you consider that most strings don't include sequential quotes.

So one possibility is to use two patterns, one for simple strings and one for strings which need reprocessing:

'[^']*'       { yylval.str = strndup(yytext 1, yyleng-2); return QUOTE; }
'([^']|'')*'  { yylval.str = fix_quotes(yytext 1, yyleng-2); return QUOTE; }

That still duplicates the scan of literals with doubled single quotes, but those should be relatively rare.

None of this deals gracefully with unterminated string literals, an important aspect of lexing which is often missed. If the input includes an unterminated string literal, the scan will continue up to the end of the program input, reading the entire input into the token buffer. Then it will fail, causing the lexer to back up, to the very beginning of the string if it doesn't include a doubled quote, or to the last doubled quote. Either way, the parse is unlikely to recover, since it will be resuming inside a string literal. If you don't allow strings to contain newline characters, you can limit the damage by using [^'\n] instead of ´[^']`. Alternatively (or additionally), you can avoid backing up into the middle of an unterminated string by adding an error pattern:

'[^']*'       { yylval.str = strndup(yytext 1, yyleng-2); return QUOTE; }
'([^']|'')*'  { yylval.str = fix_quotes(yytext 1, yyleng-2); return QUOTE; }
'([^']|'')*   { return UNTERMINATED_QUOTE; }

That sequence of patterns is ordered according to the Flex matching rules. Both the first and the second pattern will match a complete literal without doubled quotes; it's important that the first one is chosen, so it must come first. Also, the third pattern will match if the second one does, but it will match one fewer character, so the second pattern will still win. Consequently, the third pattern can only trigger if the second pattern doesn't, which can only happen if the string is unterminated.

It's also quite possible that you will want to do other transformations inside the string, like converting backslash-escapes to control characters. In that case, it's quite possible that your original style, with a start condition, turns out to be the most readable. As an additional bonus, it will completely avoid rescans, so it might also be the most efficient.