Multi line comment flex lex

Time:03-04

I'm trying to match a multi-line comment with these conditions:

  • Starts with ''' and finishes with '''
  • Can't contain a run of exactly three ' inside. Examples:
    ''''''            Correct  
    ''''''''          Correct  
    '''a'a''a''''a''' Correct  
    '''''''''         Incorrect  
    '''a'''a'''       Incorrect
    

This is my approximation, but I can't work out the correct expression:

'''([^']|'[^']|''[^']|'''' [^'])*''' 

CodePudding user response:

The easy solution is to use a start condition. (Note that this doesn't pass all of your test cases, because I think the problem description is ambiguous. See below.)

In the following, I assume that you want to return the matched token, and that you are using a yacc/bison-generated parser whose union type includes a char* str member. The start-condition block is a Flex extension; in the unlikely event that you're using some other lex derivative, you'll need to write out the patterns one per line, each with the <SC_TRIPLE_QUOTE> prefix (and no space between the prefix and the pattern).

%x SC_TRIPLE_QUOTE

%%

'''        { BEGIN(SC_TRIPLE_QUOTE); }

<SC_TRIPLE_QUOTE>{
  '''      { yylval.str = strndup(yytext, yyleng - 3);
             BEGIN(INITIAL);
             return STRING_LITERAL;
           }
  [^']+    |
  ''?/[^'] |
  ''''+    { yymore(); }
  <<EOF>>  { yyerror("Unterminated triple-quoted string");
             return 0;
           }
}

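I hope that's more or less self-explanatory. The four patterns inside the start condition match, respectively, the ''' terminator, any run of characters other than ', one or two ' followed by some other character, and a run of four or more '. yymore() makes each of those matches accumulate in yytext, so when the terminator is finally matched, yytext holds the whole body followed by the closing '''. The call to strndup then drops that closing delimiter.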
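To make the matching rules concrete without running flex, here is a minimal sketch in plain C that hand-simulates the four rules as described above (terminator, run of non-quotes, one or two quotes, four or more quotes). The function name scan_triple_quoted and its return convention are made up for illustration; they are not part of flex or of the scanner above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a flex API): simulates the start-condition rules
 * on a NUL-terminated buffer.
 *
 * Returns the length of the token, both ''' delimiters included, if `s`
 * begins with a complete triple-quoted token. Returns 0 if `s` does not
 * begin with ''' or if the token is unterminated. On success, *body_len
 * receives the number of characters between the delimiters. */
size_t scan_triple_quoted(const char *s, size_t *body_len)
{
    if (strncmp(s, "'''", 3) != 0)
        return 0;                     /* no opening delimiter */

    size_t i = 3;                     /* position just past the opener */
    for (;;) {
        if (s[i] == '\0')
            return 0;                 /* <<EOF>>: unterminated */
        if (s[i] != '\'') {
            i++;                      /* [^']+ : ordinary body character */
            continue;
        }
        size_t run = strspn(s + i, "'");   /* length of this run of quotes */
        if (run == 3) {               /* ''' : terminator */
            *body_len = i - 3;
            return i + 3;
        }
        i += run;                     /* 1, 2, or >= 4 quotes: body text */
    }
}
```

For example, on '''a'a''a''''a''' this reports a 17-character token with an 11-character body, while on '''a'''a''' it stops after the first 7 characters, so that input is not a single comment.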

Note:

The above code won't provide what you expect from the second example, because I don't think it is possible (or, alternatively, you need to be clearer about which of the two possible analyses is correct, and why). Consider the possible comments:

'''a'''
'''a''''
'''a''''a'''

According to your description (and your third example), the third one should match, with the internal value a''''a, because '''' is more than three quotes. But according to your second example (slightly modified), the second one should match, with the internal value a', because the final ''' is taken as the terminator. The question is, how are these two interpretations supposed to be distinguished? In other words, what clue does the lexical scanner have that in the second case the token ends at ''' while in the third it doesn't? Since both of these are part of an input stream, there could be arbitrary text following. And since these are supposed to be multi-line comments, there's no a priori reason to believe that a newline character isn't part of the token.

So I made an arbitrary choice about which interpretation to choose. I could have made the other arbitrary choice, but then a different example wouldn't work.
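For comparison, here is a sketch of what the other choice might look like, again as a plain C simulation rather than flex rules. It assumes one particular reading of the alternative: any run of three or more quotes ends the token, and only the last three quotes of that run are the delimiter. That reading, and the name scan_triple_quoted_alt, are assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper illustrating the alternative interpretation:
 * a run of three or more quotes terminates the token, and the last three
 * quotes of that run form the closing delimiter.
 *
 * Returns the token length (delimiters included), or 0 if `s` does not
 * begin with ''' or the token is unterminated. On success, *body_len
 * receives the number of characters between the delimiters. */
size_t scan_triple_quoted_alt(const char *s, size_t *body_len)
{
    if (strncmp(s, "'''", 3) != 0)
        return 0;                     /* no opening delimiter */

    size_t i = 3;                     /* position just past the opener */
    while (s[i] != '\0') {
        if (s[i] != '\'') {
            i++;                      /* ordinary body character */
            continue;
        }
        size_t run = strspn(s + i, "'");   /* length of this run of quotes */
        i += run;
        if (run >= 3) {               /* run ends the token */
            *body_len = i - 6;        /* minus opener and closer */
            return i;
        }
        /* one or two quotes: part of the body, keep scanning */
    }
    return 0;                         /* reached end of input: unterminated */
}
```

Under this reading the eight-quote example is accepted with the body '', but '''a'a''a''''a''' is cut short after a'a''a', and the nine-quote example is accepted with the body ''' even though it is meant to be incorrect.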
