Flex matching wrong rule

Time:10-10

I am writing a compiler for the mini-L language and am nearly done with the scanner, the part that finds tokens in the input. My problem is that when I type in "function", it is matched as an identifier. I can't really tell what's wrong.

ident           [a-zA-Z][a-zA-Z0-9_]+[^_]

%{
#include <stdio.h>
#include <stdlib.h>

int col = 0;
int row = 0;
%}
%%

"function"      {printf("FUNCTION\n"); col += yyleng;}
{ident}    {printf("IDENT %s\n", yytext); col += yyleng;}
.               {printf("Error at line %d, column %d: unrecognized symbol \"%s\"\n", row, col, yytext);}

%%
int main(int argc, char** argv){
    
    if(argc > 1){
        FILE *fp = fopen(argv[1], "r"); 
        if(fp){
            yyin = fp;
        }
    }

    printf("Give me the input:\n");
    yylex();
}

I tried rearranging the rules but that didn't work and that doesn't seem like the problem anyway. I think it's a problem with my regex. Any help is much appreciated.

CodePudding user response:

So here's your ident pattern: [a-zA-Z][a-zA-Z0-9_]+[^_]. Let's break that down:

  • [a-zA-Z] A single letter.
  • [a-zA-Z0-9_]+ One or more (+) of a letter, a digit, or a _
  • [^_] Anything other than _. Anything. Such as a space.

One simple observation is that the pattern can't match fewer than three characters. On the other hand, the last one could be almost anything. It could be a letter. But it could also be a comma, or a space, or even a newline. Just not an underscore.

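So that won't match x, which is usually considered a valid identifier. But it will match "function " (that is, the word function followed by a space). Since that is longer than function, and flex always prefers the longest match, the identifier pattern will win unless you happened to write function_. Rule order only breaks ties between matches of the same length, which is why rearranging the rules made no difference.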

So that's probably not what you wanted.
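
If you want to check this outside flex, here is a minimal C sketch using POSIX regex.h. It relies on regexec being specified to report the longest match at the leftmost position, which is close enough to flex's longest-match rule for this purpose. The helper name match_len is made up for illustration; the patterns are the ones from your scanner.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: length of the longest match of `pattern` at the
   very start of `input`, or -1 if the pattern does not match there. */
static int match_len(const char *pattern, const char *input)
{
    char anchored[256];
    regex_t re;
    regmatch_t m;
    int len = -1;

    assert(pattern != NULL && input != NULL);

    /* Anchor the pattern so only a match at offset 0 counts,
       the way a scanner matches at the current input position. */
    snprintf(anchored, sizeof anchored, "^(%s)", pattern);

    if (regcomp(&re, anchored, REG_EXTENDED) != 0)
        return -1;                      /* pattern failed to compile */
    if (regexec(&re, input, 1, &m, 0) == 0)
        len = (int)(m.rm_eo - m.rm_so); /* length of the whole match */
    regfree(&re);
    return len;
}
```

With that helper, match_len("[a-zA-Z][a-zA-Z0-9_]+[^_]", "function ") should give 9, while match_len("function", "function ") gives 8, so the identifier rule is the longer match. For the input "x" the identifier pattern gives -1, meaning no match at all.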

Note that if you want to avoid identifiers ending with an underscore, you have to deal with two issues.

  1. The obvious pattern [[:alpha:]][[:alnum:]_]*[[:alnum:]] can't match single letters. So you need [[:alpha:]]([[:alnum:]_]*[[:alnum:]])?.

  2. If your identifier pattern doesn't match a trailing underscore, the underscore will be left over for the next token. Maybe you're OK with that. But it's probably better to match the underscore and issue an error message. Or reconsider the restriction.
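
To make both issues concrete, here is a rough hand-written C equivalent of [[:alpha:]]([[:alnum:]_]*[[:alnum:]])? in the default C locale. It is only a sketch of what the pattern accepts, not something you would put in the flex file, and the name ident_len is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: length of the identifier at the start of `s`
   under the rule "a letter, then letters, digits or underscores, but
   not ending in an underscore". Returns 0 if `s` does not start with
   a letter. */
static size_t ident_len(const char *s)
{
    size_t i, end;

    assert(s != NULL);

    if (!isalpha((unsigned char)s[0]))
        return 0;

    end = 1;                    /* a single letter is already valid */
    for (i = 1; isalnum((unsigned char)s[i]) || s[i] == '_'; i++) {
        if (s[i] != '_')
            end = i + 1;        /* remember the last non-underscore */
    }
    return end;
}
```

Here ident_len("x") is 1 and ident_len("foo_bar") is 7, but ident_len("foo_") is only 3: the trailing underscore is not consumed and is left for whatever rule runs next, which is the situation described in point 2.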

Note: The pattern syntax is documented in the Flex manual, including the built-in character classes I used above.
