I have tokens that I need to parse, and I would like a regex that captures them. Here is how the tokens look:
EVAL
INPUT A = 5;
INPUT B = 6;
...
INPUT LongVariableName = 10;
I want to validate that every EVAL block is formatted correctly for parsing. A naive approach I have is to take these tokens and build a string out of them like so:
"EVAL:INPUT A = 5;#...#INPUT EXAMPLE = 10;#"
where we signify that we have an EVAL with a : separating it from the inputs. Each input is then delimited by a #.
Or, getting into the spirit of regex, EVAL:(INPUT ID = NUM;#)+, where
- ID begins with an alphabetic character but can contain any number of alphanumeric characters after it, and NUM is any nonnegative integer;
- there must always be at least one INPUT.
I want this regex to make sure that, within the string that we build, each input is instantiated (with INPUT), named (with ID), assigned a value and terminated (with = NUM;), and delimited correctly (with #).
I am, however, having trouble designing the regex so that it can capture variable names of any length greater than one.
CodePudding user response:
For validation, you can match the string against the following regular expression (with the case-insensitive and multiline flags set) to determine which blocks are formatted correctly:
^EVAL *\r?\n(?:INPUT [a-z][a-z\d]* = \d+; *\r?\n)+
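As a rough illustration, here is how such a validation might look in Python. This is only a sketch: the names `BLOCK_RE` and `is_valid_block` are made up for the example, it assumes each INPUT line ends with a newline, and it spells out `[A-Za-z]` and `[0-9]` rather than relying on a case-insensitive flag.

```python
import re

# Assumed pattern: one EVAL line followed by one or more INPUT lines.
# ID is a letter followed by any number of letters or digits; NUM is one or more digits.
BLOCK_RE = re.compile(
    r"EVAL *\r?\n(?:INPUT [A-Za-z][A-Za-z0-9]* = [0-9]+; *\r?\n)+"
)

def is_valid_block(block: str) -> bool:
    """Return True if the whole string is a single well-formed EVAL block."""
    return BLOCK_RE.fullmatch(block) is not None

print(is_valid_block("EVAL\nINPUT A = 5;\nINPUT LongVariableName = 10;\n"))  # True
print(is_valid_block("EVAL\nINPUT 9lives = 6;\n"))  # False: ID starts with a digit
print(is_valid_block("EVAL\n"))                     # False: no INPUT line
```

Because `fullmatch` is used, the `^` anchor is not needed here; to find blocks inside a longer text you would instead search with `^` and the multiline flag.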
For formatting, replacing matches of the following regular expression with a space will get you close:
\r?\n(?!EVAL\b)
If the string looks like this:
EVAL
INPUT A = 5;
INPUT B = 6;
INPUT LongVariableName = 10;
EVAL
INPUT Dog = 55;
INPUT Cat9Lives = 6;
the following string will be the result:
EVAL INPUT A = 5; INPUT B = 6; INPUT LongVariableName = 10;
EVAL INPUT Dog = 55; INPUT Cat9Lives = 6;
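A minimal Python sketch of that replacement, assuming the text uses `\n` or `\r\n` line endings (the function name `flatten_blocks` is hypothetical):

```python
import re

def flatten_blocks(text: str) -> str:
    """Replace every line break that is not followed by EVAL with a space."""
    return re.sub(r"\r?\n(?!EVAL\b)", " ", text)

text = (
    "EVAL\n"
    "INPUT A = 5;\n"
    "INPUT B = 6;\n"
    "INPUT LongVariableName = 10;\n"
    "EVAL\n"
    "INPUT Dog = 55;\n"
    "INPUT Cat9Lives = 6;"
)

print(flatten_blocks(text))
# EVAL INPUT A = 5; INPUT B = 6; INPUT LongVariableName = 10;
# EVAL INPUT Dog = 55; INPUT Cat9Lives = 6;
```

Note that a trailing line break at the very end of the text is also replaced by a space, so you may want to strip the result.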
CodePudding user response:
"I am, however, having trouble designing the regex so that it can capture variable names of any length greater than one."
I suggest that you construct a state machine for this. You could define the states your lexer can be in and process the string based on the current state.
You could use a loop, and inside it write several if statements that detect the state of the lexer and perform the corresponding action.
So initially your lexer's state would be set to read # followed by the string INPUT; after reading it, the lexer's state would be set to read ID. For this you could use sub-states (say readID), i.e. initially the lexer's readID state would be set to a constant READ_FIRST_CHAR (that you define as a macro or enum), which accepts any alphabetic character. Right after reading the first character of ID, you set the readID state to READ_INTERIOR_CHAR, which accepts any alphanumeric character.
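The idea could be sketched in Python roughly as follows. This is one possible arrangement under my own assumptions, not the only way to do it: it validates the flattened `EVAL:INPUT A = 5;#...` string from the question, treats `#` as the terminator of each input rather than a leading marker, and every state name other than READ_FIRST_CHAR and READ_INTERIOR_CHAR (as well as the function name `is_valid_eval`) is invented for the example.

```python
import string
from enum import Enum, auto

class State(Enum):
    READ_INPUT = auto()          # expect the literal "INPUT "
    READ_FIRST_CHAR = auto()     # first character of ID: a letter
    READ_INTERIOR_CHAR = auto()  # rest of ID: letters or digits, ended by a space
    READ_EQUALS = auto()         # expect the literal "= "
    READ_FIRST_DIGIT = auto()    # first digit of NUM
    READ_DIGITS = auto()         # more digits of NUM, or the closing ";#"

def is_valid_eval(s: str) -> bool:
    """Check a string of the form EVAL:(INPUT ID = NUM;#)+ one step at a time."""
    prefix = "EVAL:"
    if not s.startswith(prefix):
        return False
    i = len(prefix)
    state = State.READ_INPUT
    inputs_seen = 0
    while i < len(s):
        ch = s[i]
        if state is State.READ_INPUT:
            if not s.startswith("INPUT ", i):
                return False
            i += len("INPUT ")
            state = State.READ_FIRST_CHAR
        elif state is State.READ_FIRST_CHAR:
            if ch not in string.ascii_letters:
                return False
            i += 1
            state = State.READ_INTERIOR_CHAR
        elif state is State.READ_INTERIOR_CHAR:
            if ch in string.ascii_letters or ch in string.digits:
                i += 1                      # ID may be arbitrarily long
            elif ch == " ":
                i += 1
                state = State.READ_EQUALS
            else:
                return False
        elif state is State.READ_EQUALS:
            if not s.startswith("= ", i):
                return False
            i += len("= ")
            state = State.READ_FIRST_DIGIT
        elif state is State.READ_FIRST_DIGIT:
            if ch not in string.digits:
                return False
            i += 1
            state = State.READ_DIGITS
        else:  # State.READ_DIGITS
            if ch in string.digits:
                i += 1
            elif s.startswith(";#", i):
                i += len(";#")
                inputs_seen += 1
                state = State.READ_INPUT
            else:
                return False
    # Valid only if we stopped between inputs and saw at least one of them.
    return state is State.READ_INPUT and inputs_seen > 0

print(is_valid_eval("EVAL:INPUT A = 5;#INPUT LongVariableName = 10;#"))  # True
print(is_valid_eval("EVAL:INPUT 9x = 1;#"))                              # False
```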
I hope you get the idea. The language you defined is very simple and I don't think you would need a parser for it.