How to recognize a field name without quotation marks with a regex?

Time:08-08

I am writing a parser to recognize strings of the form TERM : MATCH_TERM

Initially, the TERM part could only be written between single or double quotes ('TERM' or "TERM"). I would now like to allow the TERM part to be written without quotes as well.

My regexes for recognizing the TERM part surrounded by quotes work well and are as follows:

const StringDoubleQuote = createToken({ name: "StringDoubleQuote", pattern: /"[^"\\]*(?:\\.[^"\\]*)*"/ });
const StringSimpleQuote = createToken({ name: "StringSimpleQuote", pattern: /'[^'\\]*(?:\\.[^'\\]*)*'/ });

For these regular expressions I didn't need to specify the : character as marking the end of the token because the quote mark already had that usage.
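As a quick check of that claim, here is a minimal sketch that runs the double-quote pattern on its own, outside chevrotain. The ^ anchor and the helper name matchQuoted are assumptions added only for this standalone demo; they are not part of the original token definitions.

```javascript
// Same pattern as StringDoubleQuote, anchored with ^ only for this standalone demo.
const doubleQuoted = /^"[^"\\]*(?:\\.[^"\\]*)*"/;

// Hypothetical helper: returns the quoted prefix of the input, or null if there is none.
function matchQuoted(input) {
  const m = doubleQuoted.exec(input);
  return m ? m[0] : null;
}

// The match stops at the closing quote, so the colon is never consumed.
const plain = matchQuoted('"orderInfo.orderDate":*');

// Escaped quotes inside the term are skipped by the \\. part of the pattern.
const escaped = matchQuoted('"a \\"quoted\\" term":*');

console.log(plain);
console.log(escaped);
```

In both cases the match ends at the closing quote and leaves :* for the following tokens.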

To make it possible to write the TERM part without quotation marks around it, I used the following regular expression:

const StringWithoutQuote = createToken({ name: "StringWithoutQuote", pattern: /[\w!@#\$%\^&.-]+/ });

For the lexer part of the parser, which is built with chevrotain, I wrote the following code to define the possible tokens:

const StringWithoutQuote = createToken({ name: "StringWithoutQuote", pattern: /[\w!@#\$%\^&.-]+/ });
const StringDoubleQuote = createToken({ name: "StringDoubleQuote", pattern: /"[^"\\]*(?:\\.[^"\\]*)*"/ });
const StringSimpleQuote = createToken({ name: "StringSimpleQuote", pattern: /'[^'\\]*(?:\\.[^'\\]*)*'/ });
const And = createToken({ name: "And", pattern: /(AND|and)/ });
const Or = createToken({ name: "Or", pattern: /(OR|or)/ });
const WhiteSpace = createToken({
    name: "WhiteSpace",
    pattern: /[ \t\n\r]+/,
    group: Lexer.SKIPPED
});
const Colon = createToken({ name: "Colon", pattern: /:/ });
const Star = createToken({ name: "Star", pattern: /\*/ });

// ... many other tokens

const allTokens = [
    WhiteSpace,
    Colon,
    Star,
    And,
    Or,

    //At this level I also tried to put StringWithoutQuote after the string tokens with quotes in the allTokens array

    StringWithoutQuote,
    StringDoubleQuote,
    StringSimpleQuote
];

Here is my problem. Take two example strings:

  • aggregateType:*

  • orderInfo.orderDate:*

For the first string (aggregateType:*), the parser works well and returns the expected result whichever syntax is used (with or without quotes).

For the second string (orderInfo.orderDate:*), the quoted syntax ('orderInfo.orderDate':* or "orderInfo.orderDate":*) also works and returns the expected result.

But with the unquoted syntax (orderInfo.orderDate:*), the parser returns the following error: Error: Failing to parse of string <orderInfo.orderDate:*>. I'm not sure, but I suspect the dot . in the TERM part is causing the error. However, my regex (/[\w!@#\$%\^&.-]+/) does include the dot . among the special characters allowed in the token.

Does anyone see what I did wrong that is causing this behavior?

Thanks in advance if you take the time to help me.

CodePudding user response:

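In the end, the cause was not the dot. The Or token is listed before StringWithoutQuote, and its pattern matches the first two characters of orderInfo, so the lexer produced an Or token followed by the rest of the name. The solution was to define a "longer alternative" (the longer_alt property) on the keyword tokens: each time the lexer matches a keyword (here and or or), it also checks whether a less specific token (here StringWithoutQuote) matches a longer piece of text at the same position.

A more detailed explanation is given at the end of the constructor paragraph in the chevrotain documentation.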

The definition of the lexer thus becomes:

const StringWithoutQuote = createToken({ name: "StringWithoutQuote", pattern: /[\w!@#\$%\^&.-]+/ });
const StringDoubleQuote = createToken({ name: "StringDoubleQuote", pattern: /"[^"\\]*(?:\\.[^"\\]*)*"/ });
const StringSimpleQuote = createToken({ name: "StringSimpleQuote", pattern: /'[^'\\]*(?:\\.[^'\\]*)*'/ });
const And = createToken({ name: "And", pattern: /(AND|and)/, longer_alt: StringWithoutQuote });  // the longer_alt property is added here
const Or = createToken({ name: "Or", pattern: /(OR|or)/, longer_alt: StringWithoutQuote });  // the longer_alt property is added here
const WhiteSpace = createToken({
    name: "WhiteSpace",
    pattern: /[ \t\n\r]+/,
    group: Lexer.SKIPPED
});
const Colon = createToken({ name: "Colon", pattern: /:/ });
const Star = createToken({ name: "Star", pattern: /\*/ });
 
const allTokens = [
    WhiteSpace,
    Colon,
    Star,
    And,
    Or,
    StringWithoutQuote,
    StringDoubleQuote,
    StringSimpleQuote
];
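
To see why the token order produced the error and what longer_alt changes, here is a simplified simulation of a first-match lexer in plain JavaScript. It is only a sketch and not chevrotain's implementation: the tokenize, matchAt and makeKeywords helpers are hypothetical, tokens are plain objects instead of createToken results, and whitespace and quoted strings are left out.

```javascript
// Simplified stand-in for a first-match lexer; this is NOT chevrotain's implementation.
// Patterns use the sticky flag (y) so they only match at the current offset.
const StringWithoutQuote = { name: "StringWithoutQuote", pattern: /[\w!@#\$%\^&.-]+/y };
const Colon = { name: "Colon", pattern: /:/y };
const Star = { name: "Star", pattern: /\*/y };

// Hypothetical helper: builds the keyword tokens with or without a longer alternative.
function makeKeywords(useLongerAlt) {
  const alt = useLongerAlt ? StringWithoutQuote : undefined;
  return [
    { name: "And", pattern: /(AND|and)/y, longer_alt: alt },
    { name: "Or", pattern: /(OR|or)/y, longer_alt: alt },
  ];
}

// Returns the text matched by the token exactly at offset, or null.
function matchAt(token, input, offset) {
  token.pattern.lastIndex = offset;
  const m = token.pattern.exec(input);
  return m ? m[0] : null;
}

function tokenize(input, useLongerAlt) {
  // Same relative order as allTokens: keywords come before StringWithoutQuote.
  const tokens = [Colon, Star, ...makeKeywords(useLongerAlt), StringWithoutQuote];
  const result = [];
  let offset = 0;
  while (offset < input.length) {
    let found = null;
    for (const token of tokens) {
      const image = matchAt(token, input, offset);
      if (image === null) continue;
      found = { name: token.name, image };
      // If a longer alternative matches more text at this offset, prefer it.
      if (token.longer_alt) {
        const altImage = matchAt(token.longer_alt, input, offset);
        if (altImage !== null && altImage.length > image.length) {
          found = { name: token.longer_alt.name, image: altImage };
        }
      }
      break; // the first token that matches wins
    }
    if (found === null) throw new Error(`Unexpected character at offset ${offset}`);
    result.push(found);
    offset += found.image.length;
  }
  return result.map((t) => `${t.name}(${t.image})`);
}

const withoutAlt = tokenize("orderInfo.orderDate:*", false);
const withAlt = tokenize("orderInfo.orderDate:*", true);
console.log(withoutAlt.join(" "));
console.log(withAlt.join(" "));
```

Without the longer alternative, the simulation splits the field name into an Or token (or) and a StringWithoutQuote token (derInfo.orderDate), which the grammar does not expect. With it, the whole field name comes out as a single StringWithoutQuote token, while a standalone or is still recognized as Or because the alternative is not longer.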