My code fails to detect operators when the query also contains non-English characters:
const OPERATOR_REGEX = new RegExp(
/(?!\B"[^"|“|”]*)\b(and|or|not|exclude)(?=.*[\s])\b(?![^"|“|”]*"\B)/,
'giu'
);
const query1 = '(Java or "化粧" or 化粧品)';
const query2 = '(Java or 化粧 or 化粧品)';
console.log(query1.split(OPERATOR_REGEX));
console.log(query2.split(OPERATOR_REGEX));
https://codepen.io/thewebtud/pen/vYraavd?editors=1111
However, the same pattern detects all operators on regex101.com when the unicode flag is set: https://regex101.com/r/FC84BH/1
How can this be fixed for JS?
CodePudding user response:
In JavaScript, \w, \b and \B stay ASCII-only even with the u flag, so CJK characters are treated as non-word characters. Keeping in mind that
\b (word boundary) can be written as (?:(?<=^)(?=\w)|(?<=\w)(?=$)|(?<=\W)(?=\w)|(?<=\w)(?=\W)),
that \B (non-word boundary) can be written as (?:(?<=^)(?=\W)|(?<=\W)(?=$)|(?<=\W)(?=\W)|(?<=\w)(?=\w)),
and that a Unicode-aware \w pattern is [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]
(see "Replace certain arabic words in text string using Javascript"), here is the ECMAScript 2018 solution:
// Unicode-aware equivalents of \w and \W (property escapes require the u flag)
const w = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;
const nw = String.raw`[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;
// Unicode-aware equivalents of \b and \B, built from lookbehind/lookahead pairs
const uwb = String.raw`(?:(?<=^)(?=${w})|(?<=${w})(?=$)|(?<=${nw})(?=${w})|(?<=${w})(?=${nw}))`;
const unwb = String.raw`(?:(?<=^)(?=${nw})|(?<=${nw})(?=$)|(?<=${nw})(?=${nw})|(?<=${w})(?=${w}))`;
const OPERATOR_REGEX = new RegExp(
String.raw`(?!${unwb}"[^"“”]*)${uwb}(and|or|not|exclude)(?=.*\s)${uwb}(?![^"“”]*"${unwb})`,
'giu'
);
const query1 = '(Java or "化粧" or 化粧品)';
const query2 = '(Java or 化粧 or 化粧品)';
console.log(query1.split(OPERATOR_REGEX));
console.log(query2.split(OPERATOR_REGEX));
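To see why the original pattern behaves differently in JavaScript than on regex101.com, the following minimal sketch (not part of the original answer) checks how \w, \b and \B treat a CJK character. The variable names are illustrative only, and the last check uses a simplified start-of-word test rather than the full uwb pattern above.

```javascript
// Sketch: how JavaScript's built-in word classes treat a CJK character.
const cjk = '化';

// \w stays ASCII-only ([A-Za-z0-9_]) in JavaScript, even with the u flag.
const asciiWord = /\w/u.test(cjk);

// Unicode property escapes (ES2018) do recognise the character as a letter.
const unicodeWord = /\p{Alphabetic}/u.test(cjk);

// Between `"` and `化` JavaScript sees two non-word characters,
// so \B matches at that position and \b does not.
const nonBoundary = /"\B化/u.test('"化粧"');
const boundary = /"\b化/u.test('"化粧"');

// Simplified Unicode-aware "start of word" check built from lookarounds:
// the previous character is not a word character, the next one is.
const w = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;
const unicodeBoundary = new RegExp(String.raw`"(?<!${w})(?=${w})化`, 'u').test('"化粧"');

console.log({ asciiWord, unicodeWord, nonBoundary, boundary, unicodeBoundary });
// → { asciiWord: false, unicodeWord: true, nonBoundary: true, boundary: false, unicodeBoundary: true }
```

Because \B matches right after the opening quote in "化粧", the quote-exclusion lookahead in the original pattern treats the preceding operator as if it were inside quotes, which is why it is skipped.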