Regex to extract search terms is not working as expected

Time:03-23

I have the test string ti: harry Potter OR kw: magic AND sprint: title OR ti: HARRY

and want the output as

["ti: harry Potter OR kw:", "kw: magic AND sprint:", "sprint: title OR ti:", "ti: HARRY"]

but the output I am getting is

["ti: harry Potter OR kw:", "kw: magic AND sprint:", "nt: title OR ti:", "ti: HARRY"]

It is taking only 2 characters before the colon. The regex I am using is

const match = /[a-z0-9]{2}:.*?($|[a-z0-9]{2}:)/g;

and I am extracting it and putting it in an array

I tried replacing it with /[a-z0-9]+:.*?($|[a-z0-9]+:)/g; but when I increment lastIndex and push the matches to parsed, the results come out wrong (this is included in the code as well).

I tried changing the {2} to n and that is also not working as expected.

const parsed = [];
const match = /[a-z0-9]{2}:.*?($|[a-z0-9]{2}:)/g;
const message = "ti: harry Potter OR kw: magic AND sprint: title OR ti: HARRY";
let next = match.exec(message);
while (next) {
  parsed.push(next[0]);
  match.lastIndex = next.index + 1;
  next = match.exec(message);
  console.log("next again", next);
}

console.log("parsed", parsed);

https://codesandbox.io/s/regex-forked-6op514?file=/src/index.js

CodePudding user response:

For the desired matches, you might use a pattern that also matches AND or OR before the next key, and read the match from capture group 1, which is denoted by m[1] in the example code.

\b(?=([a-z0-9]+:.*?(?: (?:AND|OR) [a-z0-9]+:|$)))

In parts, the pattern matches:

  • \b A word boundary to prevent a partial match
  • (?= Positive lookahead to assert what is on the right is
    • ( Capture group 1
      • [a-z0-9]+: Match 1 or more chars a-z0-9 followed by :
      • .*? Match any char except a newline, as few as possible
      • (?: Non capture group
        • (?:AND|OR) [a-z0-9]+: Match either AND or OR (preceded by a space), followed by a space, 1 or more chars a-z0-9 and :
        • | Or
        • $ Assert the end of the string
      • ) Close non capture group
    • ) Close group 1
  • ) Close the lookahead
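
To illustrate why the lookahead and the \b both matter, here is a small sketch that runs the pattern with and without the word boundary on the string from the question. The helper name extract is made up for this illustration, and the sketch assumes each key is [a-z0-9]+ (one or more characters) followed by a colon.

```javascript
// Hypothetical helper for this illustration: collect capture group 1 of every match.
const extract = (regex, text) => Array.from(text.matchAll(regex), m => m[1]);

const str = "ti: harry Potter OR kw: magic AND sprint: title OR ti: HARRY";

// The lookahead consumes no characters, so a key such as "kw:" can end one
// result and also start the next one.
const withBoundary = extract(/\b(?=([a-z0-9]+:.*?(?: (?:AND|OR) [a-z0-9]+:|$)))/g, str);

// Without \b the lookahead also succeeds in the middle of a key.
const withoutBoundary = extract(/(?=([a-z0-9]+:.*?(?: (?:AND|OR) [a-z0-9]+:|$)))/g, str);

console.log(withBoundary);
// ["ti: harry Potter OR kw:", "kw: magic AND sprint:", "sprint: title OR ti:", "ti: HARRY"]
console.log(withoutBoundary.slice(0, 2));
// ["ti: harry Potter OR kw:", "i: harry Potter OR kw:"]
```

Without the boundary, every suffix of a key ("i:", "print:", "rint:" and so on) starts its own match, which is most likely the odd behaviour seen in the question when moving lastIndex forward by one.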


const regex = /\b(?=([a-z0-9]+:.*?(?: (?:AND|OR) [a-z0-9]+:|$)))/gm;
const str = `ti: harry Potter OR kw: magic AND sprint: title OR ti: HARRY`;
const result = Array.from(str.matchAll(regex), m => m[1]);
console.log(result);
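
If you would rather keep the exec loop from the question, one possible variation is to capture the trailing key and rewind lastIndex to where that key starts, so the next match begins there. This is only a sketch: the extra capture group around the trailing key and the rewind arithmetic are assumptions added for illustration, not part of the pattern above.

```javascript
const message = "ti: harry Potter OR kw: magic AND sprint: title OR ti: HARRY";
// Group 1 captures the trailing key (for example "kw:") when there is one.
const match = /\b[a-z0-9]+:.*?(?: (?:AND|OR) ([a-z0-9]+:)|$)/g;
const parsed = [];

let next = match.exec(message);
while (next) {
  parsed.push(next[0]);
  if (next[1] !== undefined) {
    // Rewind so the next search starts at the trailing key instead of after it.
    match.lastIndex = next.index + next[0].length - next[1].length;
  }
  next = match.exec(message);
}

console.log("parsed", parsed);
// parsed ["ti: harry Potter OR kw:", "kw: magic AND sprint:", "sprint: title OR ti:", "ti: HARRY"]
```

The last match has no trailing key, so lastIndex is left at the end of the string and the following exec call returns null, which ends the loop.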
