I am using this code to split text into segments:
let value = "(germany or croatia) and (denmark or hungary)"
let tokens = value.split(/(?=[()\s])|(?<=[()\s])/g).filter(segment => segment.trim() != '');
This produces the following array:
['(', 'germany', 'or', 'croatia', ')', 'and', '(', 'denmark', 'or', 'hungary', ')']
How should I rewrite regex so that it would be able to split this string:
(germany*or*croatia)*and*(denmark*or*hungary)
into
['(', 'germany', '*or*', 'croatia', ')', '*and*', '(', 'denmark', '*or*', 'hungary', ')']
The problem is that split paramrs *or*
and *and*
are of multiple characters and just using
let tokens = value.split(/(?=[()\s\*or\*\*and\*])|(?<=[()\s\*or\*\*and\*])/g).filter(segment => segment.trim() != '');
will not work.
CodePudding user response:
Instead of matching the space between tokens, you will have an easier time trying to the tokens themselves.
const tokenizer = /\(|\)|\*[^()*\s] \*|[^()*\s] /g;
const value = "(germany*or*croatia)*and*(denmark*or*hungary)";
const tokens = value.matchAll(tokenizer);
console.log(Array.from(tokens, match => match[0]));
Note that this isn't very robust, as any unexpected token is just ignored silently. It's also very generic; you might have an easier time specifically looking for the list of allowed operators, like *or*
and *and*
, instead of producing a token for any *something*
found.
If you want to validate each token further, you can wrap each one in a capture group, and add a capture group for whitespace and for any unexpected leftovers. Keep in mind the order of |
alternatives in a regex matters!
const tokenizer = /(\()|(\))|(\*[^()*\s] \*)|([^()*\s] )|(\s )|(. )/g;
const symbols = [
Symbol("open-paren"),
Symbol("close-paren"),
Symbol("*operator*"),
Symbol("term"),
Symbol("whitespace"),
Symbol("unexpected"),
];
const value = "(germany*or*croatia)*and*(denmark or hungary)***";
const matches = value.matchAll(tokenizer);
for (match of matches) {
const str = match[0];
const groups = match.slice(1);
const tokenType = symbols[groups.findIndex(capture => capture !== undefined)];
console.log([tokenType.toString(), str]);
}
This kind of tokenizing regex was inspired by one of Douglas Crockford's books, where he uses something similar to transpile his own programming language into JS -- I forget which one.
CodePudding user response:
Could you try this expression?
/(\()|(\))|(\*or\*)|(\*and\*)|[\w]*/g