split on a multi character delimiter using regex expression

Time:09-17

I am using this code to split text into segments:

let value = "(germany or croatia) and (denmark or hungary)"
let tokens = value.split(/(?=[()\s])|(?<=[()\s])/g).filter(segment => segment.trim() != '');

This produces the following array:

['(', 'germany', 'or', 'croatia', ')', 'and', '(', 'denmark', 'or', 'hungary', ')']

How should I rewrite the regex so that it can split this string:

(germany*or*croatia)*and*(denmark*or*hungary)

into

['(', 'germany', '*or*', 'croatia', ')', '*and*', '(', 'denmark', '*or*', 'hungary', ')']

The problem is that the split delimiters *or* and *and* consist of multiple characters, so just using

let tokens = value.split(/(?=[()\s\*or\*\*and\*])|(?<=[()\s\*or\*\*and\*])/g).filter(segment => segment.trim() != '');

will not work.

CodePudding user response:

Instead of matching the space between tokens, you will have an easier time matching the tokens themselves.

const tokenizer = /\(|\)|\*[^()*\s]+\*|[^()*\s]+/g;
const value = "(germany*or*croatia)*and*(denmark*or*hungary)";
const tokens = value.matchAll(tokenizer);
console.log(Array.from(tokens, match => match[0]));

Note that this isn't very robust, as any unexpected token is just ignored silently. It's also very generic; you might have an easier time specifically looking for the list of allowed operators, like *or* and *and*, instead of producing a token for any *something* found.
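As a minimal sketch of the stricter variant suggested above, the tokenizer can be built from an explicit list of allowed operators. The names operators, strictTokenizer and tokenize are made up for this illustration, and the operator list is an assumption based on the question:

```javascript
// Hypothetical list of allowed operators (an assumption; adjust to your grammar).
const operators = ["or", "and"];

// Builds the equivalent of /\(|\)|\*(?:or|and)\*|[^()*\s]+/g
const strictTokenizer = new RegExp(
    String.raw`\(|\)|\*(?:${operators.join("|")})\*|[^()*\s]+`,
    "g"
);

// Returns the matched tokens as an array of strings.
function tokenize(text) {
    return Array.from(text.matchAll(strictTokenizer), match => match[0]);
}

console.log(tokenize("(germany*or*croatia)*and*(denmark*or*hungary)"));
```

This still skips unmatched characters silently: for an unknown operator such as *xor*, the asterisks are dropped and xor comes out as an ordinary term.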

If you want to validate each token further, you can wrap each one in a capture group, and add a capture group for whitespace and for any unexpected leftovers. Keep in mind the order of | alternatives in a regex matters!

const tokenizer = /(\()|(\))|(\*[^()*\s]+\*)|([^()*\s]+)|(\s+)|(.+)/g;
const symbols = [
    Symbol("open-paren"),
    Symbol("close-paren"),
    Symbol("*operator*"),
    Symbol("term"),
    Symbol("whitespace"),
    Symbol("unexpected"),
];

const value = "(germany*or*croatia)*and*(denmark or hungary)***";
const matches = value.matchAll(tokenizer);

for (const match of matches) {
    const str = match[0];
    const groups = match.slice(1);
    
    const tokenType = symbols[groups.findIndex(capture => capture !== undefined)];
    console.log([tokenType.toString(), str]);
}
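
If you would rather fail loudly than log unexpected input, the same capture-group idea can be wrapped in a function. This is only a sketch: the names tokenPattern, tokenNames and tokenizeStrict are invented here, and it uses plain strings instead of Symbols for the token types.

```javascript
// Same alternatives as above: one capture group per token type, in priority order.
const tokenPattern = /(\()|(\))|(\*[^()*\s]+\*)|([^()*\s]+)|(\s+)|(.+)/g;
const tokenNames = ["open-paren", "close-paren", "operator", "term", "whitespace", "unexpected"];

// Returns an array of { type, text } objects; whitespace is skipped,
// and anything that lands in the last capture group raises a SyntaxError.
function tokenizeStrict(text) {
    const tokens = [];
    for (const match of text.matchAll(tokenPattern)) {
        const groupIndex = match.slice(1).findIndex(capture => capture !== undefined);
        const type = tokenNames[groupIndex];
        if (type === "unexpected") {
            throw new SyntaxError(`Unexpected input "${match[0]}" at index ${match.index}`);
        }
        if (type !== "whitespace") {
            tokens.push({ type, text: match[0] });
        }
    }
    return tokens;
}

console.log(tokenizeStrict("(germany*or*croatia)*and*(denmark or hungary)"));
```

Because the last alternative is greedy, a single stray character makes the rest of the line part of the "unexpected" match, so the error message shows everything from the first bad character onward.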

This kind of tokenizing regex was inspired by one of Douglas Crockford's books, where he uses something similar to transpile his own programming language into JS -- I forget which one.

CodePudding user response:

Could you try this expression?

/(\()|(\))|(\*or\*)|(\*and\*)|\w+/g
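
A possible way to use an expression like this is with String.prototype.match and the g flag, which returns all full matches. The sketch below uses \w+ for the terms, so that the last alternative cannot produce an empty match; the variable names are chosen just for the example.

```javascript
// Operators are listed explicitly; terms are runs of word characters.
const expression = /(\()|(\))|(\*or\*)|(\*and\*)|\w+/g;
const input = "(germany*or*croatia)*and*(denmark*or*hungary)";

// With the g flag, match() returns an array of full matches, or null if there are none.
const result = input.match(expression) || [];
console.log(result);
```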
