Matching sets consisting of letters plus non-letter characters

I want to match sets of characters that combine letters with non-letter characters. Most of them are a single letter or two letters.

const match = 'tɕ\'i mɑ mɑ ku ʂ ɪɛ'.match(/\b(p|p\'|m|f|t|t\'|n|l|k|k\'|h|tɕ|tɕ\'|ɕ|tʂ|tʂ\'|ʂ|ʐ|ts|ts\'|s)\b/g)
console.log(match)

I thought I could use \b, but that doesn't work because the sets contain "non-word" characters.

This is the current output:

[
  "t",
  "m",
  "m"
]

But I want this to be the output:

[
  "tɕ'",
  "m",
  "m",
  "k",
  "ʂ"
]

Note that some sets end with a non-word character, like tɕ'.

(In phonetic terms, these are the consonants.)

CodePudding user response：

As stated in the comments above, \b doesn't work with Unicode characters in JS. Moreover, judging from your expected output, you don't need word boundaries at all.
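To see why \b fails here, a minimal sketch (the variable names are only illustrative): in JavaScript, \b is defined in terms of \w, which covers just [A-Za-z0-9_], so an IPA letter such as ʂ is treated as a non-word character.

```javascript
// \w in JavaScript covers only [A-Za-z0-9_], so IPA letters such as ʂ count as
// "non-word" characters and \b never fires around them on their own.
const asciiBoundary = /\bm\b/.test('m'); // "m" is an ASCII word character
const ipaBoundary = /\bʂ\b/.test('ʂ');   // no word character on either side

// The u flag does not make \b Unicode-aware.
const ipaBoundaryUnicode = /\bʂ\b/u.test('ʂ');

console.log(asciiBoundary, ipaBoundary, ipaBoundaryUnicode);
// prints: true false false
```

This is also why the original pattern found "t" in tɕ'i: the position between t (a word character) and ɕ (a non-word character) counts as a boundary.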

You can use this shortened and refactored regex:

t[ɕʂs]'?|[tkp]'?|[tmfnlhshɕʐʂ]

Code:

const s = 'tɕ\'i mɑ mɑ ku ʂ ɪɛ';
// Multi-character consonants come first so that "tɕ'" is not cut short to "t".
const re = /t[ɕʂs]'?|[tkp]'?|[tmfnlhshɕʐʂ]/g;

console.log(s.match(re));

//=> ["tɕ'", "m", "m", "k", "ʂ"]


RegEx Details:

- t[ɕʂs]'?: Match t followed by any letter inside [...] and then an optional '
- |: OR
- [tkp]'?: Match the letter t, k, or p and then an optional '
- |: OR
- [tmfnlhshɕʐʂ]: Match any single letter inside [...]
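If the consonant list is likely to change, one alternative is to generate the alternation from the list itself instead of hand-compressing it. The following is only a sketch: buildConsonantRegex is a hypothetical helper, not part of the answer above, and it relies on sorting the symbols longest-first so that tɕ' is tried before tɕ and t.

```javascript
// Hypothetical helper: builds one global alternation from a list of symbols.
function buildConsonantRegex(symbols) {
  // Escape characters that have a special meaning inside a regex.
  const escapeForRegex = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const alternatives = [...symbols]
    .sort((a, b) => b.length - a.length) // longest first, so "tɕ'" beats "t"
    .map(escapeForRegex);
  return new RegExp(alternatives.join('|'), 'g');
}

// The same sets as in the question's original pattern.
const consonants = ['p', "p'", 'm', 'f', 't', "t'", 'n', 'l', 'k', "k'", 'h',
  'tɕ', "tɕ'", 'ɕ', 'tʂ', "tʂ'", 'ʂ', 'ʐ', 'ts', "ts'", 's'];

const generated = buildConsonantRegex(consonants);
const found = 'tɕ\'i mɑ mɑ ku ʂ ɪɛ'.match(generated);

console.log(found);
//=> ["tɕ'", "m", "m", "k", "ʂ"]
```

The generated pattern is longer than the hand-written one, but it stays in sync with the list and needs no manual refactoring when a symbol is added.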