Match 1 or more free standing non unicode alpha numeric characters

Time:10-06

I'm looking for a single pattern that matches free-standing runs of characters that are not Unicode alphanumerics. I will eventually replace each match with a single space.

Prerequisites

  • For alpha characters, the Unicode category \p{L} is required
  • For numeric characters, \d is adequate
  • Whitespace is included in the match

Match Examples

'/' stands for any character that is not a Unicode alphanumeric

aàa 111 /
   ^   ^^
aàa / 111
   ^^^
aàa /// 111
   ^^^^^
aàa/// 111
   ^^^^
aàa ///111
   ^^^^
aàa *&^#* 111
   ^^^^^^^
)(*)* 111
^^^^^^
à- 1
 ^^
à -1
 ^^

Unmatched Examples

aàa///111
aàa-111
aà-/*&^*-a-1-1-1

What I have so far

  • The pattern [^\p{L}\d] matches any single non-alphanumeric character.
  • Zero-width negative lookahead / lookbehind with word boundaries gets closer, e.g. (?<!\b)[^\p{L}\d](?!\b)
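
To illustrate the first bullet, here is a rough TypeScript sketch of what the bare class does on its own. The helper name nonAlnumChars is made up for this example, and the pattern is built with the RegExp constructor and the u flag so that \p{L} is recognized. Note that in JavaScript \d only covers ASCII digits.

```typescript
// The bare negated class from the first bullet: one character per match.
// Backslashes are doubled because the pattern is passed as a string.
const nonAlnum = new RegExp("[^\\p{L}\\d]", "gu");

// Hypothetical helper: list every character the class matches.
function nonAlnumChars(input: string): string[] {
  return input.match(nonAlnum) || [];
}

// "\u00e0" is the precomposed "à" from the examples.
// The hyphen matches here, although this string is in the unmatched list.
console.log(nonAlnumChars("a\u00e0a-111"));
// Space, slash, space come back as three separate matches, not one run.
console.log(nonAlnumChars("a\u00e0a / 111"));
```

So the class identifies the right characters but knows nothing about whether they are free-standing, which is the missing piece.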

However, a pattern that handles all of the examples above has been elusive.

Note: my spidey senses tell me this is likely possible with a single pattern. Though, if this is more efficient or practical as 2 separate patterns, so be it.

CodePudding user response:

\b word boundaries are problematic because those match a boundary between \w and \W, but you're not using \w and \W.
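
As a small illustration of that mismatch, the TypeScript sketch below compares \w with the class [\p{L}\d] on two characters. This is specific to JavaScript's regex engine, where \w is ASCII-only; other engines define \w differently, so treat the second result as an engine-dependent assumption. The helper name classify is made up for the example.

```typescript
// \w in JavaScript is [A-Za-z0-9_], even with the u flag.
const wordChar = new RegExp("\\w", "u");
// The class the question actually cares about: Unicode letters plus digits.
const alnumClass = new RegExp("[\\p{L}\\d]", "u");

// Hypothetical helper: report which of the two classes a character falls in.
function classify(ch: string): { w: boolean; inClass: boolean } {
  return { w: wordChar.test(ch), inClass: alnumClass.test(ch) };
}

// Underscore is a word character but not a letter or digit.
console.log("_", classify("_"));
// Precomposed "à" is a letter, but not a \w character in this engine.
console.log("\u00e0", classify("\u00e0"));
```

Since \b is defined in terms of \w and \W, it puts boundaries in places that don't line up with the letter/digit class.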

It looks like you always want whitespace on one side or the other of a match, so that needs to be worked in. Give this a try. It matches [^\p{L}\d\n]* either preceded or followed by [ \t]+.

[ \t]+[^\p{L}\d\n]*|[^\p{L}\d\n]*[ \t]+
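
For anyone who wants to try it outside a regex tester, here is a minimal TypeScript sketch that runs the pattern, in the form [ \t]+[^\p{L}\d\n]*|[^\p{L}\d\n]*[ \t]+, over some of the question's examples and does the single-space replace. The + quantifiers on [ \t] and the helper name collapse are assumptions of this sketch, and the u flag is needed for \p{L}.

```typescript
// Assumed form of the pattern, with + on each [ \t].
// Built via the RegExp constructor, so backslashes are doubled.
const freeStanding = new RegExp(
  "[ \\t]+[^\\p{L}\\d\\n]*|[^\\p{L}\\d\\n]*[ \\t]+",
  "gu"
);

// Hypothetical helper: replace each free-standing run with one space.
function collapse(input: string): string {
  return input.replace(freeStanding, " ");
}

// "\u00e0" is the precomposed "à" from the question's examples.
const samples: string[] = [
  "a\u00e0a / 111",        // -> "aàa 111"
  "a\u00e0a *&^#* 111",    // -> "aàa 111"
  ")(*)* 111",             // -> " 111"
  "\u00e0- 1",             // -> "à 1"
  "\u00e0 -1",             // -> "à 1"
  "a\u00e0a///111",        // unchanged, no whitespace on either side
  "a\u00e0-/*&^*-a-1-1-1"  // unchanged
];

for (const s of samples) {
  console.log(JSON.stringify(s), "->", JSON.stringify(collapse(s)));
}
```

The first alternative handles runs that start with whitespace, and the second handles runs that end with it; a run with no space or tab on either side matches neither, which is why the last two samples are left alone.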

Demo:

Regex101.com demo

