Dashes with(out) spaces with Python's regex

What I've managed to do

I'm new to both Python and regex. Using Python's re.compile, I wanted to find all kinds of dashes surrounded by spaces across a massive number of text files. I used:

search.results = re.compile(r'\s[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\u10EAD]\s')

(Yeah, I know about the regex module on PyPI, but I'm trying to stick with what I know better.) It seems to have worked fine: I got all kinds of dash-like characters with spaces around them.

What I'd like to do now

Now I'm trying to do the opposite: find all the dash-like characters that are not surrounded by spaces (that is, with a space only on the left, only on the right, or no spaces around them at all).

What I've tried

So I took the same regex as above and replaced first the \s at the beginning, then the \s at the end, and then both of them with \S (to match any character that is not whitespace). Now the regex suddenly seems to have gone mad: it finds all kinds of words rather than dashes and their neighbouring letters, which is what I expected. I've no idea what's going on.

What went wrong?

CodePudding user response：

To match a specific single-char pattern that is not in between two other chars, you can use one of the following patterns:

b(?!(?<=a.)c)
(?<!a)b|b(?!c)

where a and c can be the same chars.

The b(?!(?<=a.)c) pattern matches any b unless it is both immediately preceded by a and immediately followed by c. The lookahead is tried at the position right after b, and the lookbehind inside it checks that the two chars before that position are a and any one char (here, . is fine to use because all we want from the lookbehind pattern is to step back over the b).

Here, if you wanted to match a normal regular hyphen not in between whitespaces, you could use -(?!(?<=\s.)\s).
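As a quick check of the hyphen pattern above, here is a minimal sketch; the sample string is made up for illustration and only the plain ASCII hyphen is covered.

```python
import re

# Plain ASCII hyphen that is NOT enclosed by whitespace on both sides.
# The lookbehind (?<=\s.) is evaluated just after the hyphen: "." steps
# back over the hyphen itself and \s tests the character before it.
pattern = re.compile(r'-(?!(?<=\s.)\s)')

# Hypothetical sample: the first hyphen has spaces on both sides,
# the other three are missing a space on at least one side.
sample = "a - b, c-d, e -f, g- h"

starts = [m.start() for m in pattern.finditer(sample)]
print(starts)  # [8, 14, 19]
```

The hyphen at index 2 is skipped because it has whitespace on both sides; the other three are reported.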

If you put the character class of your choice into the pattern, it will look like this (note that \u10EAD is written as \U00010EAD here, because in Python's re a \u escape takes exactly four hex digits):

(?<!\s)[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\U00010EAD]|[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\U00010EAD](?!\s)

[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\U00010EAD](?!(?<=\s.)\s)

See the regex demo #1 and regex demo #2. The second is more efficient.
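To see that the two forms select the same dashes, here is a small sketch. The four-character class in DASHES, the sample text, and the spans helper are shortened stand-ins chosen for readability, not the full class from the question.

```python
import re

# Deliberately short, illustrative class (an assumption, not the full list):
# hyphen-minus, en dash, em dash, minus sign.
DASHES = r'[\-\u2013\u2014\u2212]'

# Form 1: not preceded by whitespace OR not followed by whitespace.
alternation = re.compile(rf'(?<!\s){DASHES}|{DASHES}(?!\s)')
# Form 2: a dash, unless it has whitespace on both sides.
single_pass = re.compile(rf'{DASHES}(?!(?<=\s.)\s)')

# En dash with spaces on both sides, em dash with none,
# minus sign with a space before only, hyphen with a space after only.
sample = "one \u2013 two, three\u2014four, five \u2212six, seven- eight"

def spans(rx, text):
    """Return the (start, end) span of every match of rx in text."""
    return [m.span() for m in rx.finditer(text)]

print(spans(alternation, sample) == spans(single_pass, sample))  # True
print(len(spans(single_pass, sample)))                           # 3
```

Only the en dash, which has a space on each side, is left out by both patterns.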

This technique is also described in my YouTube video "Matching dots or commas as (not) part of numbers".
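One detail in the question's character class is worth checking, since it may be part of why the \S variants started matching ordinary words: the class ends in \u10EAD. In a Python str pattern, \u consumes exactly four hex digits, so that escape is read as U+10EA (a Georgian letter) followed by a literal D. The sketch below only demonstrates that parsing rule; the variable names are made up.

```python
import re

# \u takes exactly four hex digits, so this class has two members:
# U+10EA and the letter "D".
four_digit = re.compile(r'[\u10EAD]')

# \U takes eight hex digits and therefore reaches U+10EAD itself.
eight_digit = re.compile(r'[\U00010EAD]')

print(bool(four_digit.fullmatch('D')))            # True
print(bool(four_digit.fullmatch('\u10EA')))       # True
print(bool(four_digit.fullmatch('\U00010EAD')))   # False
print(bool(eight_digit.fullmatch('\U00010EAD')))  # True
print(bool(eight_digit.fullmatch('D')))           # False
```

With \s on both sides a stray D rarely matters, because " D " is uncommon in running text, but once the neighbours are \S every capital D inside a word becomes a candidate.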

CodePudding user response：

If you need to find all the dash-like characters that are not surrounded by spaces using Python, first query the UCD for [\p{Dash=Yes}\p{General_Category=Dash_Punctuation}], then build a character class from the 30 Unicode characters it returns. That lets you stay with the re module and its Unicode flag (re.UNICODE, which is already the default for str patterns in Python 3). You can then create the class from those characters and exclude white space on either side:
(?<!\s)[\-֊־᐀᠆‐‑‒–—―⁓⁻₋−⸗⸚⸺⸻⹀⹝〜〰゠︱︲﹘﹣－
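If you would rather build such a class in code than paste it, the following is a rough sketch using only the standard library. It rests on two assumptions. First, unicodedata exposes the General_Category but not the binary Dash property, so this collects only the Pd (Dash_Punctuation) part of the query and misses characters such as U+2212 MINUS SIGN. Second, the pattern above is cut off, so the closing (?!\s) used here is a guess at its intent. The names pd_chars, dash_class and no_space_either_side are made up for the example.

```python
import re
import sys
import unicodedata

# Every code point whose General_Category is Pd (Dash_Punctuation).
# The exact list depends on the Unicode version bundled with Python
# (see unicodedata.unidata_version).
pd_chars = [chr(cp) for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == 'Pd']

# re.escape keeps "-" from being read as a range operator inside the class.
dash_class = '[' + ''.join(re.escape(ch) for ch in pd_chars) + ']'

# Assumed shape of the truncated pattern: no whitespace on either side.
no_space_either_side = re.compile(rf'(?<!\s){dash_class}(?!\s)')

# The hyphen in "a-b" matches; the em dash with spaces around it does not.
print(no_space_either_side.findall("a-b c \u2014 d"))  # ['-']
```

Note that this pattern is stricter than the ones in the other answer: it requires both neighbours to be non-whitespace, so a dash with a space on just one side is not reported.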