Dashes with(out) spaces with python's regex

Time:02-28

What I've managed to do

I'm new to both Python and regex. Using Python's re.compile on a massive number of text files, I wanted to find all kinds of dashes surrounded by spaces. I used:

search.results = re.compile(r'\s[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\u10EAD]\s')

(Yeah, I know about the regex module on PyPI, but I'm trying to stick with what I know better.) It seems to have worked fine: I got all kinds of dash-like characters with spaces around them.

What I'd like to do now

Now I'm trying to do the opposite: find all the dash-like characters that are not surrounded by spaces on both sides (that is, with a space only to the left, a space only to the right, or no spaces around them at all).

What I've tried

So I took the same regex as above and swapped first the \s at the beginning, then the \s at the end, and then both \s-es for \S (to match any character that is not whitespace). Now the regex suddenly seems to have gone mad: it finds all kinds of words rather than the dashes and their neighbouring letters, which is what I expected. I have no idea what's going on.

What went wrong?

CodePudding user response:

To match a single-character pattern b that is not enclosed between two other characters a and c, you can use either of these patterns:

b(?!(?<=a.)c)
(?<!a)b|b(?!c)

where a and c can be the same chars.

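The b(?!(?<=a.)c) pattern matches any b unless it is both immediately preceded by a and immediately followed by c. After b is consumed, the negative lookahead rejects the match only when the lookbehind (?<=a.) succeeds (the two characters before the current position are a and the just-matched b) and c comes next. Here, . is fine to use because all the lookbehind needs is to step back over the b.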

Here, if you wanted to match a normal regular hyphen not in between whitespaces, you could use -(?!(?<=\s.)\s).
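As a quick check of the two pattern shapes, here is a minimal Python sketch that applies both to a plain hyphen. The sample string and the helper name positions are made up for illustration; they are not part of the original answer.

```python
import re

# Two ways to match a hyphen that is NOT enclosed by whitespace on both sides.
ALTERNATION = re.compile(r'(?<!\s)-|-(?!\s)')
LOOKAROUND = re.compile(r'-(?!(?<=\s.)\s)')

# Hypothetical sample: one spaced hyphen, then three that are not fully spaced.
sample = "a - b, c- d, e -f, g-h"

def positions(pattern, text):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in pattern.finditer(text)]

print(positions(ALTERNATION, sample))  # [8, 15, 20]
print(positions(LOOKAROUND, sample))   # [8, 15, 20]
```

Both patterns skip the hyphen in "a - b" and report the other three, so they agree on this input.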

If you put your character class into the pattern, it will look like the patterns below. Note that the last escape is written as \U00010EAD: Python's \u escape takes exactly four hex digits, so \u10EAD would be read as U+10EA followed by a literal D.

(?<!\s)[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\U00010EAD]|[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\U00010EAD](?!\s)

Or

[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\U00010EAD](?!(?<=\s.)\s)

See the regex demo #1 and regex demo #2. The second is more efficient.
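For use from Python, the second pattern could be wired up roughly as follows. This is only a sketch. The names DASHES and NOT_SPACED and the sample text are invented here, the class is the question's list with duplicates removed, and it assumes the question's trailing \u10EAD was meant to be U+10EAD. Because Python's \u escape consumes exactly four hex digits, that code point is written with the eight-digit \U form.

```python
import re

# The question's dash-like characters, deduplicated.
# Assumption: "\u10EAD" in the question was meant as U+10EAD, so it is
# spelled \U00010EAD here (Python's \u takes exactly four hex digits).
DASHES = (
    r'\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2212'
    r'\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0'
    r'\uFE31\uFE32\uFE58\uFE63\uFF0D\U00010EAD'
)

# A dash that is not enclosed by whitespace on both sides.
NOT_SPACED = re.compile('[' + DASHES + r'](?!(?<=\s.)\s)')

# Hypothetical sample: a spaced em dash, an unspaced en dash, a half-spaced hyphen.
text = "wait \u2014 what, pre\u2013war, Dash -x"
print(NOT_SPACED.findall(text))  # the en dash and the hyphen, not the em dash

# The four-digit pitfall: this class contains U+10EA and a literal "D".
print(re.findall(r'[\u10EAD]', 'Dash'))  # ['D']
```

With the four-digit spelling, every capital D would count as a "dash", which is one way such a pattern can start matching ordinary words.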

This technique is also described in the "Matching dots or commas as (not) part of numbers" YT video of mine.

CodePudding user response:

If you need to find all the dash-like characters that are not surrounded by spaces
using Python, you first need to query the UCD for [\p{Dash=Yes}\p{General_Category=Dash_Punctuation}]
and then create a character class from the resulting 30 Unicode characters.
This lets you use the re module, which does not support \p{...} property classes, together with the Unicode regex flag.
You can then build the class and exclude whitespace on either side:
(?<!\s)[\-֊־᐀᠆‐‑‒–—―⁓⁻₋−⸗⸚⸺⸻⹀⹝〜〰゠︱︲﹘﹣-
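Part of that class can be generated from Python's standard library instead of being typed by hand. The sketch below rests on an assumption: unicodedata exposes General_Category (Pd is Dash_Punctuation) but not the Dash property, so characters that are only Dash=Yes, such as U+2212 MINUS SIGN, have to be added manually. The extra list is a hypothetical hand-picked addition, and the resulting set depends on the Unicode version bundled with your Python.

```python
import re
import sys
import unicodedata

# Every code point whose General_Category is Pd (Dash_Punctuation).
pd_chars = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == 'Pd'
]

# Hypothetical hand-picked extras that are Dash=Yes but not Pd.
extra = ['\u2212']  # MINUS SIGN (category Sm)

# Escape each character so "-" cannot be read as a range operator.
dash_class = '[' + ''.join(re.escape(c) for c in pd_chars + extra) + ']'

# A dash with no whitespace on at least one side.
pattern = re.compile(r'(?<!\s)' + dash_class + '|' + dash_class + r'(?!\s)')

print(pattern.findall("a \u2014 b, c\u2014d, 3\u22122"))
```

On this sample the spaced em dash is skipped, while the unspaced em dash and the minus sign are reported.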
