What I've managed to do
I'm new to both Python and regex. Using Python's re.compile on a large number of text files, I wanted to find all kinds of dashes surrounded by spaces. I used:
search.results = re.compile(r'\s[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\u10EAD]\s')
(Yeah, I know about the regex module on PyPI, but I'm trying to use what I know better.)
It seems to have worked fine: I got all kinds of dash-like characters with spaces around them.
What I'd like to do now
Now I'm trying to do the opposite: find all the dash-like characters that are not surrounded by spaces on both sides (that is, with a space only to the left, only to the right, or no spaces around them at all).
What I've tried
So I took the same regex and replaced first the \s at the beginning, then the \s at the end, and then both \s-es with \S (to match any character that is not whitespace). Now the regex suddenly seems to have gone mad: it finds all kinds of words rather than dashes and their neighbouring letters, which is what I expected. I've no idea what's going on.
What went wrong?
CodePudding user response:
To match a specific single-character pattern that is not enclosed between two other characters, you can use a pattern of either of the following types:
b(?!(?<=a.)c)
(?<!a)b|b(?!c)
where a and c can be the same character.
The b(?!(?<=a.)c) pattern matches any b unless it is immediately followed by c and, at the same time, immediately preceded by a. The lookbehind (?<=a.) is checked at the position right after b, so it has to step back over one character (the b itself) before it can test for a. A plain . is fine there, because its only job is to cover the b that has just been matched.
Here, if you wanted to match a regular hyphen that is not in between whitespace characters, you could use
-(?!(?<=\s.)\s)
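As a quick check of the hyphen pattern above, here is a minimal sketch using Python's re module. The sample strings are made up for illustration; the list printed for each one holds the positions of the hyphens that matched.

```python
import re

# A plain hyphen that is NOT enclosed by whitespace on both sides.
# The lookbehind (?<=\s.) is fixed-width (two characters), as re requires.
pattern = re.compile(r'-(?!(?<=\s.)\s)')

# Hypothetical sample strings covering the four possible surroundings.
samples = ["a - b", "a- b", "a -b", "a-b"]

for text in samples:
    positions = [m.start() for m in pattern.finditer(text)]
    print(repr(text), positions)
```

Only the first sample, where the hyphen has whitespace on both sides, yields an empty list.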
If you put the character class of your choice into the pattern (writing the last code point as \U00010EAD, because Python's re reads \u10EAD as \u10EA followed by a literal D), it will look like
(?<!\s)[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\U00010EAD]|[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\U00010EAD](?!\s)
Or
[\u00ad\u2212\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\u002D\u058A\u05BE\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2E3A\u2E3B\uFE58\uFE63\uFF0D\U00010EAD](?!(?<=\s.)\s)
See the regex demo #1 and regex demo #2. The second is more efficient.
This technique is also described in my YouTube video "Matching dots or commas as (not) part of numbers".
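The two full patterns can be compared side by side in Python. The sketch below makes a few assumptions: DASHES is a de-duplicated version of the class from the question, the sample text is invented, and the names alternation and single_pass are only illustrative. It also shows one likely reason the \S variant in the question picked up ordinary words: in Python's re, \u takes exactly four hex digits, so \u10EAD is read as \u10EA followed by a literal D.

```python
import re

# Assumption: a de-duplicated version of the question's class. The last code
# point is written as \U00010EAD, since \u accepts exactly four hex digits.
DASHES = (
    r'[\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2212'
    r'\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0'
    r'\uFE31\uFE32\uFE58\uFE63\uFF0D\U00010EAD]'
)

# Pattern type 1: no whitespace before the dash, OR no whitespace after it.
alternation = re.compile(rf'(?<!\s){DASHES}|{DASHES}(?!\s)')
# Pattern type 2: a dash, unless whitespace sits on both sides of it.
single_pass = re.compile(rf'{DASHES}(?!(?<=\s.)\s)')

# Hypothetical sample: one spaced em dash, three unspaced dashes, a lone "D".
text = 'a \u2014 b, c\u2013d, e\u2010 f, g \u2212h, D'

for pattern in (alternation, single_pass):
    print([(m.start(), f'U+{ord(m.group()):04X}') for m in pattern.finditer(text)])

# The escape pitfall: the four-digit form lets a plain "D" into the class.
print(bool(re.fullmatch(r'[\u10EAD]', 'D')))
print(bool(re.fullmatch(r'[\U00010EAD]', 'D')))
```

Both patterns should report the same three positions for this sample, skipping the em dash that has a space on each side.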
CodePudding user response:
If you need to find all the dash-like characters that are not surrounded by spaces using Python, you first need to query the UCD for [\p{Dash=Yes}\p{General_Category=Dash_Punctuation}] and then create a character class from the 30 Unicode characters it returns. This lets you stay with the re module and its Unicode regex flag.
You can then put that class into a pattern that excludes whitespace on either side:
(?<!\s)[\-֊־᐀᠆‐‑‒–—―⁓⁻₋−⸗⸚⸺⸻⹀⹝〜〰゠︱︲﹘﹣-
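Part of that class can be generated rather than typed by hand. The following is only a sketch under stated assumptions: Python's unicodedata module exposes the General_Category (so Dash_Punctuation, "Pd", can be collected), but it does not expose the binary Dash property, so the four non-Pd characters below are copied manually from the class above. The names pd_chars, extra_dashes and dash_class are illustrative, and the exact set of Pd characters depends on the Unicode version bundled with the interpreter.

```python
import re
import sys
import unicodedata

# Every code point whose General_Category is Pd (Dash_Punctuation) in the
# Unicode database shipped with this Python build.
pd_chars = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == 'Pd'
]

# Assumption: Dash=Yes characters that are not Pd, added by hand
# (swung dash, superscript minus, subscript minus, minus sign).
extra_dashes = ['\u2053', '\u207B', '\u208B', '\u2212']

# re.escape keeps the ASCII hyphen from being read as a range operator.
dash_class = '[' + ''.join(re.escape(c) for c in sorted(set(pd_chars + extra_dashes))) + ']'

# A dash with no whitespace before it, or no whitespace after it.
pattern = re.compile(rf'(?<!\s){dash_class}|{dash_class}(?!\s)')

# Hypothetical sample: a spaced em dash (skipped) and an unspaced en dash.
print(pattern.findall('a \u2014 b c\u2013d'))
```

Scanning the whole code-point range takes a moment, so in practice the class would be built once and the compiled pattern reused across files.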