Regex to match word ending OR beginning with a hyphen

I am trying to create a regex that would remove any word that either starts or ends with a hyphen (not both).

word1- -> remove
-word2 -> remove
sub-word -> keep

My attempt is the following:

def begin_end_hyphen_removal(line):
    return re.sub(r"((\s+|^)(-[A-Za-z]+)(\s+|$))|((\s+|^)([A-Za-z]+-)(\s+|$))","",line)

However, when I try to apply it on the following lines:

here are some word sub-words -word1 word2- sub-word2 word3- -word4
-word5 example
word6-
word7-
another one -word8
-word9

I get the input back unchanged.

CodePudding user response：

You can use either of these patterns (the second is shorter, but \w also matches the underscore):

r'\b(?<!-)[A-Za-z0-9]+-\B|\B-[A-Za-z0-9]+\b(?!-)'
r'\b(?<!-)\w+-\B|\B-\w+\b(?!-)'

See the regex demo. Details:

\b(?<!-)\w+-\B - one or more word chars that are not preceded by - and then a - char that is either at the end of the string or before a non-word char
| - or
\B-\w+\b(?!-) - a - that is either at the start of the string or after a non-word char, and then one or more word chars that are not followed by -.
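
To make the two branches concrete, here is a minimal sketch that checks a few tokens one by one. The token list is made up for illustration, and re.fullmatch is used here only to test each token in isolation; it is not part of the answer's approach.

```python
import re

# The \w-based pattern described above.
rx = re.compile(r'\b(?<!-)\w+-\B|\B-\w+\b(?!-)')

# Hypothetical sample tokens, chosen to exercise each case.
tokens = ['word1-', '-word2', 'sub-word', '-some-', 'plain']

# A token is a removal candidate only if the pattern matches all of it.
flags = {t: rx.fullmatch(t) is not None for t in tokens}
print(flags)
```

Only word1- and -word2 should be flagged: sub-word has its hyphen between two word chars, and -some- is rejected by the (?!-) lookahead.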

See the Python demo:

import re
rx = re.compile( r' *(?:\b(?<!-)\w+-\B|\B-\w+\b(?!-))' )
text = 'here are -some- word sub-words -word1 word2- sub-word2 word3- -word4\n-word5 example\nword6-\nword7-\nanother one -word8\n-word9'
print( rx.sub('', text) )

Output:

here are -some- word sub-words sub-word2
 example


another one

CodePudding user response：

import re

pattern = r"(?=\S*['-])([a-zA-Z'-]+)"
test_string = '''here are some word sub-words -word1 word2- sub-word2 word3- -word4
-word5 example
word6-
word7-
another one -word8
-word9'''
result = re.findall(pattern, test_string)
print(result)

CodePudding user response：

You could match runs of word characters that are either preceded or followed by a -.

If a word contains inner hyphens and also starts or ends with a hyphen, for example sugar-free-, and you want to remove it as well, use:

(?<!\S)(?:-\w+(?:-\w+)*|\w+(?:-\w+)*-)(?!\S)

In parts, the pattern matches:

(?<!\S) Whitespace boundary to the left
(?: Non capture group
- -\w+(?:-\w+)* Match - and word chars, optionally repeated by - and word chars
- | Or
- \w+(?:-\w+)*- Match word chars, optionally repeated by - and word chars, then a closing -
) Close non capture group
(?!\S) Whitespace boundary to the right

See a regex demo and a Python demo.
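
As a rough illustration of the breakdown above, the sketch below runs the pattern over a made-up line (the sample text is an assumption, not taken from the question) and lists what it matches.

```python
import re

# Whitespace-boundary pattern from this answer.
ws = re.compile(r'(?<!\S)(?:-\w+(?:-\w+)*|\w+(?:-\w+)*-)(?!\S)')

# Hypothetical sample line, used only for illustration.
line = 'sugar-free- -low-fat sub-word -a-'

# No capture groups, so findall returns the full matches.
matches = ws.findall(line)
print(matches)
```

The multi-part words sugar-free- and -low-fat are matched as a whole, while sub-word and -a- (hyphen on both sides) are left alone.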

Note that the pattern you tried uses \s, which can also match a newline.

If you don't want to remove the newlines, you can use [^\S\n]* instead of \s*.

Example

import re

def begin_end_hyphen_removal(line):
    return re.sub(r"\s*(?<!\S)(?:-\w+(?:-\w+)*|\w+(?:-\w+)*-)(?!\S)", "", line)


s = ("here are some word sub-words -word1 word2- sub-word2 word3- -word4\n"
     "-word5 example\n"
     "word6-\n"
     "word7-\n"
     "another one -word8\n"
     "-word9")
print(begin_end_hyphen_removal(s))

Output

here are some word sub-words sub-word2 example
another one
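
For comparison, here is a sketch of the newline-preserving variant mentioned earlier, with [^\S\n]* in place of \s*. The function name and the shortened sample string are assumptions made for this illustration.

```python
import re

def begin_end_hyphen_removal_keep_newlines(line):
    # [^\S\n]* consumes leading whitespace except newlines,
    # so removed words never pull two lines together.
    return re.sub(r"[^\S\n]*(?<!\S)(?:-\w+(?:-\w+)*|\w+(?:-\w+)*-)(?!\S)", "", line)

# Hypothetical, shortened sample input.
s = "some sub-words -word1 word2-\n-word5 example\nword6-\nanother one -word8"
result = begin_end_hyphen_removal_keep_newlines(s)
print(result)
```

With this variant the line count stays the same: a line that held only a removed word becomes an empty line instead of disappearing.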