I am trying to create a regex that would remove any word that either starts or ends with a hyphen (not both).
word1- -> remove
-word2 -> remove
sub-word -> keep
My attempt is the following:
import re
def begin_end_hyphen_removal(line):
    return re.sub(r"((\s+|^)(-[A-Za-z]+)(\s+|$))|((\s+|^)([A-Za-z]+-)(\s+|$))", "", line)
However, when I try to apply it on the following lines:
here are some word sub-words -word1 word2- sub-word2 word3- -word4
-word5 example
word6-
word7-
another one -word8
-word9
I get the input back unchanged as output.
CodePudding user response:
You can use
r'\b(?<!-)[A-Za-z0-9]+-\B|\B-[A-Za-z0-9]+\b(?!-)'
or, with \w in place of the explicit character class:
r'\b(?<!-)\w+-\B|\B-\w+\b(?!-)'
See the regex demo. Details:
\b(?<!-)\w+-\B - one or more word chars that are not preceded with -, and then a - char that is either at the end of string or before a non-word char
| - or
\B-\w+\b(?!-) - a - that is either at the start of string or after a non-word char, and then one or more word chars that are not followed with -.
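As a quick check of the two alternatives described above, the sketch below runs the \w version of the pattern against single tokens. The token list and the name `matches` are assumptions made for illustration; they are not part of the original demo.

```python
import re

# The \w variant of the pattern from the answer above.
rx = re.compile(r'\b(?<!-)\w+-\B|\B-\w+\b(?!-)')

# Hypothetical sample tokens chosen for illustration.
tokens = ['word1-', '-word2', 'sub-word', '-some-']

# Map each token to whether the pattern finds a match inside it.
matches = {token: bool(rx.search(token)) for token in tokens}

for token, hit in matches.items():
    print(f'{token!r}: {"remove" if hit else "keep"}')
```

Only tokens with a hyphen on exactly one side should be reported as matches, so sub-word and -some- are expected to be kept.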
See the Python demo:
import re
rx = re.compile(r' *(?:\b(?<!-)\w+-\B|\B-\w+\b(?!-))')
text = 'here are -some- word sub-words -word1 word2- sub-word2 word3- -word4\n-word5 example\nword6-\nword7-\nanother one -word8\n-word9'
print(rx.sub('', text))
Output (blank lines and leading spaces omitted):
here are -some- word sub-words sub-word2
example
another one
CodePudding user response:
import re
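# The lookahead requires a ' or - somewhere in the rest of the token;
# the group then captures a run of letters, apostrophes and hyphens.
pattern = r"(?=\S*['-])([a-zA-Z'-]+)"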
test_string = '''here are some word sub-words -word1 word2- sub-word2 word3- -word4
-word5 example
word6-
word7-
another one -word8
-word9'''
result = re.findall(pattern, test_string)
print(result)
CodePudding user response:
You could match one or more word characters that are either preceded or followed by a hyphen.
If you also want to remove hyphen-separated words that start or end with a hyphen, for example sugar-free-, you can use:
(?<!\S)(?:-\w+(?:-\w+)*|\w+(?:-\w+)*-)(?!\S)
In parts, the pattern matches:
(?<!\S) - Whitespace boundary to the left
(?: - Non capture group
-\w+(?:-\w+)* - Match - and word chars, optionally repeated by - and word chars
| - Or
\w+(?:-\w+)*- - Match word chars, optionally repeated by - and word chars, and then a -
) - Close non capture group
(?!\S) - Whitespace boundary to the right
See a regex demo and a Python demo.
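To see how the whitespace boundaries and the two alternatives treat individual words, here is a minimal sketch. The token list and the names `removed` and `kept` are illustrative assumptions, not taken from the linked demos.

```python
import re

# Pattern from the answer above.
rx = re.compile(r'(?<!\S)(?:-\w+(?:-\w+)*|\w+(?:-\w+)*-)(?!\S)')

# Hypothetical sample tokens chosen for illustration.
tokens = ['sugar-free-', '-sugar-free', 'word1-', 'sub-word', '-some-']

# A token counts as "removed" when the pattern matches it.
removed = [t for t in tokens if rx.search(t)]
kept = [t for t in tokens if not rx.search(t)]

print('removed:', removed)
print('kept:', kept)
```

Words with a hyphen on both sides, such as -some-, fit neither alternative and are left alone, which is in line with the "not both" requirement in the question.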
Note that the pattern you tried uses \s, which can also match a newline.
If you don't want to remove the newlines, you can use [^\S\n]* instead of \s*.
Example
import re
def begin_end_hyphen_removal(line):
    # Remove words that start or end with a hyphen, plus any whitespace before them
    return re.sub(r"\s*(?<!\S)(?:-\w+(?:-\w+)*|\w+(?:-\w+)*-)(?!\S)", "", line)
s = ("here are some word sub-words -word1 word2- sub-word2 word3- -word4\n"
"-word5 example\n"
"word6-\n"
"word7-\n"
"another one -word8\n"
"-word9")
print(begin_end_hyphen_removal(s))
Output
here are some word sub-words sub-word2 example
another one
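For comparison, the newline-preserving variant mentioned in the note about [^\S\n]* could look roughly like the sketch below. The function name `begin_end_hyphen_removal_keep_newlines` is a hypothetical name introduced here; the input is the same sample text.

```python
import re

def begin_end_hyphen_removal_keep_newlines(text):
    # [^\S\n]* matches whitespace other than a newline, so line breaks survive.
    return re.sub(r"[^\S\n]*(?<!\S)(?:-\w+(?:-\w+)*|\w+(?:-\w+)*-)(?!\S)", "", text)

s = ("here are some word sub-words -word1 word2- sub-word2 word3- -word4\n"
     "-word5 example\n"
     "word6-\n"
     "word7-\n"
     "another one -word8\n"
     "-word9")

result = begin_end_hyphen_removal_keep_newlines(s)
print(result)
```

With this variant, lines that held only a removed word become empty instead of disappearing, and a space can remain at the start of a line (as before example), so a follow-up strip per line may still be wanted.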