I am trying to write a regular expression that removes all non-alphanumeric characters from a string, except for those that are surrounded by alphanumeric characters.
For example, consider the following three cases.
1. it's -> it's
2. its. -> its
3. It's a: beautiful day? I'm =sure it is. The coca-cola (is frozen right?
   -> It's a beautiful day I'm sure it is The coca-cola is frozen right
I am using Python's re module, and can match the opposite of what I am looking for with the following expression.
(?<=[a-zA-Z])[^a-zA-Z ](?=[a-zA-Z])
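To make "the opposite" concrete, here is a quick check of what that expression matches on the third example. The names `KEEP`, `text`, and `kept` are only illustrative.

```python
import re

# The expression above: a non-letter, non-space character
# that has a letter directly on each side.
KEEP = re.compile(r"(?<=[a-zA-Z])[^a-zA-Z ](?=[a-zA-Z])")

text = "It's a: beautiful day? I'm =sure it is. The coca-cola (is frozen right?"

# These are the characters I want to keep, not the ones I want to remove.
kept = KEEP.findall(text)
print(kept)  # ["'", "'", '-']
```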
Any ideas?
CodePudding user response:
Use
[^a-zA-Z\s](?!(?<=[a-zA-Z].)[a-zA-Z])
EXPLANATION
PATTERN | DETAILS |
---|---|
[^a-zA-Z\s] | non-letter and non-whitespace |
(?!(?<=[a-zA-Z].)[a-zA-Z]) | unmatch if followed and preceded with letter |
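A minimal sketch of how this pattern might be applied with `re.sub`, run against the three examples from the question. The names `STRIP` and `clean` are hypothetical helpers, not part of the `re` module.

```python
import re

# Pattern from the answer above: match a non-letter, non-whitespace character
# unless it has a letter directly before it and directly after it.
STRIP = re.compile(r"[^a-zA-Z\s](?!(?<=[a-zA-Z].)[a-zA-Z])")

def clean(text: str) -> str:
    """Delete every character matched by STRIP."""
    return STRIP.sub("", text)

print(clean("it's"))   # it's
print(clean("its."))   # its
print(clean("It's a: beautiful day? I'm =sure it is. The coca-cola (is frozen right?"))
# It's a beautiful day I'm sure it is The coca-cola is frozen right
```

Because the character class only excludes letters and whitespace, digits are treated like punctuation here: they are removed unless they sit directly between two letters.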
CodePudding user response:
If the alphanumeric characters can also be word characters as matched by \w (which includes the underscore), you can use word boundaries:
[^a-zA-Z\s](?<!\b.\b)
Explanation
[^a-zA-Z\s]
Match a single character other than a letter or a whitespace character
(?<!\b.\b)
Negative lookbehind: assert that the character just matched is not surrounded by word boundaries
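As a rough check, the word-boundary version can be run through `re.sub` on the examples from the question. The names `STRIP_WB` and `clean_wb` are made up for this sketch.

```python
import re

# Word-boundary variant: match a non-letter, non-whitespace character
# unless there is a word boundary on both sides of it.
STRIP_WB = re.compile(r"[^a-zA-Z\s](?<!\b.\b)")

def clean_wb(text: str) -> str:
    """Delete every character matched by STRIP_WB."""
    return STRIP_WB.sub("", text)

print(clean_wb("it's"))   # it's
print(clean_wb("its."))   # its
print(clean_wb("It's a: beautiful day? I'm =sure it is. The coca-cola (is frozen right?"))
# It's a beautiful day I'm sure it is The coca-cola is frozen right
```

The lookbehind is one character wide, so it satisfies the fixed-width requirement that Python's `re` module places on lookbehinds.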
Another alternative, using a case-insensitive match (re.IGNORECASE), leaves the character alone when it has a letter A-Z or a digit on both its left and its right:
[^a-zA-Z\s](?<![A-Z\d].(?=[A-Z\d]))
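A sketch of this last pattern in use, assuming it is compiled with `re.IGNORECASE` so that `[A-Z\d]` also covers lowercase letters. The names `STRIP_CI` and `clean_ci` are hypothetical.

```python
import re

# Case-insensitive variant: match a non-letter, non-whitespace character
# unless it has a letter or digit directly on both sides.
STRIP_CI = re.compile(r"[^a-zA-Z\s](?<![A-Z\d].(?=[A-Z\d]))", re.IGNORECASE)

def clean_ci(text: str) -> str:
    """Delete every character matched by STRIP_CI."""
    return STRIP_CI.sub("", text)

print(clean_ci("it's"))   # it's
print(clean_ci("its."))   # its
print(clean_ci("It's a: beautiful day? I'm =sure it is. The coca-cola (is frozen right?"))
# It's a beautiful day I'm sure it is The coca-cola is frozen right
```

The lookahead nested inside the lookbehind is zero-width, so the lookbehind still has a fixed width of two characters. Note that the first character class still matches digits themselves, so a digit is only kept when it has a letter or digit on both sides.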