Match all punctuation not surrounded by alphanumeric characters?

I am trying to write a regular expression that removes all non-alphanumeric characters from a string, except for those that are surrounded by alphanumeric characters.

Consider the following three examples.

it's -> it's

its. -> its

It's a: beautiful day? I'm =sure it is. The coca-cola (is frozen right?

-> It's a beautiful day I'm sure it is The coca-cola is frozen right

I am using Python's re module, and can match the opposite of what I am looking for with the following expression.

(?<=[a-zA-Z])[^a-zA-Z ](?=[a-zA-Z])
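
As a quick check of what this expression picks out, here is a minimal sketch that runs it over the third example from the question. The names KEEP and kept are only illustrative.

```python
import re

# The expression from the question: a non-letter, non-space character
# with a letter directly on each side.
KEEP = re.compile(r"(?<=[a-zA-Z])[^a-zA-Z ](?=[a-zA-Z])")

text = "It's a: beautiful day? I'm =sure it is. The coca-cola (is frozen right?"

# These are the characters that should survive the cleanup,
# i.e. the opposite of what needs to be removed.
kept = KEEP.findall(text)
print(kept)
```

It finds the two apostrophes and the hyphen, so the task is to match every other non-letter, non-space character instead.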

Any ideas?

CodePudding user response：

Use

[^a-zA-Z\s](?!(?<=[a-zA-Z].)[a-zA-Z])

EXPLANATION

PATTERN	DETAILS
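`[^a-zA-Z\s]`	any single character that is neither a letter nor whitespace
`(?!(?<=[a-zA-Z].)[a-zA-Z])`	fail the match if that character is both preceded and followed by a letter (the lookbehind steps back over the matched character to the one before it)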
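
A minimal sketch of how this pattern could be applied with re.sub, using the examples from the question. The helper name strip_stray and the digit-aware variant STRAY_ALNUM are assumptions added for illustration and are not part of the answer.

```python
import re

# Pattern from the answer: a character that is neither a letter nor
# whitespace, unless it sits directly between two letters.
STRAY = re.compile(r"[^a-zA-Z\s](?!(?<=[a-zA-Z].)[a-zA-Z])")

# Assumed variant (not from the answer): also treat digits as alphanumeric.
STRAY_ALNUM = re.compile(r"[^a-zA-Z\d\s](?!(?<=[a-zA-Z\d].)[a-zA-Z\d])")


def strip_stray(text, pattern=STRAY):
    """Remove every match of `pattern` (hypothetical helper name)."""
    return pattern.sub("", text)


sentence = "It's a: beautiful day? I'm =sure it is. The coca-cola (is frozen right?"
for sample in ("it's", "its.", sentence):
    print(repr(sample), "->", repr(strip_stray(sample)))

# Digits are not letters, so the answer's pattern deletes them as well.
print(repr(strip_stray("route 66!")))
# The assumed variant keeps the digits and drops only the "!".
print(repr(strip_stray("route 66!", STRAY_ALNUM)))
```

The question asks about alphanumeric characters, but the pattern only treats letters as such. If digits should be kept, adding \d to each character class, as in the assumed variant, is one way to do it.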

CodePudding user response：

If the characters on either side may be any word character as matched by \w (letters, digits, and the underscore), you can use word boundaries:

[^a-zA-Z\s](?<!\b.\b)

Explanation

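[^a-zA-Z\s] Match a single character other than a letter or a whitespace character
(?<!\b.\b) Negative lookbehind: assert that the character just matched is not surrounded by word boundaries (for a punctuation character, that means it does not have a word character on both sides)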
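
A short sketch of the word-boundary pattern run against the examples from the question. The names WB_STRAY and results are illustrative only.

```python
import re

# Word-boundary pattern from the answer. The lookbehind is evaluated
# right after the matched character and re-reads that one character.
WB_STRAY = re.compile(r"[^a-zA-Z\s](?<!\b.\b)")

sentence = "It's a: beautiful day? I'm =sure it is. The coca-cola (is frozen right?"

results = [WB_STRAY.sub("", sample) for sample in ("it's", "its.", sentence)]
for cleaned in results:
    print(cleaned)
```

On these inputs the output matches the expected results in the question. Python's re module accepts \b inside a lookbehind because the boundary is zero-width, so the lookbehind still has a fixed width of one character.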

Or another alternative, using a case-insensitive match, that skips any character with a letter A-Z or a digit on both its left and its right:

[^a-zA-Z\s](?<![A-Z\d].(?=[A-Z\d]))
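
Because the lookbehind lists only A-Z, this pattern depends on the case-insensitive flag. The sketch below compares it with and without re.IGNORECASE. The names PATTERN, CI_STRAY and CASE_SENSITIVE are illustrative.

```python
import re

PATTERN = r"[^a-zA-Z\s](?<![A-Z\d].(?=[A-Z\d]))"

# With IGNORECASE, lowercase neighbours satisfy [A-Z\d] as well.
CI_STRAY = re.compile(PATTERN, re.IGNORECASE)
# Without the flag, only uppercase letters and digits count as neighbours.
CASE_SENSITIVE = re.compile(PATTERN)

sentence = "It's a: beautiful day? I'm =sure it is. The coca-cola (is frozen right?"

print(CI_STRAY.sub("", sentence))
print(CI_STRAY.sub("", "it's"))        # apostrophe is kept
print(CASE_SENSITIVE.sub("", "it's"))  # apostrophe is removed
```

If passing a flag is inconvenient, writing the classes as [a-zA-Z\d] inside the lookbehind should give the same behaviour without it.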
