Suppose I have a text like the one below:
Lorem-Ipsum is simply dummy text of-the printing and typesetting industry. abc123-xyz 1abcc-xy-ef apple.pear-banana asdddd-abc-cba
I want to replace each hyphen (-) with a space when it appears inside a whitespace-delimited token that consists only of letters and hyphens ([a-zA-Z-]+). So, the result should be:
Lorem Ipsum is simply dummy text of the printing and typesetting industry. abc123-xyz 1abcc-xy-ef apple.pear-banana asdddd abc cba
I tried:
\b(?<=[a-zA-Z]+)\-(?=[a-zA-Z]+)\b
This is not valid since lookbehind does not allow quantifiers, and I guess even if it worked it wouldn't cover all scenarios.
Is there a way to use variable length lookbehinds, or is there any other way for this case?
Edit: Using Python re library
CodePudding user response:
You can use
re.sub(r'(?<!\S)[a-zA-Z]+(?:-[a-zA-Z]+)+(?!\S)', lambda x: x.group().replace('-', ' '), text)
The regex matches whitespace-separated, letter-only words that contain at least one - character. Then, all hyphens inside each match are replaced with spaces.
See the regex demo. Details:
(?<!\S) - left-hand whitespace boundary
[a-zA-Z]+ - one or more ASCII letters
(?:-[a-zA-Z]+)+ - one or more occurrences of a - char followed by one or more ASCII letters
(?!\S) - right-hand whitespace boundary.
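The breakdown above can also be written directly into the pattern. The following is a minimal sketch using re.VERBOSE, assuming the quantified form of the pattern, (?<!\S)[a-zA-Z]+(?:-[a-zA-Z]+)+(?!\S); the names HYPHENATED_WORD and dehyphenate are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern, spread out with re.VERBOSE so each part carries its own comment.
# HYPHENATED_WORD and dehyphenate are hypothetical names used only in this sketch.
HYPHENATED_WORD = re.compile(r"""
    (?<!\S)            # left-hand whitespace boundary: start of string or after whitespace
    [a-zA-Z]+          # one or more ASCII letters
    (?:-[a-zA-Z]+)+    # one or more groups of a hyphen followed by ASCII letters
    (?!\S)             # right-hand whitespace boundary: end of string or before whitespace
""", re.VERBOSE)

def dehyphenate(text):
    # Hyphens are replaced only inside tokens matched by the pattern.
    return HYPHENATED_WORD.sub(lambda m: m.group().replace("-", " "), text)

print(dehyphenate("of-the abc123-xyz asdddd-abc-cba"))
# of the abc123-xyz asdddd abc cba
```

Tokens such as abc123-xyz are left alone because the letters-only part cannot reach the hyphen, and no later position inside the token satisfies the left-hand boundary.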
Replace [a-zA-Z] with [^\W\d_] to match words made of any Unicode letters.
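As a rough sketch of that Unicode variant: for str patterns in Python 3, [^\W\d_] reads as "a word character that is neither a decimal digit nor an underscore", which is approximately "any letter". The sample string and the names UNICODE_WORD and dehyphenate_unicode are assumptions made up for this illustration.

```python
import re

# [^\W\d_]: a word character that is not a decimal digit and not an underscore.
# UNICODE_WORD and dehyphenate_unicode are hypothetical names for this sketch.
UNICODE_WORD = re.compile(r'(?<!\S)[^\W\d_]+(?:-[^\W\d_]+)+(?!\S)')

def dehyphenate_unicode(text):
    # Replace hyphens only inside whitespace-delimited, letters-and-hyphens tokens.
    return UNICODE_WORD.sub(lambda m: m.group().replace('-', ' '), text)

print(dehyphenate_unicode("привет-мир αβγ-δ тест-1"))
# привет мир αβγ δ тест-1
```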
See the Python demo:
import re
text = r"Lorem-Ipsum is simply dummy text of-the printing and typesetting industry. abc123-xyz 1abcc-xy-ef apple.pear-banana asdddd-abc-cba"
print(re.sub(r'(?<!\S)[a-zA-Z]+(?:-[a-zA-Z]+)+(?!\S)', lambda x: x.group().replace('-', ' '), text))
Output:
Lorem Ipsum is simply dummy text of the printing and typesetting industry. abc123-xyz 1abcc-xy-ef apple.pear-banana asdddd abc cba
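On the variable-length lookbehind part of the question: the standard re module only accepts lookbehinds of fixed width, which is why the approach above matches the whole token and fixes the hyphens in a callback instead. The third-party regex package on PyPI is known to accept variable-length lookbehinds, but it is not used here. The small check below is only a sketch; the helper name compiles is made up for illustration.

```python
import re

def compiles(pattern):
    # Return True if the standard re module accepts the pattern, False otherwise.
    try:
        re.compile(pattern)
        return True
    except re.error as exc:
        print("rejected:", exc)
        return False

# Variable-width lookbehind (note the + inside it): rejected by re.
print(compiles(r'(?<=[a-zA-Z]+)-(?=[a-zA-Z]+)'))

# Fixed-width lookbehind: accepted, but it only inspects one character
# on each side of the hyphen, so it cannot check the whole token.
print(compiles(r'(?<=[a-zA-Z])-(?=[a-zA-Z])'))
```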