Regex alphabetic only variable length positive/negative lookbehind

Time:12-10

Suppose that I have a text like below:

Lorem-Ipsum is simply dummy text of-the printing and typesetting industry. abc123-xyz 1abcc-xy-ef apple.pear-banana asdddd-abc-cba

I want to replace each - with a space when it occurs inside a whitespace-delimited token that consists only of letters and hyphens ([a-zA-Z-]), that is, only letters and - from the preceding whitespace to the following whitespace. So, the result should be:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. abc123-xyz 1abcc-xy-ef apple.pear-banana asdddd abc cba

I tried:

\b(?<=[a-zA-Z]+)\-(?=[a-zA-Z]+)\b

This is not valid, since a lookbehind does not allow quantifiers (it must have a fixed width), and I guess that even if it worked it wouldn't cover all scenarios.

Is there a way to use variable length lookbehinds, or is there any other way for this case?

Edit: Using Python re library
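
For context, here is a minimal sketch of how the attempted pattern fails in Python's re module. It assumes the standard-library re (not the third-party regex package, which handles lookbehinds differently); the variable names are illustrative only.

```python
import re

# The pattern from the question, with a + quantifier inside the lookbehind.
attempt = r'\b(?<=[a-zA-Z]+)\-(?=[a-zA-Z]+)\b'

try:
    re.compile(attempt)
    error_message = None
except re.error as exc:
    # re refuses to compile lookbehinds whose width is not fixed.
    error_message = str(exc)

print(error_message)
```

The pattern is rejected at compile time, before any text is searched, so the problem lies in the pattern itself and not in the input.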

Answer:

You can use

re.sub(r'(?<!\S)[a-zA-Z]+(?:-[a-zA-Z]+)+(?!\S)', lambda x: x.group().replace('-', ' '), text)

The regex matches whitespace-separated letter-only words with at least one - in them. Then, all hyphens are replaced with spaces inside the matches.

Regex details:

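  • (?<!\S) - left-hand whitespace boundary (start of string, or a position preceded by whitespace)
  • [a-zA-Z]+ - one or more ASCII letters
  • (?:-[a-zA-Z]+)+ - one or more occurrences of a - char and then one or more ASCII letters
  • (?!\S) - right-hand whitespace boundary (end of string, or a position followed by whitespace).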
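
The breakdown above can also be written as a commented pattern using re.VERBOSE. This is only a sketch: the name WORD_WITH_HYPHENS and the sample tokens are made up for illustration, and the pattern itself is the same one as in the answer.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
WORD_WITH_HYPHENS = re.compile(r"""
    (?<!\S)            # left boundary: start of string or preceded by whitespace
    [a-zA-Z]+          # first run of ASCII letters
    (?:-[a-zA-Z]+)+    # one or more groups of a hyphen followed by letters
    (?!\S)             # right boundary: end of string or followed by whitespace
""", re.VERBOSE)

samples = ["of-the", "asdddd-abc-cba", "abc123-xyz",
           "1abcc-xy-ef", "apple.pear-banana", "plain"]

# Keep only the tokens that the pattern matches in full.
matched = [s for s in samples if WORD_WITH_HYPHENS.fullmatch(s)]
print(matched)  # ['of-the', 'asdddd-abc-cba']
```

Tokens containing a digit or a dot are rejected because every character between the two whitespace boundaries must be a letter or a hyphen.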

Replace [a-zA-Z] with [^\W\d_] to match words made of any Unicode letters, not just ASCII ones.
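
A minimal sketch of that Unicode variant, assuming Python 3 str patterns (where \w is Unicode-aware by default). The helper name dehyphenate and the sample text are hypothetical, chosen only for illustration.

```python
import re

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# which leaves letters of any script.
UNICODE_WORDS = re.compile(r'(?<!\S)[^\W\d_]+(?:-[^\W\d_]+)+(?!\S)')

def dehyphenate(text):
    """Replace hyphens with spaces inside letter-only, whitespace-delimited words."""
    return UNICODE_WORDS.sub(lambda m: m.group().replace('-', ' '), text)

print(dehyphenate("альфа-бета x1-y αβγ-δε"))  # альфа бета x1-y αβγ δε
```

The token x1-y is left untouched because it contains a digit, mirroring how abc123-xyz is handled by the ASCII version.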

See the Python demo:

import re

text = r"Lorem-Ipsum is simply dummy text of-the printing and typesetting industry. abc123-xyz 1abcc-xy-ef apple.pear-banana asdddd-abc-cba"
# Match whitespace-delimited words made only of letters and hyphens, then swap the hyphens for spaces
print(re.sub(r'(?<!\S)[a-zA-Z]+(?:-[a-zA-Z]+)+(?!\S)', lambda x: x.group().replace('-', ' '), text))

Output:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. abc123-xyz 1abcc-xy-ef apple.pear-banana asdddd abc cba