How to merge hyphenated words with regex in Notepad?

Time:09-21

I have numerous OCR-ed texts with hyphenated words in the middle of lines.

Example: This is a text with a hyphen- ated word in the middle of the sentence. But it also has - dashes - like the ones in the second sentence. The latter should not be modified.

I would like to have a cleaned text like the one below where the hyphenated words are merged:

This is a text with a hyphenated word in the middle of the sentence. But it also has - dashes - like the ones in the second sentence. The latter should not be modified.

The regex -\s*\r?\n\s*\r?\n? removes the hyphen and merges hyphenated words when the hyphen is at the end of a line. How can I modify it to also handle hyphens in the middle of a line, as in the example above? The number of spaces after the hyphen can be 1, 2 or 3, like hyphen- ated, hyphen-  ated, hyphen-   ated.

CodePudding user response:

You can look for a non-space (the end of a word) followed by -:

([^\s])(-\s*)

Then simply replace with $1 to leave the last character of the word intact.

Here is a working example on regex101.com:
https://regex101.com/r/V0mmBH/1
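
For anyone who wants to try the same substitution outside the editor, here is a minimal Python sketch. The function name `merge_hyphenated` and the sample string are made up for illustration. Python's `re.sub` writes the first group as `\1` rather than `$1`.

```python
import re

# Pattern from the answer above: a non-space character, then a hyphen
# followed by any amount of whitespace (including none).
PATTERN = re.compile(r"([^\s])(-\s*)")


def merge_hyphenated(text: str) -> str:
    """Drop the hyphen and trailing whitespace, keeping the character before it."""
    # \1 is Python's spelling of the editor's $1 replacement.
    return PATTERN.sub(r"\1", text)


sample = (
    "This is a text with a hyphen- ated word in the middle of the sentence. "
    "But it also has - dashes - like the ones in the second sentence."
)
print(merge_hyphenated(sample))
```

Note that because `\s*` also matches zero whitespace, a hyphen inside a word (for example `OCR-ed`) is removed as well, which may or may not be what you want.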

CodePudding user response:

In Notepad you can use this pattern and replace with an empty string:

[^\s-]\K-\s{1,3}

The pattern matches:

  • [^\s-] Match a single char other than - or a whitespace char
  • \K Forget what is matched so far
  • -\s{1,3} Match - and 1-3 whitespace chars to be removed
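
As a rough cross-check outside the editor, the same idea can be sketched in Python. Python's built-in `re` module does not support `\K`, so this sketch swaps in a lookbehind, `(?<=[^\s-])`, which should behave the same way for this pattern. The function name `merge_hyphenated_strict` is made up for illustration.

```python
import re

# Assumption: (?<=[^\s-]) stands in for "[^\s-]\K", because Python's re
# module has no \K. The lookbehind checks the preceding character without
# making it part of the match, so only "-" plus 1-3 whitespace chars is removed.
PATTERN = re.compile(r"(?<=[^\s-])-\s{1,3}")


def merge_hyphenated_strict(text: str) -> str:
    """Remove a hyphen plus 1-3 whitespace chars when it directly follows a word."""
    return PATTERN.sub("", text)


print(merge_hyphenated_strict("a hyphen-  ated word, with - dashes - and OCR-ed text"))
```

Since at least one whitespace character is required after the hyphen, in-word hyphens such as `OCR-ed` are left alone, and the free-standing dashes are skipped because they are preceded by a space.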


Another variant matches a single whitespace char and asserts a single char other than - or a whitespace char to the right:

[^\s-]\K-\s(?=[^\s-])


Or with the 1-3 quantifier and the lookahead:

[^\s-]\K-\s{1,3}(?=[^\s-])
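
To see what the lookahead adds, here is a small Python sketch of this last variant. As an assumption, `\K` is again replaced with a lookbehind because Python's `re` module does not support it, and the function name `merge_hyphenated_guarded` is hypothetical.

```python
import re

# Assumption: lookbehind used in place of \K (not available in Python's re).
# The lookahead requires a non-space, non-hyphen character right after the
# 1-3 whitespace chars, so the hyphen is removed only between two word parts.
PATTERN = re.compile(r"(?<=[^\s-])-\s{1,3}(?=[^\s-])")


def merge_hyphenated_guarded(text: str) -> str:
    """Merge 'hyphen- ated' style splits, leaving other hyphen uses untouched."""
    return PATTERN.sub("", text)


for line in ("a hyphen-  ated word", "word- - next", "ends with hyphen- "):
    print(repr(merge_hyphenated_guarded(line)))
```

With the lookahead, a hyphen followed by another hyphen or by the end of the text is left unchanged, whereas the variant without it would still strip the hyphen in those cases.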