I'm cleaning up some ancient HTML help files that contain a lot of double spaces, and I'd like to replace each double space with a single one.
Sample: <li><b>% Successful</b>.  The percentage  of jobs that returned a confirmation.</li>
I want to find double spaces only before the start of a sentence or between words (so after the setting label, or between 'percentage' and 'of'), not spaces in isolation or before the XML tags.
I tried a simple search for two spaces, but that also brings up the tab/space mixture that the creator used for formatting the indents, so I'm getting five useless results for every relevant one.
Is there a single regex that would handle both use cases, or is it better to use a separate one for each? I'm fine either way; I'm just still pretty new to regexes and not sure where to start on this one.
Spaces between periods and text (three here for a better visual):
<li><b>Submit Time</b>.   The time the job was scheduled.</li>
Spaces between words (three here as well):
<li><b>End Time</b>. The date/time when the job was completed or canceled.</li>
I'm looking for multiple spaces either between words or at the start of sentences (generally after the setting name period).
CodePudding user response:
You can use the following regex to match runs of two or more spaces that sit between visible text, skipping both the indentation and any spaces next to an XML tag:
(?<=[^\s>])[ ]{2,}(?=[^\s<])
Replace each match with a single space.
Explanation:
(?<=[^\s>])
- is a positive lookbehind that asserts that the preceding character is neither whitespace nor >. This skips the indentation (where the spaces start the line or follow a tab) and spaces that come directly after a tag. Punctuation such as a period is allowed here, so the spaces before the start of a sentence still match.
[ ]{2,}
- matches two or more consecutive literal space characters. Using a literal space rather than \s keeps tabs and newlines out of the match, and the brackets only make the space visible.
(?=[^\s<])
- is a positive lookahead that asserts that the next character is neither whitespace nor <. This ensures that the spaces are followed by text rather than by more whitespace or by the start of a tag.
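If you want to check a pattern against your sample lines before running it over the real files, a short script helps. The following is a minimal sketch in Python using the pattern `(?<=[^\s>])[ ]{2,}(?=[^\s<])`, which matches literal spaces only and looks at the characters on both sides. The function name `collapse_spaces` and the sample lines are illustrative assumptions, not something taken from your files.

```python
import re

# Two or more literal spaces, preceded by a character that is neither
# whitespace nor '>', and followed by one that is neither whitespace nor '<'.
DOUBLE_SPACE = re.compile(r"(?<=[^\s>])[ ]{2,}(?=[^\s<])")


def collapse_spaces(line: str) -> str:
    """Replace each matching run of spaces with a single space."""
    return DOUBLE_SPACE.sub(" ", line)


# Hypothetical sample lines modelled on the ones in the question.
samples = [
    "\t  <li><b>% Successful</b>.  The percentage  of jobs that returned a confirmation.</li>",
    "<li><b>Submit Time</b>.   The time the job was scheduled.</li>",
    "    <li>Indented  text</li>",
]

for sample in samples:
    print(repr(collapse_spaces(sample)))
```

The tab/space indentation at the start of each line is left alone, because those spaces are not preceded by a visible character. Most editors that support lookbehind in their regex search should accept the same pattern, with a single space as the replacement.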