Eliminate whitespace around single letters

Time:12-19

I frequently receive PDFs that, when converted with pdftotext, contain spaces between the letters of arbitrary words:

This i s a n example t e x t that c o n t a i n s strange spaces.

For further automated processing (looking for specific words) I would like to remove all whitespace between "standalone" letters (single-letter words), so the result would look like this:

This isan example text that contains strange spaces.

I tried to achieve this with a simple Perl regex:

s/ (\w) (\w) / $1$2 /g

This of course does not work: once the first and second standalone letters have been matched and joined, the space after the second has already been consumed, so the second and third letters can no longer match:

This is a n example te x t that co n ta i ns strange spaces.
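The behaviour described above can be reproduced with a short script. This is only a sketch; the sub name naive_join and the variable $text are made up for illustration and are not part of the question.

```perl
use strict;
use warnings;

# Sample line from the question.
my $text = 'This i s a n example t e x t that c o n t a i n s strange spaces.';

# naive_join is a hypothetical helper name wrapping the first attempt.
sub naive_join {
    my ($s) = @_;
    # Each match consumes the space after its second letter, so that space
    # is no longer available as the leading space of the next match.
    $s =~ s/ (\w) (\w) / $1$2 /g;
    return $s;
}

print naive_join($text), "\n";
# prints: This is a n example te x t that co n ta i ns strange spaces.
```

Because /g matches never overlap, only pairs of letters are joined, which is exactly the result shown above.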

So I tried lookahead assertions, but failed to achieve anything (also because I did not find any example that uses them in a substitution).

As usual with PRE, my feeling is that there must be a very simple and elegant solution for this...

CodePudding user response:

Just match a continuous series of single letters separated by spaces, then delete all spaces from that using a nested substitution (the /e eval modifier).

s{\b ((\w\s)+ \w) \b}{ my $s = $1; $s =~ s/ //g; $s }xge;
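As a runnable sketch of this nested-substitution idea, the following wraps it in a sub and applies it to the sample line from the question. The names join_letter_runs, $run and $text are assumptions for illustration, and a non-capturing group with a + quantifier is used so that runs of any length are matched.

```perl
use strict;
use warnings;

# Sample line from the question.
my $text = 'This i s a n example t e x t that c o n t a i n s strange spaces.';

# join_letter_runs is a hypothetical helper name.
sub join_letter_runs {
    my ($s) = @_;
    # Outer match: one or more "word character + whitespace" pairs followed
    # by a final word character, bounded by word boundaries. Under /x the
    # blanks inside the pattern are ignored.
    # Replacement (/e): copy the run, delete its spaces, return the result.
    $s =~ s{\b ((?:\w\s)+ \w) \b}{ my $run = $1; $run =~ s/ //g; $run }xge;
    return $s;
}

print join_letter_runs($text), "\n";
# prints: This isan example text that contains strange spaces.
```

Copying $1 into a lexical before running the inner substitution matters, since the inner match would otherwise reset the capture variables.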

CodePudding user response:

Excess whitespace can be removed with a regex, but Perl by itself cannot know what is correct English. With that caveat, this seems to work:

$ perl -pe's/(?<!\S)(\S) (?=\S )/$1/g' spaces.txt
This isan example text that contains strange spaces.

Note that i s a n cannot be distinguished from a normal four-letter word; that requires human correction or some language module.

Explanation:

  • (?<!\S) is a negative look-behind assertion: it checks that the preceding character is not a non-whitespace character (so it is whitespace or the start of the string).
  • (\S) then matches one non-whitespace character, which we capture with parentheses. It is followed by a literal space, which we remove (or rather, do not put back).
  • (?=\S ) is a look-ahead assertion: it checks that what follows is a non-whitespace character followed by a space. It does not consume or change that part of the string.
  • Finally, $1 puts back the character we captured.
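The pieces explained above can also be written out with the /x modifier, so that each part carries its own comment. This is a sketch only; the sub name join_single_chars and the variable $text are assumptions for illustration.

```perl
use strict;
use warnings;

# Sample line from the question.
my $text = 'This i s a n example t e x t that c o n t a i n s strange spaces.';

# join_single_chars is a hypothetical helper name.
sub join_single_chars {
    my ($s) = @_;
    $s =~ s{
        (?<!\S)     # look-behind: no non-whitespace character before this point
        (\S)        # one non-whitespace character, kept as capture group 1
        [ ]         # the literal space that gets dropped
        (?=\S[ ])   # look-ahead: one non-whitespace character, then a space
    }{$1}xg;
    return $s;
}

print join_single_chars($text), "\n";
# prints: This isan example text that contains strange spaces.
```

One consequence of the look-ahead requiring a trailing space: a standalone letter at the very end of the input is not joined, so 'x y z' becomes 'xy z' with this pattern.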

It might be more correct to use [^ ] instead of \S. Since you only seem to have a problem with inserted spaces, there is no need to match tabs, newlines or other whitespace. Feel free to make that change if you feel it is appropriate.
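A sketch of that [^ ] variant, assuming the hypothetical sub name join_space_only, might look like this:

```perl
use strict;
use warnings;

# Sample line from the question.
my $text = 'This i s a n example t e x t that c o n t a i n s strange spaces.';

# join_space_only is a hypothetical helper name.
sub join_space_only {
    my ($s) = @_;
    # Same structure as the \S version, but only the space character
    # counts as a separator; tabs and newlines are ordinary characters.
    $s =~ s/(?<![^ ])([^ ]) (?=[^ ] )/$1/g;
    return $s;
}

print join_space_only($text), "\n";
# prints: This isan example text that contains strange spaces.
```

On the sample line both versions give the same result. They differ when other whitespace is involved: with [^ ], a letter directly after a tab is no longer treated as standalone, because the tab counts as a regular character in the look-behind.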
