I frequently receive PDFs that contain (when converted with pdftotext) whitespace between the letters of some arbitrary words:
This i s a n example t e x t that c o n t a i n s strange spaces.
For further automated processing (looking for specific words) I would like to remove all whitespace between "standalone" letters (single-letter words), so the result would look like this:
This isan example text that contains strange spaces.
I tried to achieve this with a simple perl regex:
s/ (\w) (\w) / $1$2 /g
Which of course does not work: once the first and second standalone letters have been matched, the space after the second one has already been consumed, so the space before the third letter no longer matches:
This is a n example te x t that co n ta i ns strange spaces.
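For reference, the behaviour can be reproduced with the short script below (a minimal sketch; the variable names are only illustrative). With /g, the search resumes after the end of the previous match, so a space consumed by one match cannot serve as the leading space of the next.

```perl
use strict;
use warnings;

my $text = "This i s a n example t e x t that c o n t a i n s strange spaces.";

# Naive attempt: both the leading and the trailing space are part of each match,
# so every other gap between single letters is skipped.
(my $naive = $text) =~ s/ (\w) (\w) / $1$2 /g;

print "$naive\n";
```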
So I tried lookahead assertions, but failed to achieve anything (also because I did not find any example that uses them in a substitution).
As usual with Perl regexes, my feeling is that there must be a very simple and elegant solution for this...
CodePudding user response:
Just match a continuous series of single letters separated by spaces, then delete all spaces from that using a nested substitution (the /e eval modifier).
s{\b ((\w\s)+ \w) \b}{ my $s = $1; $s =~ s/ //g; $s }xge;
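A minimal, self-contained sketch of this approach, assuming the pattern carries a + quantifier so that the whole run of single letters is matched in one go. The sub name join_single_letters is only illustrative, and the inner substitution here strips any whitespace rather than just spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: joins runs of single word characters that are
# separated by one whitespace character each.
sub join_single_letters {
    my ($text) = @_;
    # One or more "letter + whitespace" pairs followed by a final letter,
    # delimited by word boundaries. The x modifier ignores the spaces in the
    # pattern; the e modifier evaluates the replacement as Perl code.
    $text =~ s{\b ((?:\w\s)+ \w) \b}{ (my $run = $1) =~ s/\s//g; $run }xge;
    return $text;
}

print join_single_letters("This i s a n example t e x t that c o n t a i n s strange spaces."), "\n";
# prints: This isan example text that contains strange spaces.
```

Ordinary words are left alone because \b cannot match between two adjacent word characters, so a run can only start and end on a standalone letter.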
CodePudding user response:
Excess whitespace can be removed with a regex, but Perl by itself cannot know what is correct English. With that caveat, this seems to work:
$ perl -pe's/(?<!\S)(\S) (?=\S )/$1/g' spaces.txt
This isan example text that contains strange spaces.
Note that i s a n cannot be distinguished from a normal four-letter word; that requires human correction or some language module.
Explanation:
- (?<!\S): a negative look-behind assertion checks that the character behind is not a non-whitespace.
- (\S): next must follow a non-whitespace, which we capture with parens, followed by a whitespace, which we will remove (or not put back, as it were).
- (?=\S ): next we check with a look-ahead assertion that what follows is a non-whitespace followed by a whitespace. We do not change the string there.
- Then put back the character we captured with $1.
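The steps above can also be written out as a commented pattern. This is only a sketch: the sub name squeeze_single_chars is made up for illustration, and the x modifier is used so that each part can carry its own comment.

```perl
use strict;
use warnings;

# Illustrative helper; the name is an assumption, the pattern is the one explained above.
sub squeeze_single_chars {
    my ($line) = @_;
    $line =~ s{
        (?<!\S)     # nothing but whitespace (or the line start) directly before
        (\S)        # one non-whitespace character, captured so it can be put back
        [ ]         # the space to drop, written as a class because x ignores bare spaces
        (?=\S[ ])   # ahead: another single character followed by a space (not consumed)
    }{$1}gx;
    return $line;
}

print squeeze_single_chars("This i s a n example t e x t that c o n t a i n s strange spaces."), "\n";
# prints: This isan example text that contains strange spaces.
```

Because the look-ahead requires a literal space after the next character, a run that ends right at the end of a line keeps its last gap ("t e x t" followed by a newline becomes "tex t"). If that matters for your input, (?=\S(?!\S)) in place of (?=\S ) is one possible way to cover it.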
It might be more correct to use [^ ] instead of \S. Since you only seem to have a problem with spaces being inserted, there is no need to match tabs, newlines or other whitespace. Feel free to make that change if you feel it is appropriate.
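A sketch of that variant, with every \S replaced by [^ ] so that only literal spaces count as separators (the sub name squeeze_spaces_only is hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper: same substitution, but only the space character
# is treated as a separator; tabs and newlines count as ordinary characters.
sub squeeze_spaces_only {
    my ($line) = @_;
    # (?<![^ ])  preceded by a space or the start of the string
    # ([^ ])     one non-space character, captured, then the space to drop
    # (?=[^ ] )  followed by one more non-space character and a space
    $line =~ s/(?<![^ ])([^ ]) (?=[^ ] )/$1/g;
    return $line;
}

print squeeze_spaces_only("This i s a n example t e x t that c o n t a i n s strange spaces."), "\n";
# prints: This isan example text that contains strange spaces.
```

On the example sentence both patterns give the same result; they only differ on input where tabs or other non-space whitespace sit next to single characters.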