Is there a way to remove unwanted spaces from a string using Python or some NLP technique? (NOT tra

s = "Over 20 years, this investment is cost neutral as it is covered by a modest ‚comfort ch arge™ Œ less than the equivalent energy bills would have been Œ based on the well -proven EnergieSprong model. Capital Budget Rather than speculatively invest ing in commercial property, for which the business case is unclear, we propose that the Council j oin the growing ranks of local authorities developing new solar farms. This meets our policy objectives and provides a modest, but secure, return (net of borrowing). The £51m we propose to invest (similar to the amount originally intended for commercial pr operty)"

This text was scraped from a web PDF using basic Python and the PyPDF library.

I want to remove the unwanted spaces inside the bold words, such as "ch arge", "invest ing", "j oin" and "pr operty".

Note: I made them bold manually just to explain my problem. I would appreciate it if someone could help. Thanks a lot in advance!

CodePudding user response：

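This function removes the space inside the first occurrence of a broken word:

def remove_space_in_word(text, word):
    # Locate the first occurrence of the broken word, e.g. "pr operty"
    index = text.find(word)
    if index == -1:
        return text  # word not present, nothing to fix
    parts = word.split(" ")
    # Length of the fragment that precedes the unwanted space
    part1_len = len(parts[0])
    # Keep everything up to the space, then skip the space itself
    return text[:index + part1_len] + text[index + part1_len + 1:]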
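A quick usage sketch for this function, restated with explicit `+` operators so the snippet runs on its own. The sample sentence is made up for illustration. Because `str.find` only locates the first match, a second occurrence is left untouched:

```python
def remove_space_in_word(text, word):
    index = text.find(word)
    if index == -1:
        return text  # word not found, return the text unchanged
    part1_len = len(word.split(" ")[0])
    return text[:index + part1_len] + text[index + part1_len + 1:]

# Hypothetical sample containing the broken word twice
sample = "commercial pr operty and more pr operty"
fixed = remove_space_in_word(sample, "pr operty")
print(fixed)
# commercial property and more pr operty
```

To fix every occurrence this way, the function would have to be called repeatedly until the text stops changing.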

CodePudding user response：

The simple manual method

If you have already identified that 'pr operty' tends to be written with an extra space, here is a simple function that will remove whitespace from all occurrences of pr operty:

def remove_whitespace_in_word(text, word):
    return text.replace(word, ''.join(word.split()))

s = "The pr operty. Over 20 years of pr operty, this investment is cost neutral as it is covered by a modest ‚comfort ch arge™ Œ less than the equivalent energy bills would have been Œ based on the well -proven EnergieSprong model. Capital Budget Rather than speculatively invest ing in commercial property, for which the business case is unclear, we propose that the Council j oin the growing ranks of local authorities developing new solar farms. This meets our pr operty policy objectives and provides a modest, but secure, return (net of borrowing). The £51m we propose to invest in pr operty (similar to the amount originally intended for commercial pr operty)"

new_text = remove_whitespace_in_word(s, 'pr operty')

print(new_text)
# 'The property. Over 20 years of property, this investment is cost neutral as it is covered by a modest ‚comfort ch arge™ Œ less than the equivalent energy bills would have been Œ based on the well -proven EnergieSprong model. Capital Budget Rather than speculatively invest ing in commercial property, for which the business case is unclear, we propose that the Council j oin the growing ranks of local authorities developing new solar farms. This meets our property policy objectives and provides a modest, but secure, return (net of borrowing). The £51m we propose to invest in property (similar to the amount originally intended for commercial property)'

You only need to call it once to fix all occurrences of "pr operty", but you need to call it again for every other offending word, such as "ch arge".
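If several broken words have been identified, the repeated calls can be wrapped in a loop. This is only a minimal sketch: `broken_words`, `fix_known_words` and the sample sentence are hypothetical names and data chosen for illustration, and the word list would have to be compiled by hand from your own text.

```python
def remove_whitespace_in_word(text, word):
    return text.replace(word, ''.join(word.split()))

# Hypothetical list of broken words spotted by hand in the scraped text
broken_words = ['pr operty', 'ch arge', 'invest ing', 'j oin']

def fix_known_words(text, words):
    # Apply the single-word fix once per known broken word
    for word in words:
        text = remove_whitespace_in_word(text, word)
    return text

sample = "invest ing in commercial pr operty; the Council j oin the ranks; comfort ch arge"
print(fix_known_words(sample, broken_words))
# investing in commercial property; the Council join the ranks; comfort charge
```

Keep in mind that `str.replace` matches plain substrings, so a very short fragment such as 'j oin' could in principle also match across the boundary of two unrelated words. Checking the result by eye is still advisable.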

The complicated automated method

Here is a proposed algorithm. It's not perfect, but should deal with many errors:

Load a data structure holding all known English words, for instance the dictionary of Scrabble words.
Look for words in your text that are not in the dictionary.
Try to fix each offending word by merging it with the adjacent word that comes before or the adjacent word that comes after.
When attempting to merge, there are several possibilities. If the adjacent word is also offending and merging them produces a known word, it is likely a good fit. If the adjacent word is not offending but merging them still produces a known word, it may still be a good fit. If merging them does not produce a known word, it is probably not a good fit.
Generate a log of all the fixes that were performed, so that a user can read the log and make sure that the fixes look legit. Generating a log is really important; you don't want your algorithm to edit the text without keeping a trace of what was edited.
You could even do an interactive step, where the computer proposes a fix but waits for the user to validate it. When the user validates a fix, memorise it so that if another fix is identical, the user doesn't need to be asked again. For instance if there are several occurrences of "pr operty" in the text, you only need to ask confirmation once.
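The steps above could be sketched roughly as follows. This is an illustration under several assumptions, not a finished implementation: the tiny `dictionary` set stands in for a real word list (for example one loaded from a Scrabble word file), the function names are made up, tokens are split on whitespace only, and the "likely"/"maybe" labels are just one way of recording how confident a merge is.

```python
import string

def normalise(token):
    """Strip surrounding punctuation and lowercase, for dictionary lookup."""
    return token.strip(string.punctuation).lower()

def is_known(token, dictionary):
    core = normalise(token)
    # Tokens that are not purely alphabetic (numbers, symbols) are left alone
    return not core.isalpha() or core in dictionary

def merge_split_words(text, dictionary):
    tokens = text.split()  # note: this collapses runs of whitespace
    result, log = [], []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if not is_known(token, dictionary):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            prev = result[-1] if result else None
            # First try merging with the word that comes after
            if nxt is not None and normalise(token + nxt) in dictionary:
                confidence = 'maybe' if is_known(nxt, dictionary) else 'likely'
                log.append((token + ' ' + nxt, token + nxt, confidence))
                result.append(token + nxt)
                i += 2
                continue
            # Otherwise try merging with the word that comes before
            if prev is not None and normalise(prev + token) in dictionary:
                confidence = 'maybe' if is_known(prev, dictionary) else 'likely'
                log.append((prev + ' ' + token, prev + token, confidence))
                result[-1] = prev + token
                i += 1
                continue
        result.append(token)
        i += 1
    return ' '.join(result), log

# Hypothetical miniature dictionary; a real one would hold a full word list
dictionary = {"we", "propose", "to", "join", "the", "ranks",
              "invest", "investing", "in", "property"}

sample = "we propose to j oin the ranks, invest ing in pr operty"
fixed, log = merge_split_words(sample, dictionary)
print(fixed)
# we propose to join the ranks, investing in property
for original, merged, confidence in log:
    print(original, '->', merged, '(' + confidence + ')')
# j oin -> join (likely)
# invest ing -> investing (maybe)
# pr operty -> property (likely)
```

An interactive variant could show each proposed merge to the user before applying it and cache the answers in a dict keyed by the original fragment, so that a repeated "pr operty" is only confirmed once.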