Python split text without spaces but keep dates as they are

Time:12-06

To split text that has no spaces, one can use wordninja (see "How to split text without spaces into list of words"). Here is the code that does the job:

sent = "Test12  to separate mergedwords butkeeprest asitis, say 1/2/2021 or 1.2.2021."

import wordninja
print(' '.join(wordninja.split(sent)))

output: Test 12 to separate merged words but keep rest as it is say 1 2 2021 or 1 2 2021

wordninja works well for splitting merged text like this. My question is how to split text without spaces while keeping the dates (and punctuation) as they are. The ideal output would be:

Test 12 to separate merged words but keep rest as it is, say 1/2/2021 or 1.2.2021

Your help is much appreciated!

CodePudding user response:

I finally arrived at the following code, based on the comments under my post (thanks for the comments):

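import re
import wordninja

sent = "Test12  to separate mergedwords butkeeprest asitis, say 1/2/2021 or 1.2.2021."
# Replace the comma with a space so that "asitis," becomes a purely alphanumeric token.
sent = re.sub(",", " ", sent)
# Run wordninja only on alphanumeric tokens; tokens such as 1/2/2021 are left untouched.
corrected = ' '.join([' '.join(wordninja.split(w)) if w.isalnum() else w for w in sent.split(" ")])
print(corrected)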

output: Test 12  to separate merged words but keep rest as it is say 1/2/2021 or 1.2.2021.

It is not a straightforward solution, and it drops the comma, but it works.
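The comma has to be removed first because str.isalnum() returns False for any token that contains punctuation. That is the same check that protects the dates. A small illustration, using a shortened sample string chosen only for this sketch:

```python
# Tokens are produced the same way as in the answer: split on single spaces.
tokens = "asitis, say 1/2/2021 or 1.2.2021.".split(" ")

# True means the token would be passed to wordninja; False means it is kept as is.
flags = {tok: tok.isalnum() for tok in tokens}
for tok, flag in flags.items():
    print(repr(tok), flag)
```

Here "asitis," is flagged False because of its trailing comma, so without the re.sub step it would be left unsplit.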

CodePudding user response:

The idea here is to split the string into a list at every date, then iterate over that list, keeping the items that match the date pattern and calling wordninja.split() on everything else. Finally, recombine the list with join.

import re

# Stand-in for wordninja.split(); it only marks where the splitter would run.
def foo(text):
    return ' ninja '

string = "Test12  to separate mergedwords butkeeprest asitis, say 1/2/2021 or 1.2.2021."
pattern = re.compile(r'([0-9]{1,2}[/.][0-9]{1,2}[/.][0-9]{1,4})')

# Split the string up by things matching our pattern, preserve rest of string.
string_isolated_dates = re.split(pattern, string)

# Apply the splitter to everything that doesn't match the date pattern, then join it all together.
# To use wordninja, replace foo(el) in the next line with ' '.join(wordninja.split(el)).
wordninja_applied = ''.join([el if pattern.match(el) else foo(el) for el in string_isolated_dates])

print(wordninja_applied)

Output:

 ninja 1/2/2021 ninja 1.2.2021 ninja

Note: I replaced wordninja.split() with foo() only because I did not want to install another NLP library. The code still demonstrates how to modify the original string while preserving the dates.
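As a rough sketch of how the two answers could be combined so that both dates and punctuation survive, the splitter can be applied only to runs of letters and digits through a re.sub callback. The names split_keep_dates, TOKEN and toy_splitter are hypothetical and used only for illustration. toy_splitter is a stand-in that merely separates letters from digits, so the example runs without wordninja installed:

```python
import re

# Assumed date shape (same as the pattern in the answer above): d/m/y or d.m.y
DATE = r'\d{1,2}[/.]\d{1,2}[/.]\d{1,4}'
# Try the date alternative first, then any run of letters and digits.
TOKEN = re.compile(rf'(?P<date>{DATE})|(?P<word>[A-Za-z0-9]+)')

def split_keep_dates(text, splitter):
    """Apply splitter to alphanumeric runs only; dates and punctuation pass through."""
    def repl(match):
        if match.group('date'):
            return match.group('date')
        return ' '.join(splitter(match.group('word')))
    return TOKEN.sub(repl, text)

def toy_splitter(word):
    # Hypothetical stand-in for wordninja.split: only separates letters from digits.
    return re.findall(r'[A-Za-z]+|\d+', word)

result = split_keep_dates("Test12 to, say 1/2/2021 or 1.2.2021.", toy_splitter)
print(result)
# prints: Test 12 to, say 1/2/2021 or 1.2.2021.
```

Passing wordninja.split as the splitter argument should also split the merged words, since it takes a string and returns a list of words. That variant was not run here.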
