To split text that has no spaces, one can use wordninja; see "How to split text without spaces into list of words". Here is the code that does the job:
sent = "Test12 to separate mergedwords butkeeprest asitis, say 1/2/2021 or 1.2.2021."
import wordninja
print(' '.join(wordninja.split(sent)))
output: Test 12 to separate merged words but keep rest as it is say 1 2 2021 or 1 2 2021
wordninja looks great and works well for splitting merged text. My question is how to split text without spaces while keeping the dates (and punctuation) as they are. An ideal output would be:
Test 12 to separate merged words but keep rest as it is, say 1/2/2021 or 1.2.2021
Your help is much appreciated!
CodePudding user response:
I finally arrived at the following code, based on the comments under my post (thanks for the comments):
import re
import wordninja
sent = "Test12 to separate mergedwords butkeeprest asitis, say 1/2/2021 or 1.2.2021."
sent = re.sub(","," ",sent)
corrected = ' '.join([' '.join(wordninja.split(w)) if w.isalnum() else w for w in sent.split(" ")])
print(corrected)
output: Test 12 to separate merged words but keep rest as it is say 1/2/2021 or 1.2.2021.
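It is not a straightforward solution, but it works. Note that it drops the comma, because the comma is replaced with a space before the text is split.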
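To see why the comma has to be handled before splitting, here is a small sketch that prints what `str.isalnum()` returns for each space-separated token of the example sentence. The `flags` name is only for illustration. Tokens containing `/`, `.` or `,` are not alphanumeric, so the list comprehension above passes them through untouched, which protects the dates but would also leave `asitis,` unsplit.

```python
sent = "Test12 to separate mergedwords butkeeprest asitis, say 1/2/2021 or 1.2.2021."

# Map each space-separated token to whether it would be sent to wordninja.
flags = {w: w.isalnum() for w in sent.split(" ")}

for w, ok in flags.items():
    print(f"{w!r:15} isalnum={ok}")
```

Only tokens flagged `True` are handed to `wordninja.split()`; everything else is copied as-is.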
CodePudding user response:
The idea here is to split the string into a list at every instance of a date, then iterate over that list, preserving the items that match the date pattern and calling wordninja.split() on everything else. Then recombine the list with join.
import re
def foo(text):
    # Stand-in for wordninja.split(); returns a fixed marker so the effect is visible.
    return ' ninja '
string = "Test12 to separate mergedwords butkeeprest asitis, say 1/2/2021 or 1.2.2021."
pattern = re.compile(r'([0-9]{1,2}[/.][0-9]{1,2}[/.][0-9]{1,4})')
# Split the string up by things matching our pattern, preserve rest of string.
string_isolated_dates = re.split(pattern, string)
# Apply foo to everything that doesn't match our date pattern, then join it all together. To use wordninja, replace foo(el) in the next line with ' '.join(wordninja.split(el)), since wordninja.split() returns a list.
wordninja_applied = ''.join([el if pattern.match(el) else foo(el) for el in string_isolated_dates])
print(wordninja_applied)
Output:
ninja 1/2/2021 ninja 1.2.2021 ninja
Note: I replaced your function wordninja.split() with foo() just because I don't feel like downloading yet another NLP library. But my code demonstrates modifying the original string while preserving the dates.
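Building on the date-pattern idea above, here is a minimal sketch that also keeps punctuation in place. It is not taken from either answer: `split_keeping_dates` and `toy_split` are hypothetical names, and `toy_split` is only a stand-in that separates letters from digits so the example runs without wordninja installed. With wordninja available, one would presumably pass `wordninja.split` as the second argument instead.

```python
import re

# Dates such as 1/2/2021 or 1.2.2021 (same shape as the pattern used above).
DATE = r'\d{1,2}[/.]\d{1,2}[/.]\d{1,4}'
# A token is a date, a run of letters/digits, or any other single character.
TOKEN = re.compile(rf'({DATE})|([A-Za-z0-9]+)|(.)', re.DOTALL)

def split_keeping_dates(text, split_words):
    """Apply split_words to alphanumeric runs only; copy dates and punctuation."""
    out = []
    for date, word, other in TOKEN.findall(text):
        if word:
            # split_words must return a list of strings, like wordninja.split().
            out.append(' '.join(split_words(word)))
        else:
            out.append(date or other)
    return ''.join(out)

def toy_split(word):
    # Hypothetical stand-in for wordninja.split(): splits letters from digits only.
    return re.findall(r'[A-Za-z]+|\d+', word)

result = split_keeping_dates("Test12, say 1/2/2021 or 1.2.2021.", toy_split)
print(result)
```

Because every character that is not part of a date or an alphanumeric run is copied through unchanged, commas, periods and the original spacing survive. One limitation of this sketch: a date glued directly onto preceding letters or digits (for example `a1/2/2021`) is not recognized as a date.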