I have a large list of paragraphs, each with a different length and number of sentences, e.g.:
Blah. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Blah blah blah. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Blah blah.
I want to remove all short sentences, say every sentence with fewer than 4 words, so that the result is:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
What would be the most efficient way to do it?
CodePudding user response:
Here is a solution you can use. Note that it drops words shorter than 4 characters rather than whole sentences. I hope this helps.
test = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
new = ""
words = test.split()
for item in words:
    # skip any word shorter than 4 characters
    if len(item) < 4:
        continue
    # append the kept word followed by a space
    new += item + " "
print(new)
CodePudding user response:
A short solution, which drops every word of three characters or fewer, would be:
old_string = "Blah. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Blah blah blah. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Blah blah."
new_string = ' '.join([w for w in old_string.split() if len(w) > 3])  # words of 3 characters or fewer do not appear in the new string
print(new_string)
CodePudding user response:
Assuming that sentences are separated only by a dot (.):
text = "Blah. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Blah blah blah. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Blah blah."
# split on dots, keep sentences with at least 4 words, then restore the dot and the space between sentences
new_text = ' '.join([x.strip() + '.' for x in text.split('.') if len(x.split()) >= 4])
print(new_text)
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
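If sentences can also end with ! or ?, and you need to process a whole list of paragraphs as in the question, the same idea can be sketched with a regular expression. This is only a rough sketch: the helper name drop_short_sentences and the min_words parameter are made up for illustration, and the split is naive, so abbreviations such as "e.g." would be treated as sentence ends.

```python
import re

# Split on whitespace that directly follows ., ! or ? (a naive sentence boundary).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def drop_short_sentences(paragraph, min_words=4):
    """Keep only the sentences that contain at least min_words words.

    Hypothetical helper for illustration, not part of any library.
    """
    sentences = _SENTENCE_END.split(paragraph.strip())
    return " ".join(s for s in sentences if len(s.split()) >= min_words)


paragraphs = [
    "Blah. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Blah blah blah.",
    "Ut enim ad minim veniam, quis nostrud exercitation. Blah blah!",
]

# Apply the filter to every paragraph in the list.
cleaned = [drop_short_sentences(p) for p in paragraphs]
print(cleaned)
```

Compiling the pattern once and reusing it keeps the per-paragraph work down to one split and one pass over the sentences, which should be enough for a large list. For text with abbreviations or decimal numbers, a proper sentence tokenizer would likely be more reliable than this pattern.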