How to extract only sentences from some texts in Python?

Time:01-27

I have a text that is in the following form:

document = ("Hobby: I like going to the mountains.\n Something (where): To Everest mountain.\n\n "
            "The reason: I want to go because I like nature.\n Activities: I'd like to go hiking and admiring the beauty of the nature. ")

I want to extract only the sentences from this text, without the labels "Hobby:", "Something (where):", "The reason:". For example, "To Everest mountain" would not count, since it is not a full sentence.

The idea is to get rid of whatever is followed by ":" at the beginning of a "sentence" (Hobby:, The reason:), no matter what is written before the ":", and then extract only the sentences from what remains.

I'd appreciate any idea.

CodePudding user response:

You can just use the split() method. First, split the text by \n, then check whether each line contains ": " and, if it does, append the second part of the line split by ": " to the final list. Here is the code:

document = "Hobby: I like going to the mountains.\n Something (where): To Everest mountain.\n\n The reason: I want to go because I like nature.\n Activities: I'd like to go hiking and admiring the beauty of the nature. "
sentences = []
for element in document.split("\n"):
    # Check for ": " (with the space) so that split(": ")[1] always exists
    if ": " in element:
        # Index 1 is the text right after the label
        sentences.append(element.split(": ")[1])
print(*sentences, sep="\n")

And the output will be:

I like going to the mountains.
To Everest mountain.
I want to go because I like nature.
I'd like to go hiking and admiring the beauty of the nature. 

But if the sentence itself can also contain ": ", use the following code:

document = "Hobby: I like go: ing to the mountain: s.\n Something (where): To Everest mountain.\n\n The reason: I want to go because I like nature.\n Activities: I'd like to go hiking and admiring the beauty of the nature. "
sentences = []
for element in document.split("\n"):
    if ": " in element:
        # Drop the label (first piece) and keep every remaining piece
        sentences.append(element.split(": ")[1:])
for line in sentences:
    # Rejoin the pieces with the separator that split() removed
    print(": ".join(line))

Output:

I like go: ing to the mountain: s.
To Everest mountain.
I want to go because I like nature.
I'd like to go hiking and admiring the beauty of the nature. 
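
A possible shortcut that covers both cases, offered as a sketch rather than as part of the original answer: split() accepts a maxsplit argument, so splitting at most once on ": " leaves any later ": " inside the sentence untouched. The helper name strip_labels is made up for illustration, and the strip() call is an added assumption that surrounding spaces are not wanted.

```python
def strip_labels(text):
    """Return the text after the first ": " of every line that contains one."""
    sentences = []
    for line in text.split("\n"):
        # maxsplit=1 splits only at the first ": ", so later ones stay in the sentence
        parts = line.split(": ", 1)
        if len(parts) == 2:
            sentences.append(parts[1].strip())
    return sentences


document = "Hobby: I like go: ing to the mountain: s.\n Something (where): To Everest mountain.\n\n The reason: I want to go because I like nature."
for sentence in strip_labels(document):
    print(sentence)
```

Lines without ": " (including the empty line) are skipped, because splitting them yields a single piece.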

Hope that helped!

CodePudding user response:

If the text is structured so that each sentence is on its own line, parsing it with a regex might be feasible.

As the other answer mentioned, use the split() function to separate the lines.

lines = document.split("\n")

With that you can apply regex to each line:

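import re

sentences = []

for line in lines:
    # Label: letters, digits and whitespace from the start of the line up to the first ":"
    # (the range is A-Z; A-z would also match characters such as [ \ ] ^ _ `)
    result = re.search(r"^[a-zA-Z0-9\s]*:(.*)", line)
    if not result:
        continue
    # group(1) is everything after the ":"; strip() drops the surrounding spaces
    sentences.append(result.group(1).strip())

print(sentences)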

To check out what the regex does, visit a website such as https://regex101.com/

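In short: it matches letters, digits and whitespace up to the first ":" and then captures everything after it. The ^ is crucial here, as it anchors the match at the start of the line; this way a ":" later in the line is not mistaken for the end of the label. Note that a label containing any other character, such as the parentheses in "Something (where):", does not match, so that line is skipped.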
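
To see the anchoring in action, here is a minimal sketch that runs one pattern over the whole text instead of splitting it first. It is a variation on the regex above, not the same one: re.MULTILINE makes ^ and $ match at every line boundary, and the whitespace class is narrowed to spaces and tabs so a label cannot run across a newline. The names LABEL_PATTERN and extract_after_label are made up for illustration.

```python
import re

# Label: letters, digits, spaces and tabs from the start of a line up to the first ":".
# The lazy group plus the trailing [ \t]*$ trims spaces at both ends of the captured text.
LABEL_PATTERN = re.compile(r"^[a-zA-Z0-9 \t]*:[ \t]*(.*?)[ \t]*$", re.MULTILINE)


def extract_after_label(text):
    """Return the text following the label on every line that starts with one."""
    return [match.group(1) for match in LABEL_PATTERN.finditer(text)]


document = "Hobby: I like going to the mountains.\n Something (where): To Everest mountain.\n\n The reason: I want to go because I like nature.\n Activities: I'd like to go hiking and admiring the beauty of the nature. "
for sentence in extract_after_label(document):
    print(sentence)
```

With this character class, the "Something (where):" line is not matched because of the parentheses, so three of the four labelled lines are returned.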
