How to extract specific words from a string?

Time:07-15

I need to split the words of a string into two lists: one containing the stop words, and one containing the rest.

text = 'he is the best when people in our life'

stopwords = ['he', 'the', 'our']

contains_stopwords = []
normal_words = []
for i in text.split():
    for j in stopwords:
        if i in j:
            contains_stopwords.append(i)
        else:
            normal_words.append(i)
if text.split() in stopwords:
    contains_stopwords.append(text.split())
else:
    normal_words.append(text.split())

print("contains_stopwords:", contains_stopwords)
print("normal_words:", normal_words)

Output:

contains_stopwords: ['he', 'he', 'the', 'our']
normal_words: ['he', 'is', 'is', 'is', 'the', 'the', 'best', 'best', 'best', 'when', 'when', 'when', 'people', 'people', 'people', 'in', 'in', 'in', 'our', 'our', 'life', 'life', 'life', ['he', 'is', 'the', 'best', 'when', 'people', 'in', 'our', 'life']]

Wanted result:

contains_stopwords: ['he', 'the', 'our']
normal_words: ['is', 'best', 'when', 'people', 'in', 'life']

CodePudding user response:

You can split the text on whitespace, check whether each word is in stopwords, and append it to the matching list:

text = 'he is the best when people in our life'

stopwords = ['he', 'the', 'our']

normal_words = []
contains_stopwords = []
for word in text.split():
    if word not in stopwords:
        normal_words.append(word)
    else:
        contains_stopwords.append(word)

print(f'contains_stopwords: {contains_stopwords}')
print(f'normal_words: {normal_words}')

Output:

contains_stopwords: ['he', 'the', 'our']
normal_words: ['is', 'best', 'when', 'people', 'in', 'life']

CodePudding user response:

Use a list comprehension:

itext = 'he is the best when people in our life'

stopwords = ['he', 'the', 'our']

split_words = itext.split(' ')
contains_stopwords = [word for word in split_words if word in stopwords]
normal_words = [word for word in split_words if word not in stopwords]

print("contains_stopwords:", contains_stopwords)
print("normal_words:", normal_words)
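
One caveat with split(' ') compared with split(): splitting on a single space yields empty strings when the text contains consecutive spaces, while split() with no argument collapses runs of whitespace. The input below is a made-up example with a doubled space, used only to show the difference:

```python
# Hypothetical input with a doubled space between 'is' and 'the'.
text = 'he is  the best'

by_space = text.split(' ')   # splits on every single space character
by_whitespace = text.split() # splits on runs of whitespace

print(by_space)       # ['he', 'is', '', 'the', 'best']
print(by_whitespace)  # ['he', 'is', 'the', 'best']
```

With the sample sentence from the question both calls give the same result, so this only matters if the real input is less tidy.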

CodePudding user response:

You seem to have chosen the most difficult path. The code below should do the trick.

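text = 'he is the best when people in our life'
stopwords = ['he', 'the', 'our']
contains_stopwords = []
normal_words = []
for word in text.split():
    if word in stopwords:
        contains_stopwords.append(word)
    else:
        normal_words.append(word)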

First, we split the text into a list of words using split. Then we iterate over the words and check whether each one is in the list of stop words (Python's in operator supports this directly). If it is, we append it to contains_stopwords; if not, we append it to normal_words.
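
If you need this in more than one place, the same loop can be wrapped in a function. This is only a sketch: the name split_stopwords is made up for illustration, and converting the stop words to a set is an optional tweak that makes each membership check faster when the list is long.

```python
def split_stopwords(text, stopwords):
    """Partition the words of text into (stop words, other words).

    split_stopwords is a hypothetical helper name, not from the question.
    """
    stopword_set = set(stopwords)  # set lookups are O(1) on average
    contains_stopwords = []
    normal_words = []
    for word in text.split():
        if word in stopword_set:
            contains_stopwords.append(word)
        else:
            normal_words.append(word)
    return contains_stopwords, normal_words


found, rest = split_stopwords('he is the best when people in our life',
                              ['he', 'the', 'our'])
print("contains_stopwords:", found)
print("normal_words:", rest)
```

For a three-word stop list the plain list is perfectly fine; the set only pays off with larger lists.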

CodePudding user response:

One answer could be:

text = 'he is the best when people in our life'

stopwords = ['he', 'the', 'our']

contains_stopwords = set()  # a set holds each stop word at most once, but does not preserve order
normal_words = []

for word in text.split():
    if word in stopwords:
        contains_stopwords.add(word) 
    else:
        normal_words.append(word)

print("contains_stopwords:", contains_stopwords)
print("normal_words:", normal_words)

CodePudding user response:

A list comprehension works here, and set() then removes duplicates from the result. I converted the set back to a list to match your question, but you can leave it as a set:

text = 'he is the best when people in our life he he he'
stopwords = ['he', 'the', 'our']

list1 = list(set([item for item in text.split(" ") if item in stopwords]))
list2 = [item for item in text.split(" ") if item not in list1]

Output (the order of list1 may differ, since a set is unordered):

list1 - ['he', 'the', 'our']
list2 - ['is', 'best', 'when', 'people', 'in', 'life']
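
Because set() does not guarantee any particular order, list1 may not come out in the order the words appear in the text. If that order matters, one possible alternative (an assumption about what you want, not part of the answer above) is dict.fromkeys, which drops duplicates while keeping first-seen order on Python 3.7+:

```python
text = 'he is the best when people in our life he he he'
stopwords = ['he', 'the', 'our']

words = text.split()

# dict.fromkeys keeps only the first occurrence of each key, in insertion order.
list1 = list(dict.fromkeys(word for word in words if word in stopwords))
list2 = [word for word in words if word not in stopwords]

print("list1 -", list1)
print("list2 -", list2)
```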