I need to (1) extract sentences containing specific words, (2) add asterisks around each matched word, (3) keep one sentence per row if multiple sentences match any of the terms I'm searching for, and (4) drop the rows that don't contain any of those words. My dataset looks like this.
text
I like to eat bananas. I like to eat apples. I like to each pears.
I like to eat apples. I like to eat almonds. I like to eat fruits.
I like to eat walnuts. I like to eat fruits.
I like walnuts. I like figs.
I need to find sentences containing any of these terms.
words = ['apples','fruits']
The new column should look like this.
final_text
I like to eat *apples*.
I like to eat *apples*.
I like to eat *fruits*.
I like to eat *fruits*.
Below is my code, which achieves some of the tasks, but I don't think it's efficient. My goal is to write a class that will complete all the steps.
import numpy as np
df["terms"] = df["text"].str.findall(r"\w+").apply(set(words).intersection)  # to create a column with the search terms that each row matches
df["terms_list"] = df["terms"].apply(list)  # converting set to list so I can replace the empty values (last row) with nan and later drop nan
df["terms_list"] = df["terms_list"].apply(lambda terms: terms if terms else np.nan)  # replace empty lists with nan
df.dropna(subset=["terms_list"], inplace=True)
import re
def pick_only_key_sentence(str1, word):
    # return every period-delimited sentence that contains the word
    result = re.findall(r"([^.]*" + word + r"[^.]*)", str1)
    return result
df['apples'] = df['text'].apply(lambda x: pick_only_key_sentence(x, 'apples'))  # here I was trying to output the sentences that contain the term apples, but I prefer to input a list of words instead of one word at a time.
Again, this code is not efficient at all and I'm missing a lot of steps, but it's what I have right now.
Thank you in advance for taking the time to go over this.
CodePudding user response:
Here is one way to do it.
Assumption: the sentences in the text column end with a period.
# split the string on the period (.) and explode to make one row per sentence
df2 = df['text'].str.split('.').explode()
# keep the sentences that contain any of the words,
# then use replace to add * around the matched words
pattern = rf"({'|'.join(words)})"
out = (df2[df2.str.findall(pattern).apply(len) > 0]
       .replace(pattern, r'*\1*', regex=True))
out
0 I like to eat *apples*
1 I like to eat *apples*
1 I like to eat *fruits*
2 I like to eat *fruits*
Name: text, dtype: object
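The output above drops the trailing period and keeps the original row index. As a minimal sketch of one way to address both, assuming every sentence ends with a period followed by whitespace, the variant below splits on that whitespace with a lookbehind so the period stays attached, escapes the search terms with `re.escape`, and matches whole words only. The names `alternation`, `sentences`, and `mask` are illustrative and not part of the answer above.

```python
import re

import pandas as pd

df = pd.DataFrame({"text": [
    "I like to eat bananas. I like to eat apples. I like to each pears.",
    "I like to eat apples. I like to eat almonds. I like to eat fruits.",
    "I like to eat walnuts. I like to eat fruits.",
    "I like walnuts. I like figs.",
]})
words = ["apples", "fruits"]

# Escape each term so regex metacharacters in a search word are matched literally.
alternation = "|".join(re.escape(w) for w in words)

# Split on whitespace that follows a period; the lookbehind keeps the period.
sentences = df["text"].str.split(r"(?<=\.)\s+", regex=True).explode()

# Whole-word match (\b), so "apples" would not match "pineapples".
mask = sentences.str.contains(rf"\b(?:{alternation})\b", regex=True)

out = (
    sentences[mask]
    .str.replace(rf"\b({alternation})\b", r"*\1*", regex=True)
    .reset_index(drop=True)
    .rename("final_text")
)
print(out.tolist())
```

The `regex=True` argument of `str.split` needs a reasonably recent pandas (1.4 or later); on older versions, splitting each string with `re.split` inside `apply` should give the same list of sentences.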
CodePudding user response:
Here is a proposition using some of the pandas StringMethods and pandas.Series.explode:
words = ["apples", "fruits"]
out = (
    df["text"].str.replace(f"({'|'.join(words)})", lambda m: f"*{m.group(1)}*", regex=True)
    .str.split(r"\B\s", regex=True)  # split on the whitespace that follows a sentence-ending period
    .explode()
    .to_frame("final_text")
    .loc[lambda x: x["final_text"].str.contains(r"\*")]
    .reset_index(drop=True)
)
# Output :
print(out)
final_text
0 I like to eat *apples*.
1 I like to eat *apples*.
2 I like to eat *fruits*.
3 I like to eat *fruits*.
<class 'pandas.core.frame.DataFrame'>
# Input used:
text
0 I like to eat bananas. I like to eat apples. I like to each pears.
1 I like to eat apples. I like to eat almonds. I like to eat fruits.
2 I like to eat walnuts. I like to eat fruits.
3 I like walnuts. I like figs.
<class 'pandas.core.frame.DataFrame'>
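Since the question asks for a class that covers all four steps, here is a minimal sketch of how the same logic could be wrapped up. It uses only the standard `re` module; the class name `SentenceHighlighter` and its methods are hypothetical, and it assumes a non-empty list of search terms and sentences that end with a period followed by whitespace.

```python
import re
from typing import Iterable, List


class SentenceHighlighter:
    """Hypothetical helper: keeps sentences that contain a search term
    and wraps each matched term in a marker (an asterisk by default)."""

    def __init__(self, words: Iterable[str], marker: str = "*") -> None:
        # Assumes `words` is non-empty; each term is escaped to match literally.
        alternation = "|".join(re.escape(w) for w in words)
        # Whole-word match, so "apples" does not match "pineapples".
        self._term = re.compile(rf"\b({alternation})\b")
        # Split on whitespace that follows a period (the period is kept).
        self._boundary = re.compile(r"(?<=\.)\s+")
        self._marker = marker

    def split(self, text: str) -> List[str]:
        """Return the non-empty sentences of one text."""
        return [s for s in self._boundary.split(text.strip()) if s]

    def highlight(self, sentence: str) -> str:
        """Wrap every matched term in the marker."""
        return self._term.sub(
            lambda m: f"{self._marker}{m.group(1)}{self._marker}", sentence
        )

    def extract(self, texts: Iterable[str]) -> List[str]:
        """One highlighted sentence per item; sentences without a term are dropped."""
        return [
            self.highlight(sentence)
            for text in texts
            for sentence in self.split(text)
            if self._term.search(sentence)
        ]


texts = [
    "I like to eat bananas. I like to eat apples. I like to each pears.",
    "I like to eat apples. I like to eat almonds. I like to eat fruits.",
    "I like to eat walnuts. I like to eat fruits.",
    "I like walnuts. I like figs.",
]
highlighter = SentenceHighlighter(["apples", "fruits"])
final_text = highlighter.extract(texts)
print(final_text)
```

With a DataFrame, the column can be passed directly, for example `pd.DataFrame({"final_text": highlighter.extract(df["text"])})`, because iterating over a Series yields its values.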