Extract sentence containing specific word and add symbols around it?

I need to (1) extract the sentences that contain specific words, (2) add asterisks around each matched word, (3) keep one sentence per row when several sentences match any of the terms I'm searching for, and (4) drop the rows that don't contain any of those words. My dataset looks like this:

 text
 I like to eat bananas. I like to eat apples. I like to each pears.
 I like to eat apples. I like to eat almonds. I like to eat fruits.
 I like to eat walnuts. I like to eat fruits.
 I like walnuts. I like figs.

I need to find the sentences containing any of these terms:

words = ['apples','fruits']

The new column should look like this:

 final_text
 I like to eat *apples*.
 I like to eat *apples*.
 I like to eat *fruits*.
 I like to eat *fruits*.

Below is my code, which achieves some of these tasks, but I don't think it's efficient. My goal is to write a class that completes all the steps.

  import numpy as np

  # to create a column with the search terms found in each row
  # (periods are removed first so that "apples." matches "apples")
  df["terms"] = df["text"].str.replace(".", " ", regex=False).str.split().apply(set(words).intersection)
  # converting set to list, with NaN for rows without a match (last row) so they can be dropped
  df["terms_list"] = df["terms"].apply(lambda s: list(s) if s else np.nan)
  df = df.dropna(subset=["terms_list"])

  import re

  def pick_only_key_sentence(str1, word):
      # return every period-delimited chunk of str1 that contains word
      result = re.findall(r'([^.]*' + word + r'[^.]*)', str1)
      return result

  # here I was trying to output the sentences that contain the term "apples",
  # but I would prefer to pass a list of words instead of one word at a time
  df['apples'] = df['text'].apply(lambda x: pick_only_key_sentence(x, 'apples'))

Again, this code is not efficient at all and I'm missing a lot of steps, but it's what I have right now.

Thank you in advance for taking the time to go over this.

CodePudding user response：

Here is one way to do it.

Assumption: every sentence in the text column ends with a period.

# build one alternation pattern from the search terms
pattern = '|'.join(words)

# split the string on period (.) and explode to make rows out of it
df2 = df['text'].str.split('.').explode()

# keep only the sentences that contain one of the words,
# then use replace to add * around the matched word
out = df2[df2.str.contains(pattern)].replace(rf"({pattern})", r'*\1*', regex=True)
out

0     I like to eat *apples*
1     I like to eat *apples*
1     I like to eat *fruits*
2     I like to eat *fruits*
Name: text, dtype: object
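
Splitting on '.' removes the period itself (and leaves a leading space on every sentence after the first), which is why the output above has no trailing period. A minimal sketch of one way to keep it, assuming sentences are separated by a period followed by whitespace and assuming pandas 1.4 or later (for the regex argument of str.split); the sample frame is rebuilt here only to make the snippet self-contained:

```python
import pandas as pd

df = pd.DataFrame({"text": [
    "I like to eat bananas. I like to eat apples. I like to each pears.",
    "I like to eat apples. I like to eat almonds. I like to eat fruits.",
    "I like to eat walnuts. I like to eat fruits.",
    "I like walnuts. I like figs.",
]})
words = ["apples", "fruits"]
pattern = "|".join(words)

# Split on the whitespace that follows a period; the lookbehind
# leaves the period attached to its sentence
sentences = df["text"].str.split(r"(?<=\.)\s+", regex=True).explode()

# Keep the matching sentences and wrap the matched word in asterisks
out = (
    sentences[sentences.str.contains(pattern)]
    .str.replace(rf"({pattern})", r"*\1*", regex=True)
    .rename("final_text")
    .reset_index(drop=True)
)
print(out.tolist())
```

The reset_index(drop=True) call is optional; without it, each sentence keeps the index of the row it came from.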

CodePudding user response：

Here is a proposal using some of the pandas string methods and pandas.Series.explode:

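words = ["apples", "fruits"]

out = (
        # wrap every matched word in asterisks
        df["text"].str.replace(f"({'|'.join(words)})", lambda m: f"*{m.group(1)}*", regex=True)
                  # split on the whitespace that follows a period (raw string avoids escape warnings)
                  .str.split(r"\B\s", regex=True)
                  # Series.explode takes no column name
                  .explode()
                  .to_frame("final_text")
                  # keep only the sentences that received an asterisk
                  .loc[lambda x: x["final_text"].str.contains(r"\*")]
                  .reset_index(drop=True)
      )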

# Output :

print(out)

                final_text
0  I like to eat *apples*.
1  I like to eat *apples*.
2  I like to eat *fruits*.
3  I like to eat *fruits*.

<class 'pandas.core.frame.DataFrame'>
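
One caveat with a pattern built as '|'.join(words): it matches substrings, so "apples" would also be marked inside "pineapples", and a term containing regex metacharacters would be interpreted as a pattern. A small sketch of a stricter pattern, assuming whole-word matching is what is wanted (the sample sentence is made up for illustration):

```python
import re

words = ["apples", "fruits"]

# Escape each term and anchor the alternation on word boundaries
# so that only whole words match
pattern = re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")\b")

text = "I like pineapples. I like to eat apples."
marked = pattern.sub(r"*\1*", text)
print(marked)
```

The same pattern string (pattern.pattern) can be used in place of the plain alternation in the pandas calls above.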

# Input used:

                                                                 text
0  I like to eat bananas. I like to eat apples. I like to each pears.
1  I like to eat apples. I like to eat almonds. I like to eat fruits.
2                        I like to eat walnuts. I like to eat fruits.
3                                        I like walnuts. I like figs.

<class 'pandas.core.frame.DataFrame'>
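
Since the question asks for a class, the steps could be wrapped roughly as follows. This is only a sketch: the class name SentenceHighlighter, its transform method, and the marker parameter are hypothetical names chosen for illustration, and it assumes sentences end with a period followed by whitespace, whole-word matching, and pandas 1.4 or later.

```python
import re
import pandas as pd


class SentenceHighlighter:
    """Hypothetical helper: one output row per sentence that mentions a search term."""

    def __init__(self, words, marker="*"):
        self.marker = marker
        # Whole-word alternation of the escaped search terms (non-capturing group)
        self.pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"

    def transform(self, df, column="text"):
        # One sentence per row; the lookbehind keeps the trailing period
        sentences = df[column].str.split(r"(?<=\.)\s+", regex=True).explode()
        # Drop sentences without any search term (missing values count as no match)
        matched = sentences[sentences.str.contains(self.pattern, regex=True, na=False)]
        # Surround each matched word with the marker
        wrapped = matched.str.replace(
            self.pattern,
            lambda m: f"{self.marker}{m.group(0)}{self.marker}",
            regex=True,
        )
        return wrapped.rename("final_text").reset_index(drop=True).to_frame()


df = pd.DataFrame({"text": [
    "I like to eat bananas. I like to eat apples. I like to each pears.",
    "I like to eat apples. I like to eat almonds. I like to eat fruits.",
    "I like to eat walnuts. I like to eat fruits.",
    "I like walnuts. I like figs.",
]})

out = SentenceHighlighter(["apples", "fruits"]).transform(df)
print(out)
```

Keeping the word list and the marker in the constructor means the same object can be reused on several frames or columns.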