If contains the substring, generate a new column with TRUE

Time:09-23

I am trying to add a TRUE/FALSE tag to a dataframe of email messages that has the columns SenderEmail, Counterparties, and MessageBody:

df['Spam'] = df['SenderEmail'].apply(lambda x: True if "no" and "reply" in x.lower() else "")
df['Spam'] = df['MessageBody'].apply(lambda x: True if "please do not reply" in x.lower() else "")

The code works, but I realised that when I run the two lines one after the other, the results of the second line overwrite the results of the first, leaving me with the second line's results only. I can't remove the else "" while using this approach, so I was thinking of using a for loop instead, but I'm not sure how to do so.

CodePudding user response:

You can use

df['Spam'] = (df['SenderEmail'].str.contains('^(?=.*no)(?=.*reply)', case=False) | 
              df['MessageBody'].str.contains('please do not reply', case=False))

Here,

  • df['SenderEmail'].str.contains('^(?=.*no)(?=.*reply)', case=False) checks whether the SenderEmail value contains both substrings, no and reply, in any order
  • df['MessageBody'].str.contains('please do not reply', case=False) checks whether the MessageBody value contains the substring please do not reply.

Passing case=False makes both checks case-insensitive. The two boolean results are combined with |, so a row is flagged when either condition holds.
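
A side note on why both substrings have to be tested explicitly: in the question's lambda, the expression "no" and "reply" in x.lower() only tests for reply, because "no" is a non-empty string and therefore always truthy. A minimal plain-Python sketch of the difference (the function names and sample addresses are made up for illustration):

```python
def buggy_check(sender: str) -> bool:
    # "no" is a non-empty string, so it is always truthy;
    # the expression reduces to `"reply" in sender.lower()`.
    return bool("no" and "reply" in sender.lower())


def both_substrings(sender: str) -> bool:
    # Test each substring on its own, then combine the results.
    s = sender.lower()
    return "no" in s and "reply" in s


print(buggy_check("reply@example.com"))         # True, although "no" is absent
print(both_substrings("reply@example.com"))     # False
print(both_substrings("No-Reply@example.com"))  # True
```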

Pandas test:

import pandas as pd
df = pd.DataFrame(
    {'SenderEmail': ['no reply', 'reply', 'no', 'and more no some reply'], 
     'MessageBody':['ok', 'please do not reply', 'ok', 'ok']})
df['Spam'] = (df['SenderEmail'].str.contains('^(?=.*no)(?=.*reply)', case=False) | 
              df['MessageBody'].str.contains('please do not reply', case=False))
# => df
#                 SenderEmail          MessageBody   Spam
#   0                no reply                   ok   True
#   1                   reply  please do not reply   True
#   2                      no                   ok  False
#   3  and more no some reply                   ok   True
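
If the lookahead regex feels hard to read, the same idea can probably be written with plain substring checks. The following is only a sketch under a few assumptions: the sample rows are invented, the columns are lowercased first so the literal checks ignore case, and na=False is passed so that missing values count as "no match" rather than propagating as missing.

```python
import pandas as pd

# Invented sample data; the None simulates a missing sender address.
df = pd.DataFrame({
    "SenderEmail": ["no-reply@example.com", "reply@example.com", None],
    "MessageBody": ["ok", "Please do not reply", "ok"],
})

# Lowercase once so the literal (non-regex) checks are case-insensitive.
sender = df["SenderEmail"].str.lower()
body = df["MessageBody"].str.lower()

# na=False treats missing values as "does not contain".
has_no = sender.str.contains("no", regex=False, na=False)
has_reply = sender.str.contains("reply", regex=False, na=False)
body_flag = body.str.contains("please do not reply", regex=False, na=False)

# Flag a row when the sender has both substrings OR the body has the phrase.
df["Spam"] = (has_no & has_reply) | body_flag
print(df["Spam"].tolist())
```

Building each condition as its own boolean Series and combining them in one assignment avoids the original problem, where the second assignment to df['Spam'] replaced the first.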