Home > front end >  Extract features from text data on python
Extract features from text data on python

Time:03-10

I have a dataframe from pandas like this.

ID            email
1            [email protected]
2            [email protected]
3            [email protected]
4            [email protected]
5            [email protected]
6            [email protected]

I want to learn company from email(after @).Sample output like this.

Sample output

ID            email                WorkGoogle     WorkFacebook    etc.....
1            [email protected]          Yes             No              ..
2            [email protected]        No              Yes             .. 
3            [email protected]          Yes             No               ..   
4            [email protected]           No              No              ..
5            [email protected]          No              No              ..
6            [email protected]        No              Yes             ..

Need to care Uppercase lowercase.

CodePudding user response:

workplace = df.email.rename("workplace").apply(lambda x: x.split("@")[1].lower())
pd.concat([df,
           pd.DataFrame(workplace.apply(lambda x: {x: "Yes"}).to_list(), index=df.index)],
          axis=1).fillna("No")

#    ID             email google.com facebook.com tesla.com hilton.com
# 0   1    [email protected]        Yes           No        No         No
# 1   2  [email protected]         No          Yes        No         No
# 2   3    [email protected]        Yes           No        No         No
# 3   4     [email protected]         No           No       Yes         No
# 4   5    [email protected]         No           No        No        Yes
# 5   6  [email protected]         No          Yes        No         No

But maybe you can just add a column instead of multiple

df["workplace"] = df.email.rename("workplace").str.lower().str.split("@").str[1]
# Then you could do
df.groupby("workplace").agg(list)
#                   ID                                 email
# workplace                                                 
# facebook.com  [2, 6]  [[email protected], [email protected]]
# google.com    [1, 3]      [[email protected], [email protected]]
# hilton.com       [5]                      [[email protected]]
# tesla.com        [4]                       [[email protected]]

CodePudding user response:

here is the dynamic way without looping, using pivot_table:

df['domain'] = df['email'].str.split('@').str[1].str.split('.').str[0].str.lower()
df = df.pivot_table(index=['ID','email'], columns='domain',aggfunc=lambda x: 'Yes' if len(x)> 0 else 'No', fill_value='No')

output:

>>
domain              facebook google hilton tesla
ID email                                        
1  [email protected]         No    Yes     No    No
2  [email protected]      Yes     No     No    No
3  [email protected]         No    Yes     No    No
4  [email protected]          No     No     No   Yes
5  [email protected]         No     No    Yes    No
6  [email protected]      Yes     No     No    No

CodePudding user response:

Assuming df is your dataframe, please try this:

import numpy as np
df['workplace'] = df['email'].str.split('@',1).apply(lambda x:x[1].split('.',1)[0])
for workplace in df['workplace'].unique():
  df.loc[:,'Work' workplace] = 'No'
  df['Work' workplace] = np.where(df['workplace']==workplace,'Yes','No')
df = df.drop(columns=['workplace'],axis=1)

CodePudding user response:

FYI: this solution is not performance efficient. I am sure in the comments on this answer, you may find a more efficient solution

I would first make a list of all companies by saying:

companies = set([email.split('@')[1].split('.')[0].lower() for email in df['email']])

Then simply iterate over this:

for company in companies:
    df['Work' company.capitalize()] = df['email'].apply(lambda x: x.split("@")[1].lower()).str.contains(company)

CodePudding user response:

if you are OK with manually making the new column for each domain, you can use this

df['WorkGoogle'] = df['email'].str.lower().str.contains('google')
df['WorkFacebook'] = df['email'].str.lower().str.contains('facebook')
# etc etc

That will give you True/False rather than Yes/No. If you want Yes/No, you can map

df['WorkGoogle'] = df['email'].str.lower().str.contains('google').map({True:'Yes',False:'No'})
  • Related