How to create a column in a dataframe using filtering for string values from a list?-CodePudding

I have a dataframe of the following format (Actual dataframe contains more than 10000 rows)

Occupation                  Education
Engineer                    High School    
Neurosurgeon                Masters
Electrical Engineer         Masters
Mechanical Engineer         Masters
Software Engineer           Masters
Engineer                    Masters
Business Executive          Masters
Sales Executive             Bachelors
Neurosurgeon                Masters
Electrical Engineer
Accountant                  Bachelors
Sales Executive             Masters

I want to add a column based on selective filtering

I need my result to be like this

Occupation                  Education               Welfare_Cost
Engineer                    High School             50 
Neurosurgeon                Masters                 50
Electrical Engineer         Masters                 100
Mechanical Engineer         Masters                 100
Software Engineer           Masters                 100
Engineer                    Masters                 100
Business Executive          Masters                 100
Sales Executive             Bachelors               50
Neurosurgeon                Masters                 50
Electrical Engineer                                 50
Accountant                  Bachelors               50 
Sales Executive             Masters                 100

I want to only work on rows where a occupation contains a string from a list and Education is Masters I tried to achieve this using the following code where but kept getting errors.


lis=['Engineer','Executive','Teacher']

df['Welfare_Cost']=np.where(((df['Education']=='Masters')&
                        (df['Occupation'].str.contains(i for i in lis))),        
                      100,50)

I know I can also do it by running an iterative loop to create a list for each row and add that list as a column, but I have many list combinations, so I am looking for a way where I can do this without using an interative loop.

CodePudding user response：

Use join with \b\b for word boundaries by | for regex or:

lis=['Engineer','Executive','Teacher']

pat = '|'.join(r"\b{}\b".format(x) for x in lis)

df['Welfare_Cost'] = np.where(((df['Education']=='Masters') & 
                              (df['Occupation'].str.contains(pat))),
                              100,50)

Or:

df['Welfare_Cost'] = np.where(((df['Education']=='Masters') & 
                              (df['Occupation'].str.contains('|'.join(lis)))),
                              100,50)

print (df)
             Occupation  Education  Welfare_Cost
0         Engineer High     School            50
1          Neurosurgeon    Masters            50
2   Electrical Engineer    Masters           100
3   Mechanical Engineer    Masters           100
4     Software Engineer    Masters           100
5              Engineer    Masters           100
6    Business Executive    Masters           100
7       Sales Executive  Bachelors            50
8          Neurosurgeon    Masters            50
9   Electrical Engineer        NaN            50
10           Accountant  Bachelors            50
11      Sales Executive    Masters           100

Difference is possible see in changed data - \b\b match strings without substrings:

lis=['Engineer','Executive','Teacher']

df['Welfare_Cost1'] = np.where(((df['Education']=='Masters') & 
                              (df['Occupation'].str.contains('|'.join(lis)))),
                              100,50)

pat = '|'.join(r"\b{}\b".format(x) for x in lis)

df['Welfare_Cost2'] = np.where(((df['Education']=='Masters') & 
                              (df['Occupation'].str.contains(pat))),
                              100,50)

print (df)
              Occupation  Education  Welfare_Cost1  Welfare_Cost2
0          Engineer High     School             50             50
1           Neurosurgeon    Masters             50             50
2   Electrical Engineers    Masters            100             50
3    Mechanical Engineer    Masters            100            100
4      Software Engineer    Masters            100            100
5               Engineer    Masters            100            100
6     Business Executive    Masters            100            100
7        Sales Executive  Bachelors             50             50
8           Neurosurgeon    Masters             50             50
9    Electrical Engineer        NaN             50             50
10            Accountant  Bachelors             50             50
11      Sales Executives    Masters            100             50

CodePudding user response：

In your case a filter list contains only essential part of occupation name (semantically), so it'd be enough to check for str.endswith:

df['Welfare_Cost']=np.where((df['Education']=='Masters') & df['Occupation'].str.endswith(tuple(lis)),100,50)

             Occupation    Education  Welfare_Cost
0              Engineer  High School            50
1          Neurosurgeon      Masters            50
2   Electrical Engineer      Masters           100
3   Mechanical Engineer      Masters           100
4     Software Engineer      Masters           100
5              Engineer      Masters           100
6    Business Executive      Masters           100
7       Sales Executive    Bachelors            50
8          Neurosurgeon      Masters            50
9   Electrical Engineer         None            50
10           Accountant    Bachelors            50
11      Sales Executive      Masters           100