I have a dataframe of the following format (Actual dataframe contains more than 10000 rows)
Occupation Education
Engineer High School
Neurosurgeon Masters
Electrical Engineer Masters
Mechanical Engineer Masters
Software Engineer Masters
Engineer Masters
Business Executive Masters
Sales Executive Bachelors
Neurosurgeon Masters
Electrical Engineer
Accountant Bachelors
Sales Executive Masters
I want to add a column based on selective filtering
I need my result to be like this
Occupation Education Welfare_Cost
Engineer High School 50
Neurosurgeon Masters 50
Electrical Engineer Masters 100
Mechanical Engineer Masters 100
Software Engineer Masters 100
Engineer Masters 100
Business Executive Masters 100
Sales Executive Bachelors 50
Neurosurgeon Masters 50
Electrical Engineer 50
Accountant Bachelors 50
Sales Executive Masters 100
I want to only work on rows where a occupation contains a string from a list and Education is Masters I tried to achieve this using the following code where but kept getting errors.
lis=['Engineer','Executive','Teacher']
df['Welfare_Cost']=np.where(((df['Education']=='Masters')&
(df['Occupation'].str.contains(i for i in lis))),
100,50)
I know I can also do it by running an iterative loop to create a list for each row and add that list as a column, but I have many list combinations, so I am looking for a way where I can do this without using an interative loop.
CodePudding user response:
Use join
with \b\b
for word boundaries by |
for regex or
:
lis=['Engineer','Executive','Teacher']
pat = '|'.join(r"\b{}\b".format(x) for x in lis)
df['Welfare_Cost'] = np.where(((df['Education']=='Masters') &
(df['Occupation'].str.contains(pat))),
100,50)
Or:
df['Welfare_Cost'] = np.where(((df['Education']=='Masters') &
(df['Occupation'].str.contains('|'.join(lis)))),
100,50)
print (df)
Occupation Education Welfare_Cost
0 Engineer High School 50
1 Neurosurgeon Masters 50
2 Electrical Engineer Masters 100
3 Mechanical Engineer Masters 100
4 Software Engineer Masters 100
5 Engineer Masters 100
6 Business Executive Masters 100
7 Sales Executive Bachelors 50
8 Neurosurgeon Masters 50
9 Electrical Engineer NaN 50
10 Accountant Bachelors 50
11 Sales Executive Masters 100
Difference is possible see in changed data - \b\b
match strings without substrings:
lis=['Engineer','Executive','Teacher']
df['Welfare_Cost1'] = np.where(((df['Education']=='Masters') &
(df['Occupation'].str.contains('|'.join(lis)))),
100,50)
pat = '|'.join(r"\b{}\b".format(x) for x in lis)
df['Welfare_Cost2'] = np.where(((df['Education']=='Masters') &
(df['Occupation'].str.contains(pat))),
100,50)
print (df)
Occupation Education Welfare_Cost1 Welfare_Cost2
0 Engineer High School 50 50
1 Neurosurgeon Masters 50 50
2 Electrical Engineers Masters 100 50
3 Mechanical Engineer Masters 100 100
4 Software Engineer Masters 100 100
5 Engineer Masters 100 100
6 Business Executive Masters 100 100
7 Sales Executive Bachelors 50 50
8 Neurosurgeon Masters 50 50
9 Electrical Engineer NaN 50 50
10 Accountant Bachelors 50 50
11 Sales Executives Masters 100 50
CodePudding user response:
In your case a filter list contains only essential part of occupation name (semantically), so it'd be enough to check for str.endswith
:
df['Welfare_Cost']=np.where((df['Education']=='Masters') & df['Occupation'].str.endswith(tuple(lis)),100,50)
Occupation Education Welfare_Cost
0 Engineer High School 50
1 Neurosurgeon Masters 50
2 Electrical Engineer Masters 100
3 Mechanical Engineer Masters 100
4 Software Engineer Masters 100
5 Engineer Masters 100
6 Business Executive Masters 100
7 Sales Executive Bachelors 50
8 Neurosurgeon Masters 50
9 Electrical Engineer None 50
10 Accountant Bachelors 50
11 Sales Executive Masters 100