I have a dataframe with employee IDs, the job they perform, the location the job is performed and a column that designates if that job is the primary job of that employee, like this one:
ID Job Location Main Job?
10 Driver Peru street Yes
11 Nurse Jorge street No
11 Helper Jorge street Yes
11 Driver Phoenix av. No
11 Admin Jorge street No
12 Driver Yard av. Yes
12 Helper Yard av. No
12 Nurse Jorge street No
13 Driver Peru street Yes
13 Helper Jorge street No
As you can see, an employee may perform different jobs on the same Location but only one job is marked as "Yes" in the "Main Job?" column by each unique "ID".
I need now to create a column named "Primary Location?" which always mark as "Yes" if the job performed by each employee (represented by the unique ID) is performed on the Location the "Main Job" also occurs (and setting as No if do not occur in the Location of the main job). This way resulting on the following dataframe:
ID Job Location Main Job? Primary Location?
10 Driver Peru street Yes Yes
11 Nurse Jorge street No Yes
11 Helper Jorge street Yes Yes
11 Driver Phoenix av. No No
11 Admin Jorge street No Yes
12 Driver Yard av. Yes Yes
12 Helper Yard av. No Yes
12 Nurse Jorge street No No
13 Driver Peru street Yes Yes
13 Helper Jorge street No No
What is the best way to achieve this logic?
Thank you!
CodePudding user response:
You need to do the following:
import pandas as pd
df = pd.DataFrame({
'ID': [10, 11, 11, 11, 11, 12, 12, 12, 13, 13],
'Job': ['Driver', 'Nurse', 'Helper', 'Driver', 'Admin', 'Driver', 'Helper', 'Nurse', 'Driver', 'Helper'],
'Location': ['Peru street', 'Jorge street', 'Jorge street', 'Phoenix av.', 'Jorge street', 'Yard av.', 'Yard av.', 'Jorge street', 'Peru street', 'Jorge street'],
'Main Job?': ['Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No']
})
# Filter Main Locations
main_jobs = df.loc[df['Main Job?'] == 'Yes'][['ID', 'Location']].rename(columns = {'Location': 'Main Location'})
# Add Main Location Column
df = df.merge(main_jobs, on=['ID'], how='left')
# Build Primary Location Column
df['Primary Location?'] = df['Location'] == df['Main Location']
# Convert True/False to String
df['Primary Location?'] = np.where(df['Primary Location?'], 'Yes', 'No')
# Select Relevant columns
df = df[['ID', 'Job', 'Location', 'Main Job?', 'Primary Location?']]
print(df)
Outputs:
>>>
ID Job Location Main Job? Primary Location?
0 10 Driver Peru street Yes Yes
1 11 Nurse Jorge street No Yes
2 11 Helper Jorge street Yes Yes
3 11 Driver Phoenix av. No No
4 11 Admin Jorge street No Yes
5 12 Driver Yard av. Yes Yes
6 12 Helper Yard av. No Yes
7 12 Nurse Jorge street No No
8 13 Driver Peru street Yes Yes
9 13 Helper Jorge street No No
Hope it helps! :)