I have a big dataframe indexed by date that contains info from employees:
DATE USER_ID POSITION_ID SALARY
2019-12-01 101 A1 1000
2019-12-01 103 A2 500
2019-12-01 105 C2 15000
2020-01-01 101 B1 1100
2020-01-01 103 A2 500
2019-12-01 105 C3 15000
I want to create a new column that monthly traces whether an employee got a promotion or just simply changed of position. Something like that:
DATE USER_ID POSITION_ID SALARY EMPLOYEE_MOVEMENT
2019-12-01 101 A1 1000 NONE
2019-12-01 103 A2 500 NONE
2019-12-01 105 C2 15000 NONE
2020-01-01 101 B1 1100 PROMOTION
2020-01-01 103 A2 500 NONE
2020-01-01 105 C2 16000 RAISE
So far I have reached the desired output by slicing my dataframe by months, df1 contains December 2019 and df2 contains January 2021 and merging them again to create new columns for their position and salry by month.
USER_ID DEC_POSITION_ID DEC_SALARY JAN_POSITION_ID JAN_SALARY
101 A1 1000 B1 1100
103 A2 500 A2 500
105 C2 16000 C2 15000
Then I am able to make a comparison between the two dataframes:
conditions=[(df["JAN_POSITION_ID"] == df["DEC_POSITION_ID"]) & (df["JAN_SALARY"] == df["DEC_SALARY"]),
(df["JAN_POSITION_ID"] != df["DEC_POSITION_ID"]) & (df["JAN_SALARY"] > df["DEC_SALARY"]),
(df["JAN_POSITION_ID"] == df["DEC_POSITION_ID"]) & (df["JAN_SALARY"] > df["DEC_SALARY"])]
values=["NONE", "PROMOTION", "RAISE"]
df["MOVEMENT"] = np.select(conditions, values)
Then I get the following output:
USER_ID DEC_POSITION_ID DEC_SALARY JAN_POSITION_ID JAN_SALARY MOVEMENT
101 A1 1000 B1 1100 PROMOTION
103 A2 500 A2 500 NONE
105 C2 16000 C2 15000 RAISE
But since it is a huge data set (three years of info) I want to trace the monthy movements (each month compared to the previous one) of employees without slicing my dataframe and get something like I have already mentioned:
DATE USER_ID POSITION_ID SALARY MOVEMENT
2019-12-01 101 A1 1000 NONE
2019-12-01 103 A2 500 NONE
2019-12-01 105 C2 15000 NONE
2020-01-01 101 B1 1100 PROMOTION
2020-01-01 103 A2 500 NONE
2020-01-01 105 C2 16000 RAISE
Any thoughts on that? Thank you so much for your help!
CodePudding user response:
Here's a solution but it only works if you have every combination observation of month/user-id
import pandas as pd
#Test table
df = pd.DataFrame({
'DATE': ['2019-12-01','2019-12-01','2019-12-01','2020-01-01','2020-01-01','2020-01-01'],
'USER_ID': [101, 103, 105, 101, 103, 105],
'POSITION_ID': ['A1', 'A2', 'C2', 'B1', 'A2', 'C3'],
'SALARY': [1000, 500, 15000, 1100, 500, 15000]
})
df['DATE'] = pd.to_datetime(df['DATE'])
#Groupby and shift to compare the position and salary month to month
prev_salary = df.groupby('USER_ID')['SALARY'].shift(1)
prev_position = df.groupby('USER_ID')['POSITION_ID'].shift(1)
changed_positions = prev_position.notnull() & df['POSITION_ID'].ne(prev_position)
pay_raise = prev_salary.notnull() & df['SALARY'].gt(prev_salary)
#Create the MOVEMENT column
df.loc[changed_positions & ~pay_raise,'MOVEMENT'] = 'Changed position without raise'
df.loc[changed_positions & pay_raise,'MOVEMENT'] = 'Changed position with raise'
df.loc[~changed_positions & pay_raise,'MOVEMENT'] = 'Same position with raise'
df