I have a dataset containing the employee id, age, and experience, I need to calcaute the age and experience based on the logic
Input dataframe
emp_id | name | age | exp | YEAR |
---|---|---|---|---|
101 | Rock | 20 | 5 | 2000 |
101 | Rock | 20 | 5 | 2001 |
101 | Rock | 20 | 5 | 2002 |
102 | DED | 25 | 10 | 2000 |
102 | DED | 25 | 10 | 2001 |
102 | DED | 25 | 10 | 2002 |
103 | FEG | 30 | 35 | 2000 |
103 | FEG | 30 | 35 | 2001 |
103 | FEG | 30 | 35 | 2002 |
Expected output
emp_id | name | age | exp | YEAR | update_age | upadted_exp |
---|---|---|---|---|---|---|
101 | Rock | 20 | 5 | 2000 | 18 | 3 |
101 | Rock | 20 | 5 | 2001 | 19 | 4 |
101 | Rock | 20 | 5 | 2002 | 20 | 5 |
102 | DED | 25 | 10 | 2000 | 23 | 8 |
102 | DED | 25 | 10 | 2001 | 24 | 9 |
102 | DED | 25 | 10 | 2002 | 25 | 10 |
103 | FEG | 30 | 35 | 2000 | 28 | 33 |
103 | FEG | 30 | 35 | 2001 | 29 | 34 |
103 | FEG | 30 | 35 | 2002 | 30 | 35 |
Explanation
Logic 1- If the employee has 3 records with age and experience, needs to subtract the last record values from the above 2. Emp_id is unique.
Second Module
I have below data frame based on the max year I need to update the year column. ie - year emp_id 101 have 3 records, max year = 2001 so output column contains 2001,2000,1999
emp_id | name | year |
---|---|---|
101 | Rock | 2000 |
101 | Rock | 2000 |
101 | Rock | 2001 |
102 | DED | 2002 |
102 | DED | 2002 |
102 | DED | 2000 |
103 | FEG | 2000 |
103 | FEG | 2000 |
103 | FEG | 2000 |
Output
emp_id | name | year | upadte_year |
---|---|---|---|
101 | Rock | 2000 | 1999 |
101 | Rock | 2000 | 2000 |
101 | Rock | 2001 | 2001 |
102 | DED | 2002 | 2000 |
102 | DED | 2002 | 2001 |
102 | DED | 2000 | 2002 |
103 | FEG | 2000 | 1998 |
103 | FEG | 2000 | 1999 |
103 | FEG | 2000 | 2000 |
CodePudding user response:
You can achieve the desired output as follows. Use df.groupby
with emp_id
and retrieve the cumcount
for each group in descending order. Next, subtract the result from cols age
and exp
to get your updated columns.
import pandas as pd
data = {'emp_id': {0: 101, 1: 101, 2: 101, 3: 102, 4: 102, 5: 102,
6: 103, 7: 103, 8: 103},
'name': {0: 'Rock', 1: 'Rock', 2: 'Rock', 3: 'DED', 4: 'DED',
5: 'DED', 6: 'FEG', 7: 'FEG', 8: 'FEG'},
'age': {0: 20, 1: 20, 2: 20, 3: 25, 4: 25, 5: 25, 6: 30,
7: 30, 8: 30},
'exp': {0: 5, 1: 5, 2: 5, 3: 10, 4: 10, 5: 10, 6: 35,
7: 35, 8: 35},
'YEAR': {0: 2000, 1: 2001, 2: 2002, 3: 2000, 4: 2001,
5: 2002, 6: 2000, 7: 2001, 8: 2002}
}
df = pd.DataFrame(data)
cumcount = df.groupby('emp_id').cumcount(ascending=False)
df['updated_age'] = df['age'].sub(cumcount)
df['updated_exp'] = df['exp'].sub(cumcount)
print(df)
emp_id name age exp YEAR updated_age updated_exp
0 101 Rock 20 5 2000 18 3
1 101 Rock 20 5 2001 19 4
2 101 Rock 20 5 2002 20 5
3 102 DED 25 10 2000 23 8
4 102 DED 25 10 2001 24 9
5 102 DED 25 10 2002 25 10
6 103 FEG 30 35 2000 28 33
7 103 FEG 30 35 2001 29 34
8 103 FEG 30 35 2002 30 35
Update: as for the second question added later: to add the column updated_year
to the second df
based on the max
in column year
for each emp_id
, try as follows:
df['updated_year'] = df.groupby('emp_id')['year'].transform(max)\
.sub(df.groupby('emp_id').cumcount(ascending=False))
print(df)
emp_id name year updated_year
0 101 Rock 2000 1999
1 101 Rock 2000 2000
2 101 Rock 2001 2001
3 102 DED 2002 2000
4 102 DED 2002 2001
5 102 DED 2000 2002
6 103 FEG 2000 1998
7 103 FEG 2000 1999
8 103 FEG 2000 2000
CodePudding user response:
My assumption is that you want to correct age
and exp
based on the actual year-differences (I might be wrong, though). If that's the case, you could try:
diff = df["YEAR"] - df.groupby("emp_id")["YEAR"].transform("max")
df["update_age"] = df["age"] diff
df["update_exp"] = df["exp"] diff
- Build a
diff
series that contains the difference in years betweenyear
and the maximum year perempty_id
. - Correct
age
andexp
usingdiff
.
Result for the sample (corrected):
emp_id name age exp YEAR update_age update_exp
0 101 Rock 20 5 2000 18 3
1 101 Rock 20 5 2001 19 4
2 101 Rock 20 5 2002 20 5
3 102 DED 25 10 2000 23 8
4 102 DED 25 10 2001 24 9
5 102 DED 25 10 2002 25 10
6 103 FEG 30 35 2000 28 33
7 103 FEG 30 35 2001 29 34
8 103 FEG 30 35 2002 30 35