Home > Mobile >  Pandas how to perform the inner calculation in the dataset?
Pandas how to perform the inner calculation in the dataset?

Time:08-30

I have a dataset containing the employee id, age, and experience, I need to calcaute the age and experience based on the logic

Input dataframe

emp_id name age exp YEAR
101 Rock 20 5 2000
101 Rock 20 5 2001
101 Rock 20 5 2002
102 DED 25 10 2000
102 DED 25 10 2001
102 DED 25 10 2002
103 FEG 30 35 2000
103 FEG 30 35 2001
103 FEG 30 35 2002

Expected output

emp_id name age exp YEAR update_age upadted_exp
101 Rock 20 5 2000 18 3
101 Rock 20 5 2001 19 4
101 Rock 20 5 2002 20 5
102 DED 25 10 2000 23 8
102 DED 25 10 2001 24 9
102 DED 25 10 2002 25 10
103 FEG 30 35 2000 28 33
103 FEG 30 35 2001 29 34
103 FEG 30 35 2002 30 35

Explanation

Logic 1- If the employee has 3 records with age and experience, needs to subtract the last record values from the above 2. Emp_id is unique.

Second Module

I have below data frame based on the max year I need to update the year column. ie - year emp_id 101 have 3 records, max year = 2001 so output column contains 2001,2000,1999

emp_id name year
101 Rock 2000
101 Rock 2000
101 Rock 2001
102 DED 2002
102 DED 2002
102 DED 2000
103 FEG 2000
103 FEG 2000
103 FEG 2000

Output

emp_id name year upadte_year
101 Rock 2000 1999
101 Rock 2000 2000
101 Rock 2001 2001
102 DED 2002 2000
102 DED 2002 2001
102 DED 2000 2002
103 FEG 2000 1998
103 FEG 2000 1999
103 FEG 2000 2000

CodePudding user response:

You can achieve the desired output as follows. Use df.groupby with emp_id and retrieve the cumcount for each group in descending order. Next, subtract the result from cols age and exp to get your updated columns.

import pandas as pd

data = {'emp_id': {0: 101, 1: 101, 2: 101, 3: 102, 4: 102, 5: 102, 
                   6: 103, 7: 103, 8: 103}, 
        'name': {0: 'Rock', 1: 'Rock', 2: 'Rock', 3: 'DED', 4: 'DED', 
                 5: 'DED', 6: 'FEG', 7: 'FEG', 8: 'FEG'}, 
        'age': {0: 20, 1: 20, 2: 20, 3: 25, 4: 25, 5: 25, 6: 30, 
                7: 30, 8: 30}, 
        'exp': {0: 5, 1: 5, 2: 5, 3: 10, 4: 10, 5: 10, 6: 35, 
                7: 35, 8: 35}, 
        'YEAR': {0: 2000, 1: 2001, 2: 2002, 3: 2000, 4: 2001, 
                 5: 2002, 6: 2000, 7: 2001, 8: 2002}
        }

df = pd.DataFrame(data)

cumcount = df.groupby('emp_id').cumcount(ascending=False)

df['updated_age'] = df['age'].sub(cumcount)
df['updated_exp'] = df['exp'].sub(cumcount)

print(df)

   emp_id  name  age  exp  YEAR  updated_age  updated_exp
0     101  Rock   20    5  2000           18            3
1     101  Rock   20    5  2001           19            4
2     101  Rock   20    5  2002           20            5
3     102   DED   25   10  2000           23            8
4     102   DED   25   10  2001           24            9
5     102   DED   25   10  2002           25           10
6     103   FEG   30   35  2000           28           33
7     103   FEG   30   35  2001           29           34
8     103   FEG   30   35  2002           30           35

Update: as for the second question added later: to add the column updated_year to the second df based on the max in column year for each emp_id, try as follows:

df['updated_year'] = df.groupby('emp_id')['year'].transform(max)\
    .sub(df.groupby('emp_id').cumcount(ascending=False))

print(df)

   emp_id  name  year  updated_year
0     101  Rock  2000          1999
1     101  Rock  2000          2000
2     101  Rock  2001          2001
3     102   DED  2002          2000
4     102   DED  2002          2001
5     102   DED  2000          2002
6     103   FEG  2000          1998
7     103   FEG  2000          1999
8     103   FEG  2000          2000

CodePudding user response:

My assumption is that you want to correct age and exp based on the actual year-differences (I might be wrong, though). If that's the case, you could try:

diff = df["YEAR"] - df.groupby("emp_id")["YEAR"].transform("max")
df["update_age"] = df["age"]   diff
df["update_exp"] = df["exp"]   diff
  • Build a diff series that contains the difference in years between year and the maximum year per empty_id.
  • Correct age and exp using diff.

Result for the sample (corrected):

   emp_id  name  age  exp  YEAR  update_age  update_exp
0     101  Rock   20    5  2000          18           3
1     101  Rock   20    5  2001          19           4
2     101  Rock   20    5  2002          20           5
3     102   DED   25   10  2000          23           8
4     102   DED   25   10  2001          24           9
5     102   DED   25   10  2002          25          10
6     103   FEG   30   35  2000          28          33
7     103   FEG   30   35  2001          29          34
8     103   FEG   30   35  2002          30          35
  • Related