Calculate year-wise age in Pandas

Let's say I have an Employees table and a yearly survey filled in by each person. I need to transform this transactional data into year-wise prediction data.

Available Data:

E_ID	TestYear	DateOfBirth
1	2010	1947-01-01
1	2011	1947-01-01
1	2012	1947-01-01
2	2010	1990-01-01
3	2011	1999-01-01
4	2011	1991-01-01
4	2012	1991-01-01
5	2010	1989-01-01
5	2011	1989-01-01
5	2012	1989-01-01
5	2013	1989-01-01

DataFrame I need:

E_ID	Year	Age
1	2010	63
1	2011	64
1	2012	65
2	2010	20
2	2011	21
2	2012	22
3	2010	11
3	2011	12
3	2012	13
4	2010	19
4	2011	20
4	2012	21
5	2010	21
5	2011	22
5	2012	23

In the new DataFrame I need every employee for all three years (2010, 2011, 2012), with their age in each of those years.

How can I achieve this, given that the transactional data has records for some employees in some years but not in others?

CodePudding user response：

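You can produce a Series of birth years from a substring of the DateOfBirth column (this assumes the column holds strings such as '1947-01-01'). You can then subtract that Series from the TestYear Series to get the age. Both Series come from the same DataFrame, so they have the same size and order. Note that this only adds an age to the rows that already exist; it does not create rows for the missing years.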

dob_years = df['DateOfBirth'].str[:4].astype(int)
df['Age'] = df['TestYear'] - dob_years
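
For reference, here is a minimal self-contained sketch of the subtraction above, run on the sample data from the question. The DataFrame construction is an assumption about how the data is loaded (DateOfBirth as strings); the last two lines are the answer's own.

```python
import pandas as pd

# Sample data from the question; DateOfBirth is assumed to be stored as strings.
df = pd.DataFrame({
    "E_ID": [1, 1, 1, 2, 3, 4, 4, 5, 5, 5, 5],
    "TestYear": [2010, 2011, 2012, 2010, 2011, 2011, 2012,
                 2010, 2011, 2012, 2013],
    "DateOfBirth": ["1947-01-01"] * 3 + ["1990-01-01", "1999-01-01"]
                   + ["1991-01-01"] * 2 + ["1989-01-01"] * 4,
})

# First four characters of 'YYYY-MM-DD' are the birth year.
dob_years = df["DateOfBirth"].str[:4].astype(int)

# Element-wise subtraction; rows are matched by their shared index.
df["Age"] = df["TestYear"] - dob_years

print(df)
```

The result still has the original 11 rows, so employees 2, 3 and 4 are not yet expanded to all three years.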

CodePudding user response：

Create a MultiIndex with MultiIndex.from_product from the unique E_ID values and the list of years, then use DataFrame.reindex to drop years outside the list and add rows for the missing ones. Fill the missing DateOfBirth values per E_ID with GroupBy.transform, and finally subtract the birth year from TestYear:

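import pandas as pd

y = [2010, 2011, 2012]

# every (E_ID, year) combination that should appear in the result
mux = pd.MultiIndex.from_product([df['E_ID'].unique(), y], names=['E_ID','TestYear'])
# reindex drops years not in y and adds NaN rows for missing combinations
df = df.set_index(['E_ID','TestYear']).reindex(mux).reset_index()
# fill each employee's missing DateOfBirth with their first known value
df['DateOfBirth'] = pd.to_datetime(df.groupby('E_ID')['DateOfBirth'].transform('first'))

df['Age'] = df['TestYear'].sub(df['DateOfBirth'].dt.year)
print(df)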
    E_ID  TestYear DateOfBirth  Age
0      1      2010  1947-01-01   63
1      1      2011  1947-01-01   64
2      1      2012  1947-01-01   65
3      2      2010  1990-01-01   20
4      2      2011  1990-01-01   21
5      2      2012  1990-01-01   22
6      3      2010  1999-01-01   11
7      3      2011  1999-01-01   12
8      3      2012  1999-01-01   13
9      4      2010  1991-01-01   19
10     4      2011  1991-01-01   20
11     4      2012  1991-01-01   21
12     5      2010  1989-01-01   21
13     5      2011  1989-01-01   22
14     5      2012  1989-01-01   23
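
To make the intermediate state visible, the following sketch runs the same steps on the question's sample data and counts how many rows reindex has to create. The variable names (`full_index`, `expanded`, `missing_before_fill`) are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data from the question; DateOfBirth is assumed to be stored as strings.
df = pd.DataFrame({
    "E_ID": [1, 1, 1, 2, 3, 4, 4, 5, 5, 5, 5],
    "TestYear": [2010, 2011, 2012, 2010, 2011, 2011, 2012,
                 2010, 2011, 2012, 2013],
    "DateOfBirth": ["1947-01-01"] * 3 + ["1990-01-01", "1999-01-01"]
                   + ["1991-01-01"] * 2 + ["1989-01-01"] * 4,
})
years = [2010, 2011, 2012]

# All (E_ID, TestYear) pairs wanted in the output.
full_index = pd.MultiIndex.from_product(
    [df["E_ID"].unique(), years], names=["E_ID", "TestYear"]
)

# Rows outside `years` (employee 5 in 2013) are dropped;
# pairs absent from the data get a row with a missing DateOfBirth.
expanded = df.set_index(["E_ID", "TestYear"]).reindex(full_index).reset_index()
missing_before_fill = int(expanded["DateOfBirth"].isna().sum())

# 'first' skips missing values, so each employee's known date is broadcast.
expanded["DateOfBirth"] = pd.to_datetime(
    expanded.groupby("E_ID")["DateOfBirth"].transform("first")
)
expanded["Age"] = expanded["TestYear"] - expanded["DateOfBirth"].dt.year

print(missing_before_fill)
print(expanded)
```

With the sample data, 5 of the 15 rows are created by reindex and then filled from the employee's other records.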

CodePudding user response：

Since each employee ID is unique and has a single date of birth, you can group by E_ID and take the date of birth from each group.

In the aggregation, map TestYear to the list of years you want ages for, and map DateOfBirth to the first entry of each group (every entry in a group is the same date of birth). Then explode TestYear to get one row per year:

df = df.groupby('E_ID').agg({"TestYear": lambda x: [2010, 2011, 2012], 
                        'DateOfBirth': lambda x: list(x)[0]}).explode("TestYear")

df['DateOfBirth'] = pd.to_datetime(df['DateOfBirth'])

df['Age'] = df['TestYear'] - df['DateOfBirth'].dt.year

Output:

    TestYear    DateOfBirth     Age
E_ID            
1   2010    1947-01-01  63
1   2011    1947-01-01  64
1   2012    1947-01-01  65
2   2010    1990-01-01  20
2   2011    1990-01-01  21
2   2012    1990-01-01  22
3   2010    1999-01-01  11
3   2011    1999-01-01  12
3   2012    1999-01-01  13
4   2010    1991-01-01  19
4   2011    1991-01-01  20
4   2012    1991-01-01  21
5   2010    1989-01-01  21
5   2011    1989-01-01  22
5   2012    1989-01-01  23
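
One detail of the explode step may be worth checking: DataFrame.explode leaves the exploded column with object dtype, so casting it back to int before further numeric work can help. The sketch below assumes a two-employee frame shaped like the result of the groupby/agg step; the names `per_emp` and `long` are illustrative.

```python
import pandas as pd

# One row per employee, shaped like the output of the groupby/agg step
# (only two employees here, to keep the sketch short).
per_emp = pd.DataFrame(
    {
        "TestYear": [[2010, 2011, 2012], [2010, 2011, 2012]],
        "DateOfBirth": pd.to_datetime(["1947-01-01", "1990-01-01"]),
    },
    index=pd.Index([1, 2], name="E_ID"),
)

# One row per (employee, year); the E_ID index value is repeated.
long = per_emp.explode("TestYear")
dtype_after_explode = long["TestYear"].dtype  # object, not an integer dtype

# Cast back to integers so the age column is numeric.
long["TestYear"] = long["TestYear"].astype(int)
long["Age"] = long["TestYear"] - long["DateOfBirth"].dt.year

print(long)
```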