The "Age" column in the following dataframe is corrupted: for a fixed User_ID, the age is the same on every "Date". For each row, I would like to subtract from the original age the difference in years between that row's date and the user's most recent date.
import pandas as pd
df = pd.DataFrame({
    "User_ID": ["N1", "N2", "N3", "N1", "N1", "N2", "N3", "N2", "N1", "N1", "N1", "N2"],
    "Date": ["31/10/2021", "31/10/2020", "31/10/2019", "24/10/2019", "22/10/2018", "15/10/2017", "14/10/2017", "13/10/2016", "12/10/2016", "11/10/2015", "2/10/2015", "1/10/2015"],
    "Age": [6, 5, 8, 6, 6, 5, 8, 5, 6, 6, 6, 5]
})
Hence for the dataframe
   User_ID       Date  Age
0       N1 2021-10-31    6
1       N2 2020-10-31    5
2       N3 2019-10-31    8
3       N1 2019-10-24    6
4       N1 2018-10-22    6
5       N2 2017-10-15    5
6       N3 2017-10-14    8
7       N2 2016-10-13    5
8       N1 2016-10-12    6
9       N1 2015-10-11    6
10      N1 2015-10-02    6
11      N2 2015-10-01    5
the outcome should look like
   User_ID       Date  Age
0       N1 2021-10-31    6
1       N2 2020-10-31    5
2       N3 2019-10-31    8
3       N1 2019-10-24    4
4       N1 2018-10-22    3
5       N2 2017-10-15    2
6       N3 2017-10-14    6
7       N2 2016-10-13    1
8       N1 2016-10-12    1
9       N1 2015-10-11    0
10      N1 2015-10-02    0
11      N2 2015-10-01    0
Is there any fast way to do that?
CodePudding user response:
You can create a Series of years, take each user's first year with GroupBy.first via GroupBy.transform, subtract the original y from it to get the differences, and then subtract those differences from the Age column:
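# the dates are day-first strings, so parse them before using .dt
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
y = df['Date'].dt.year
# 'first' is each user's latest year only because the rows are sorted newest first
df['Age'] = df['Age'].sub(y.groupby(df['User_ID']).transform('first').sub(y))
print(df)
   User_ID       Date  Age
0       N1 2021-10-31    6
1       N2 2020-10-31    5
2       N3 2019-10-31    8
3       N1 2019-10-24    4
4       N1 2018-10-22    3
5       N2 2017-10-15    2
6       N3 2017-10-14    6
7       N2 2016-10-13    1
8       N1 2016-10-12    1
9       N1 2015-10-11    0
10      N1 2015-10-02    0
11      N2 2015-10-01    0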
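The 'first' aggregation above gives each user's latest year only when the rows are already sorted newest first, as they are in the sample data. If that ordering is not guaranteed, a possible variant is to use 'max' instead. The following is a minimal sketch under that assumption, not the answer's original method; the explicit format="%d/%m/%Y" and the name latest_year are choices made here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "User_ID": ["N1", "N2", "N3", "N1", "N1", "N2", "N3", "N2", "N1", "N1", "N1", "N2"],
    "Date": ["31/10/2021", "31/10/2020", "31/10/2019", "24/10/2019", "22/10/2018",
             "15/10/2017", "14/10/2017", "13/10/2016", "12/10/2016", "11/10/2015",
             "2/10/2015", "1/10/2015"],
    "Age": [6, 5, 8, 6, 6, 5, 8, 5, 6, 6, 6, 5],
})

# Parse the day-first strings explicitly so that "12/10/2016" becomes 12 October
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Calendar year of every row
y = df["Date"].dt.year

# Latest year per user, broadcast back to the original row order
# (latest_year is a name chosen for this sketch)
latest_year = y.groupby(df["User_ID"]).transform("max")

# Subtract the calendar-year gap to the user's latest record from the age
df["Age"] = df["Age"] - (latest_year - y)

print(df)
```

Because 'max' does not depend on row order, the result stays the same if the dataframe is shuffled. Note that, like the original, this compares calendar years only, not exact elapsed time between the two dates.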