By having the following table, how can I count the days by ID? without use of for or any loop because it's large size data.
ID Date
a 01/01/2020
a 05/01/2020
a 08/01/2020
a 10/01/2020
b 05/05/2020
b 08/05/2020
b 12/05/2020
c 08/08/2020
c 22/08/2020
to have this result
ID Date Days Evolved Since Inicial date
a 01/01/2020 1
a 05/01/2020 4
a 08/01/2020 7
a 10/01/2020 9
b 05/05/2020 1
b 08/05/2020 3
b 12/05/2020 7
c 08/08/2020 1
c 22/08/2020 14
CodePudding user response:
You could do something like (df
your dataframe):
def days_evolved(sdf):
sdf["Days_evolved"] = sdf.Date - sdf.Date.iat[0]
sdf["Days_evolved"].iat[0] = pd.Timedelta(days=1)
return sdf
df = df.groupby("ID", as_index=False, sort=False).apply(days_evolved)
Result for the sample:
ID Date Days_evolved
0 a 2020-01-01 1 days
1 a 2020-01-05 4 days
2 a 2020-01-08 7 days
3 a 2020-01-10 9 days
4 b 2020-05-05 1 days
5 b 2020-05-08 3 days
6 b 2020-05-12 7 days
7 c 2020-08-08 1 days
8 c 2020-08-22 14 days
If you want int
instead of pd.Timedelta
then do
df["Days_evolved"] = df["Days_evolved"].dt.days
at the end.
CodePudding user response:
Use GroupBy.transform
with GroupBy.first
for first values to new column, so possible subtract. Then if not duplicated datetimes is possible replace 0
:
df['new']=df['Date'].sub(df.groupby("ID")['Date'].transform('first')).dt.days.replace(0, 1)
print (df)
ID Date new
0 a 2020-01-01 1
1 a 2020-01-05 4
2 a 2020-01-08 7
3 a 2020-01-10 9
4 b 2020-05-05 1
5 b 2020-05-08 3
6 b 2020-05-12 7
7 c 2020-08-08 1
8 c 2020-08-22 14
Or set 1
for first value of group by Series.where
and Series.duplicated
:
df['new'] = (df['Date'].sub(df.groupby("ID")['Date'].transform('first'))
.dt.days.where(df['ID'].duplicated(), 1))
print (df)
ID Date new
0 a 2020-01-01 1
1 a 2020-01-05 4
2 a 2020-01-08 7
3 a 2020-01-10 9
4 b 2020-05-05 1
5 b 2020-05-08 3
6 b 2020-05-12 7
7 c 2020-08-08 1
8 c 2020-08-22 14