Given that I have a pandas dataframe:
waterflow_id created_at
0 5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-20 13:19:21.430816+00:00
1 5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-21 13:19:21.430819+00:00
2 5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-21 13:19:21.430819+00:00
3 5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-22 13:19:21.430821+00:00
4 5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-22 13:19:21.430821+00:00
How do I get the median number of days between consecutive created_at values for each waterflow_id,
so that I end up with a dataframe of median gaps per waterflow id, something like:
waterflow days_median
1 0
2 4
3 6
4 7
5 10
Basically, waterflow here numbers each unique waterflow_id.
Following the latest answer, I tried:
meddata = waterflow_df.groupby("waterflow_id")['created_at'].apply(lambda s: s.diff().median())
print(meddata)
And I received:
waterflow_id
0788a658-06d9-4b61-9ac4-2728ace02a86 0 days
1f8752f8-f667-44ec-84b9-acad02d384c0 0 days
2655b525-8b2c-4a53-abdc-5208cb95d96e 0 days
8d3cd7e3-900c-4996-b202-f66eb41ac37b 0 days
9d02b939-f295-4d36-8f72-e9984a52dbd9 0 days
d8d8fb70-d755-48c3-8c19-8032864719da 0 days
dc1da5e1-6974-4145-a0d8-615e08506ebf 0 days
f39366f5-c9e2-415a-baec-530bb8bd2f07 0 days
What's weird is that I have dates spanning up to 6 months.
CodePudding user response:
The expected output is unclear, but IIUC, you could use GroupBy.agg:
from itertools import count
import pandas as pd

c = count(1)  # running counter used to number the groups 1, 2, 3, ...
df['created_at'] = pd.to_datetime(df['created_at'])
out = (df
       .groupby('waterflow_id')
       .agg(**{'waterflow': ('waterflow_id', lambda s: next(c)),
               # median gap between consecutive rows, floored to whole days
               'days_median': ('created_at', lambda s: s.diff().median()
                                                        .total_seconds()//(3600*24))
               })
       )
or use factorize to number the groups:
df['created_at'] = pd.to_datetime(df['created_at'])
(df.assign(waterflow_id=df['waterflow_id'].factorize()[0] + 1)
   .groupby('waterflow_id')
   .agg(**{'waterflow': ('waterflow_id', 'first'),
           'days_median': ('created_at', lambda s: s.diff().median()
                                                    .total_seconds()//(3600*24))
           })
 )
output:
waterflow days_median
waterflow_id
5ff86588-594e-458f-9d29-385ee2e128e4 1 0.0
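To illustrate how factorize numbers the groups, here is a minimal sketch on made-up ids (the values "b-id", "a-id", "c-id" are hypothetical, not from the question). It assigns integer codes in order of first appearance, starting at 0, so adding 1 gives a 1-based group number:

```python
import pandas as pd

# Hypothetical ids, only for illustration
ids = pd.Series(["b-id", "a-id", "b-id", "c-id"])

# factorize returns (codes, uniques); codes follow order of first appearance
codes, uniques = ids.factorize()

# Shift the 0-based codes to get 1-based group numbers
waterflow = codes + 1
print(list(waterflow))  # [1, 2, 1, 3]
print(list(uniques))    # ['b-id', 'a-id', 'c-id']
```

Note that the numbering depends on row order, not on the alphabetical order of the ids.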
Simple version with just the median:
df['created_at'] = pd.to_datetime(df['created_at'])
out = (df.groupby('waterflow_id')['created_at']
         .apply(lambda s: s.diff().median()
                           .total_seconds()//(3600*24))
       )
output:
waterflow_id
5ff86588-594e-458f-9d29-385ee2e128e4 0.0
Name: created_at, dtype: float64
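A possible explanation for the all-zero medians in the question is that each waterflow_id has many rows with repeated or same-day timestamps, or that the rows are not in chronological order. This is an assumption; the post does not confirm it. Under that assumption, one option is to reduce each group to its distinct calendar days and sort them before taking the differences. A minimal sketch on made-up data (the ids "a" and "b" and the helper name median_gap_days are hypothetical):

```python
import pandas as pd

# Hypothetical data: id "a" has a repeated timestamp, id "b" is unsorted
df = pd.DataFrame({
    "waterflow_id": ["a", "a", "a", "a", "b", "b", "b"],
    "created_at": [
        "2022-03-01 10:00:00+00:00",
        "2022-03-01 10:00:00+00:00",
        "2022-03-05 09:00:00+00:00",
        "2022-03-11 12:00:00+00:00",
        "2022-03-14 08:00:00+00:00",
        "2022-03-10 08:00:00+00:00",
        "2022-03-13 08:00:00+00:00",
    ],
})
df["created_at"] = pd.to_datetime(df["created_at"], utc=True)

def median_gap_days(s):
    # Keep one entry per calendar day, in chronological order
    days = s.dt.normalize().drop_duplicates().sort_values()
    # Median gap between consecutive distinct days, as a float number of days
    return days.diff().median() / pd.Timedelta(days=1)

days_median = df.groupby("waterflow_id")["created_at"].apply(median_gap_days)
print(days_median)
```

Here "a" has distinct days 1, 5 and 11 March (gaps of 4 and 6 days, median 5.0), and "b" has 10, 13 and 14 March (gaps of 3 and 1 days, median 2.0). Whether dropping same-day duplicates is appropriate depends on what a repeated row means in the real data.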