How to get median between dates in a datetime series pandas

Time:03-19

Given that I have a pandas dataframe:

                              waterflow_id                        created_at
0     5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-20 13:19:21.430816+00:00
1     5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-21 13:19:21.430819+00:00
2     5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-21 13:19:21.430819+00:00
3     5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-22 13:19:21.430821+00:00
4     5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-22 13:19:21.430821+00:00

How do I get the median number of days between consecutive created_at values for each waterflow_id, so that I end up with a dataframe like this:

    waterflow     days_median   
        1             0        
        2             4        
        3             6         
        4             7      
        5             10        

Here, waterflow is a sequential number assigned to each unique waterflow_id.

Following the latest answer, I tried:

meddata = waterflow_df.groupby("waterflow_id")['created_at'].apply(lambda s: s.diff().median())
print(meddata)

And I received:

waterflow_id
0788a658-06d9-4b61-9ac4-2728ace02a86   0 days
1f8752f8-f667-44ec-84b9-acad02d384c0   0 days
2655b525-8b2c-4a53-abdc-5208cb95d96e   0 days
8d3cd7e3-900c-4996-b202-f66eb41ac37b   0 days
9d02b939-f295-4d36-8f72-e9984a52dbd9   0 days
d8d8fb70-d755-48c3-8c19-8032864719da   0 days
dc1da5e1-6974-4145-a0d8-615e08506ebf   0 days
f39366f5-c9e2-415a-baec-530bb8bd2f07   0 days

What's weird is that my dates span up to 6 months.

CodePudding user response:

The expected output is unclear, but if I understand correctly, you could use GroupBy.agg:

from itertools import count

import pandas as pd

c = count(1)  # sequential group number, starting at 1
df['created_at'] = pd.to_datetime(df['created_at'])
out = (df
   .groupby('waterflow_id')
   .agg(**{'waterflow': ('waterflow_id', lambda s: next(c)),
           # median gap between consecutive rows, floored to whole days
           'days_median': ('created_at', lambda s: s.diff().median()
                                                    .total_seconds()//(3600*24))
          })
)

or using factorize to number the groups:

df['created_at'] = pd.to_datetime(df['created_at'])
(df.assign(waterflow_id=df['waterflow_id'].factorize()[0] + 1)
   .groupby('waterflow_id')
   .agg(**{'waterflow': ('waterflow_id', 'first'),
           'days_median': ('created_at', lambda s: s.diff().median()
                                                    .total_seconds()//(3600*24))
          })
)

output:

                                      waterflow  days_median
waterflow_id                                                
5ff86588-594e-458f-9d29-385ee2e128e4          1          0.0
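
To make the factorize step concrete, here is a minimal sketch of how it numbers the groups. The short IDs are hypothetical stand-ins for the real UUIDs. factorize assigns codes in order of first appearance, starting at 0, which is why 1 is added.

```python
import pandas as pd

# Hypothetical short IDs standing in for the real waterflow_id values.
ids = pd.Series(["b7", "a1", "b7", "c3", "a1"])

# factorize returns integer codes (0-based, in order of first appearance)
# together with the unique values those codes refer to.
codes, uniques = pd.factorize(ids)
waterflow = codes + 1  # shift to 1-based numbering

print(list(waterflow))  # [1, 2, 1, 3, 2]
print(list(uniques))    # ['b7', 'a1', 'c3']
```

Because the codes follow first appearance, the numbering depends on row order in the dataframe rather than on the sorted order of the IDs.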

Simple version with just the median:

df['created_at'] = pd.to_datetime(df['created_at'])
out = (df.groupby('waterflow_id')['created_at']
         .apply(lambda s: s.diff().median()
                           .total_seconds()//(3600*24))
      )

output:

waterflow_id
5ff86588-594e-458f-9d29-385ee2e128e4    0.0
Name: created_at, dtype: float64
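
Regarding the all-zero result in the question: one possible explanation (an assumption, since the full data is not shown) is that many rows share the same timestamp, as rows 1-2 and 3-4 do in the sample. In that case most consecutive differences are zero, and so is their median, even when the dates span months. A rough sketch with made-up data, where the helper name median_gap_days is hypothetical, drops the repeated timestamps before taking the differences:

```python
import pandas as pd

# Made-up data: a single ID whose timestamps each appear three times.
stamps = ["2022-03-20 13:19:21+00:00"] * 3 \
       + ["2022-03-21 13:19:21+00:00"] * 3 \
       + ["2022-03-24 13:19:21+00:00"] * 3
df = pd.DataFrame({"waterflow_id": ["a"] * 9,
                   "created_at": pd.to_datetime(stamps)})

def median_gap_days(s):
    # Sort, drop repeated timestamps, then take the median of the
    # remaining consecutive gaps, expressed in (fractional) days.
    gaps = s.sort_values().drop_duplicates().diff().dropna()
    return gaps.median() / pd.Timedelta(days=1)

grouped = df.groupby("waterflow_id")["created_at"]
raw = grouped.apply(lambda s: s.diff().median())  # dominated by zero gaps
dedup = grouped.apply(median_gap_days)            # gaps between distinct times

print(raw["a"])    # 0 days 00:00:00
print(dedup["a"])  # 2.0
```

Whether duplicates should be dropped depends on what a repeated created_at means in the data, so treat this as a diagnostic rather than a fix.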