Calculate the average daily amount of time between two news items posted by a unique source

I have a pandas dataframe (simplified here, with most rows sharing a single date). I want to calculate, for each day, the average amount of time between two consecutive news items posted by the same source.

Input

source          date           time     
Investing.com   2017-05-11     08:00:00     
Investing.com   2017-05-11     12:00:00
Investing.com   2017-05-11     16:00:00 
yahoo.com       2017-05-11     09:00:00 
yahoo.com       2017-05-11     12:00:00
yahoo.com       2017-05-11     15:00:00
yahoo.com       2017-05-12     06:00:00 
yahoo.com       2017-05-12     12:00:00
yahoo.com       2017-05-12     18:00:00

Desired_output

source          date           Average_Daily_time   
Investing.com   2017-05-11     04:00:00      
yahoo.com       2017-05-11     03:00:00
yahoo.com       2017-05-12     06:00:00

My Attempt

I merged the date and time into a single timestamp column called datetime:

df.sort_values('datetime').groupby('source')['datetime'].apply(lambda x: x.diff().dt.seconds.mean()/60)

Issue

It calculates the average time across all dates combined, not for each date separately. How can I get the average time for each date?
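A minimal sketch reproducing the issue, assuming the sample data above and a combined datetime column. Grouping by source alone lets diff() run across the day boundary, so the 15-hour overnight gap for yahoo.com is averaged in with the same-day gaps. The frame construction here is an assumption about how the real data is laid out.

```python
import pandas as pd

# Sample rows from the question, already merged into one datetime column.
df = pd.DataFrame({
    "source": ["Investing.com"] * 3 + ["yahoo.com"] * 6,
    "datetime": pd.to_datetime([
        "2017-05-11 08:00:00", "2017-05-11 12:00:00", "2017-05-11 16:00:00",
        "2017-05-11 09:00:00", "2017-05-11 12:00:00", "2017-05-11 15:00:00",
        "2017-05-12 06:00:00", "2017-05-12 12:00:00", "2017-05-12 18:00:00",
    ]),
})

# The attempt from the question: one average (in minutes) per source.
per_source = (
    df.sort_values("datetime")
      .groupby("source")["datetime"]
      .apply(lambda x: x.diff().dt.seconds.mean() / 60)
)
print(per_source)
```

For yahoo.com the gaps are 3 h, 3 h, 15 h, 6 h and 6 h, which average to 396 minutes instead of the 180 and 360 minutes wanted for the two separate days.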

CodePudding user response：

Convert the time column to timedelta, then group the dataframe by source and date and aggregate time with a lambda function that takes the mean of the differences between consecutive rows:

df['time'] = pd.to_timedelta(df['time'])
(
    df.groupby(['source', 'date'])['time']
      .agg(lambda d: d.diff().mean()).reset_index(name='avg')
)

          source        date             avg
0  Investing.com  2017-05-11 0 days 04:00:00
1      yahoo.com  2017-05-11 0 days 03:00:00
2      yahoo.com  2017-05-12 0 days 06:00:00
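For anyone who wants to run this end to end, here is a self-contained sketch of the same approach. The DataFrame construction is an assumption based on the sample in the question, and the sort_values call is an added precaution (not part of the answer above) in case the rows are not already in chronological order within each group.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "source": ["Investing.com"] * 3 + ["yahoo.com"] * 6,
    "date": ["2017-05-11"] * 6 + ["2017-05-12"] * 3,
    "time": ["08:00:00", "12:00:00", "16:00:00",
             "09:00:00", "12:00:00", "15:00:00",
             "06:00:00", "12:00:00", "18:00:00"],
})

# Convert the clock times to timedeltas so they can be subtracted.
df["time"] = pd.to_timedelta(df["time"])

# Sort, then average the gaps between consecutive rows per source and date.
avg = (
    df.sort_values(["source", "date", "time"])
      .groupby(["source", "date"])["time"]
      .agg(lambda d: d.diff().mean())
      .reset_index(name="avg")
)
print(avg)
```

A group with only one row has no gap to average, so its result would be NaT.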

CodePudding user response：

The 'date' and 'time' values are stored in separate columns, so I also create a combined datetime column. The resulting dataframe looks like this:

          source        date     time             datetime
0  Investing.com  2017-05-11  08:00:00 2017-05-11 08:00:00
1  Investing.com  2017-05-11  12:00:00 2017-05-11 12:00:00
2  Investing.com  2017-05-11  16:00:00 2017-05-11 16:00:00
3      yahoo.com  2017-05-11  09:00:00 2017-05-11 09:00:00
4      yahoo.com  2017-05-11  12:00:00 2017-05-11 12:00:00
5      yahoo.com  2017-05-11  15:00:00 2017-05-11 15:00:00
6      yahoo.com  2017-05-12  06:00:00 2017-05-12 06:00:00
7      yahoo.com  2017-05-12  12:00:00 2017-05-12 12:00:00
8      yahoo.com  2017-05-12  18:00:00 2017-05-12 18:00:00

Next, I define a function my_func, apply it to each (source, date) group, and store the result in a dataframe. I then reset the multi-index, drop the extra column, and rename the result column. It turned out a bit convoluted; there may be a simpler way.

import pandas as pd

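# Strip stray whitespace from the headers (this frame had 'time ' with a trailing space).
df.columns = df.columns.str.strip()

# Combine the date and time strings into a single datetime column.
df['datetime'] = df['date'].str.cat(df['time'], sep=" ")
df['datetime'] = pd.to_datetime(df['datetime'])

def my_func(x):
    # Mean gap between consecutive rows of the group; [7:] drops the "0 days " prefix.
    result = str(df.loc[x.index, 'datetime'].diff().mean())[7:]
    return result

df1 = pd.DataFrame(df.groupby(['source', 'date'])['date'].apply(my_func))

# Flatten the MultiIndex, drop the leftover level, and name the result column.
df1 = df1.stack(0).reset_index()
df1 = df1.drop(columns='level_2')
df1.rename(columns={0: 'Average_Daily_time'}, inplace=True)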

print(df1)

Output

          source        date Average_Daily_time
0  Investing.com  2017-05-11           04:00:00
1      yahoo.com  2017-05-11           03:00:00
2      yahoo.com  2017-05-12           06:00:00
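A possibly simpler variant, offered as a sketch rather than a drop-in replacement: group the datetime column directly and format the mean gap as HH:MM:SS. The helper name format_gap is hypothetical, and the sample frame is rebuilt here from the question's data so the snippet runs on its own.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "source": ["Investing.com"] * 3 + ["yahoo.com"] * 6,
    "date": ["2017-05-11"] * 6 + ["2017-05-12"] * 3,
    "time": ["08:00:00", "12:00:00", "16:00:00",
             "09:00:00", "12:00:00", "15:00:00",
             "06:00:00", "12:00:00", "18:00:00"],
})
df["datetime"] = pd.to_datetime(df["date"] + " " + df["time"])

def format_gap(stamps):
    """Mean gap between consecutive timestamps as HH:MM:SS (hypothetical helper)."""
    gap = stamps.sort_values().diff().mean()
    if pd.isna(gap):
        # A single-row group has no gap to average.
        return None
    hours, rest = divmod(int(gap.total_seconds()), 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

# One formatted average per (source, date) pair.
df1 = (
    df.groupby(["source", "date"])["datetime"]
      .apply(format_gap)
      .reset_index(name="Average_Daily_time")
)
print(df1)
```

This avoids the stack/drop/rename steps because reset_index(name=...) names the result column directly, and it does not rely on slicing the string form of a Timedelta.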