Calculate average daily amount of time between two news posted by a unique Source

Time:06-23

I have a pandas dataframe (the table is simplified for this example). For each day, I want to calculate the average amount of time between two consecutive news items posted by each unique source.

Input

source          date           time     
Investing.com   2017-05-11     08:00:00     
Investing.com   2017-05-11     12:00:00
Investing.com   2017-05-11     16:00:00 
yahoo.com       2017-05-11     09:00:00 
yahoo.com       2017-05-11     12:00:00
yahoo.com       2017-05-11     15:00:00
yahoo.com       2017-05-12     06:00:00 
yahoo.com       2017-05-12     12:00:00
yahoo.com       2017-05-12     18:00:00  

Desired_output

source          date           Average_Daily_time   
Investing.com   2017-05-11     04:00:00      
yahoo.com       2017-05-11     03:00:00
yahoo.com       2017-05-12     06:00:00 

My Attempt

I merged the date and time into a single timestamp column and called it datetime

df.sort_values('datetime').groupby('source')['datetime'].apply(lambda x: x.diff().dt.seconds.mean()/60)

Issue

It calculates the average time across all dates combined, not for each date separately. How can I get the average time per date?

CodePudding user response:

Convert the time column to timedelta, then group the dataframe by source and date and aggregate time with a lambda function that calculates the mean of the differences between consecutive rows

df['time'] = pd.to_timedelta(df['time'])
(
    df.groupby(['source', 'date'])['time']
      .agg(lambda d: d.diff().mean()).reset_index(name='avg')
)

          source        date             avg
0  Investing.com  2017-05-11 0 days 04:00:00
1      yahoo.com  2017-05-11 0 days 03:00:00
2      yahoo.com  2017-05-12 0 days 06:00:00
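
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same approach. The dataframe is rebuilt by hand from the rows in the question, and the sort_values call is my addition, on the assumption that the real data may not already be in time order (diff only makes sense on sorted rows). A source with a single post on a given day has no differences to average, so it should come out as NaT.

```python
import pandas as pd

# Sample rows copied from the question (rebuilt by hand for this sketch).
df = pd.DataFrame({
    "source": ["Investing.com"] * 3 + ["yahoo.com"] * 6,
    "date": ["2017-05-11"] * 6 + ["2017-05-12"] * 3,
    "time": ["08:00:00", "12:00:00", "16:00:00",
             "09:00:00", "12:00:00", "15:00:00",
             "06:00:00", "12:00:00", "18:00:00"],
})

# "HH:MM:SS" strings become timedeltas, so they can be subtracted.
df["time"] = pd.to_timedelta(df["time"])

avg = (
    df.sort_values(["source", "date", "time"])  # added: make sure rows are in time order
      .groupby(["source", "date"])["time"]
      .agg(lambda d: d.diff().mean())           # mean gap between consecutive posts
      .reset_index(name="avg")
)
print(avg)
```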

CodePudding user response:

The 'date' and 'time' data are in separate columns. I also create a datetime column, so the dataframe looks like this:

          source        date     time             datetime
0  Investing.com  2017-05-11  08:00:00 2017-05-11 08:00:00
1  Investing.com  2017-05-11  12:00:00 2017-05-11 12:00:00
2  Investing.com  2017-05-11  16:00:00 2017-05-11 16:00:00
3      yahoo.com  2017-05-11  09:00:00 2017-05-11 09:00:00
4      yahoo.com  2017-05-11  12:00:00 2017-05-11 12:00:00
5      yahoo.com  2017-05-11  15:00:00 2017-05-11 15:00:00
6      yahoo.com  2017-05-12  06:00:00 2017-05-12 06:00:00
7      yahoo.com  2017-05-12  12:00:00 2017-05-12 12:00:00
8      yahoo.com  2017-05-12  18:00:00 2017-05-12 18:00:00

Next, I define a function my_func and store the grouped results in a dataframe. I then reset the multi-index, drop the extra column, and rename the result column. It turned out a little ornate; maybe someone can do it more simply.

import pandas as pd

df['datetime'] = df['date'].str.cat(df['time'], sep=" ")
df['datetime'] = pd.to_datetime(df['datetime'])

def my_func(x):
    result = str(df.loc[x.index,'datetime'].diff().mean())[7:]
    return result

df1 = pd.DataFrame(df.groupby(['source','date'])['date'].apply(my_func))

df1 = df1.stack(0).reset_index()
df1 = df1.drop(columns='level_2')
df1.rename(columns={0: 'Average_Daily_time'}, inplace=True)

print(df1)

Output

          source        date Average_Daily_time
0  Investing.com  2017-05-11           04:00:00
1      yahoo.com  2017-05-11           03:00:00
2      yahoo.com  2017-05-12           06:00:00
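
A possibly simpler variant, offered as a sketch rather than a drop-in replacement: compute the gap to the previous post inside each (source, date) group with groupby(...).diff(), then take the group mean of those gaps. The helper column name gap is made up for this example, the sample dataframe is rebuilt by hand from the question, and the final string formatting assumes every average is shorter than 24 hours.

```python
import pandas as pd

# Sample rows copied from the question (rebuilt by hand for this sketch).
df = pd.DataFrame({
    "source": ["Investing.com"] * 3 + ["yahoo.com"] * 6,
    "date": ["2017-05-11"] * 6 + ["2017-05-12"] * 3,
    "time": ["08:00:00", "12:00:00", "16:00:00",
             "09:00:00", "12:00:00", "15:00:00",
             "06:00:00", "12:00:00", "18:00:00"],
})

df["datetime"] = pd.to_datetime(df["date"] + " " + df["time"])
df = df.sort_values(["source", "datetime"])

# Gap to the previous post of the same source on the same day
# (NaT for the first post of each day). "gap" is a hypothetical helper column.
df["gap"] = df.groupby(["source", "date"])["datetime"].diff()

result = (
    df.groupby(["source", "date"], as_index=False)["gap"]
      .mean()  # NaT values are skipped
      .rename(columns={"gap": "Average_Daily_time"})
)

# Keep only the clock part of "0 days HH:MM:SS"; assumes averages under 24 hours.
result["Average_Daily_time"] = result["Average_Daily_time"].map(
    lambda td: str(td).split()[-1]
)
print(result)
```

This avoids the stack/drop/rename steps because as_index=False keeps source and date as ordinary columns.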