I have a DataFrame named "table" which looks like this:
Date ID
0 2015-12-30 2658789.0
1 2017-04-12 2658789.0
2 2014-06-23 2658789.0
3 2015-12-30 2658789.0
4 2016-01-06 2658789.0
5 2012-10-15 2658731.0
6 2011-04-14 2660120.0
7 2015-09-28 2660120.0
8 2017-02-10 2660120.0
9 2016-12-12 2660649.0
10 2017-05-03 2660649.0
11 2016-05-25 2660649.0
12 2015-09-28 2660649.0
13 2014-08-11 2660649.0
14 2016-01-22 2660649.0
print(table.to_dict())
{'Date': {0: '2015-12-30', 1: '2017-04-12', 2: '2014-06-23', 3: '2015-12-30', 4: '2016-01-06', 5: '2012-10-15', 6: '2011-04-14', 7: '2015-09-28', 8: '2017-02-10', 9: '2016-12-12', 10: '2017-05-03', 11: '2016-05-25', 12: '2015-09-28', 13: '2014-08-11', 14: '2016-01-22'}, 'ID': {0: 2658789.0, 1: 2658789.0, 2: 2658789.0, 3: 2658789.0, 4: 2658789.0, 5: 2658731.0, 6: 2660120.0, 7: 2660120.0, 8: 2660120.0, 9: 2660649.0, 10: 2660649.0, 11: 2660649.0, 12: 2660649.0, 13: 2660649.0, 14: 2660649.0}}
The dtypes of the two columns in table:
table.dtypes
Date object
ID float64
dtype: object
I need to compute the average of the follow-up times for each ID.
For each ID, follow-up times are defined as the time period (in days) between
the latest date and the first date
the second-latest date and the first date
the third-latest date and the first date
etc.
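To make the definition concrete, here is a small sketch using the unique dates of ID 2658789.0 from the table above. Note this is just my reading of the definition (differences taken against the first date, first date itself excluded, duplicates dropped); other readings are possible:

```python
import pandas as pd

# Unique, sorted dates for ID 2658789.0 from the sample data
dates = pd.to_datetime(['2014-06-23', '2015-12-30', '2016-01-06', '2017-04-12'])

first = dates.min()
# Follow-up times: every later date minus the first date, in days
followups = [(d - first).days for d in dates[1:]]
avg = sum(followups) / len(followups)
print(followups, avg)
```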
I wrote a script for a simpler situation (see below), in which the follow-up time is just the time period between the earliest and the latest date for each ID. The problem with my script is that it iterates over rows, so it takes a long time to run.
import pandas as pd
from datetime import datetime

# Group and get distinct Dates by ID
table = table['Date'].groupby(table['ID']).unique().reset_index()

# Get the latest and earliest Dates of each ID
result = []
for index in range(len(table)):
    # Parse each unique date string for this ID
    list_of_dates = [datetime.strptime(date, "%Y-%m-%d") for date in table.iloc[index, 1]]
    n = table.iloc[index, 0]
    x = max(list_of_dates)
    y = min(list_of_dates)
    result.append([n, x, y])
df = pd.DataFrame(result, columns=['ID', 'LatestDate', 'EarliestDate'])
print(df)

# Compute follow-up time in days
df.loc[:, 'LatestDate':'EarliestDate'] = df.loc[:, 'LatestDate':'EarliestDate'].astype('datetime64[ns]')
df['FollowUpPeriod'] = (df['LatestDate'] - df['EarliestDate']).dt.days
df['FollowUpPeriod'].mean()
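As a side note on the original (simpler) script: the row loop can be replaced by a single groupby aggregation, which avoids Python-level iteration entirely. A sketch of that idea with the sample data (named aggregation requires pandas >= 0.25):

```python
import pandas as pd

data = {'Date': ['2015-12-30', '2017-04-12', '2014-06-23', '2015-12-30', '2016-01-06',
                 '2012-10-15', '2011-04-14', '2015-09-28', '2017-02-10', '2016-12-12',
                 '2017-05-03', '2016-05-25', '2015-09-28', '2014-08-11', '2016-01-22'],
        'ID': [2658789.0] * 5 + [2658731.0] + [2660120.0] * 3 + [2660649.0] * 6}
table = pd.DataFrame(data)
table['Date'] = pd.to_datetime(table['Date'])

# Min/max per ID in one vectorized pass instead of a row loop
df = table.groupby('ID')['Date'].agg(EarliestDate='min', LatestDate='max').reset_index()
df['FollowUpPeriod'] = (df['LatestDate'] - df['EarliestDate']).dt.days
print(df['FollowUpPeriod'].mean())
```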
Could anyone help me handle the new situation in a more efficient way?
Any help would be greatly appreciated!
CodePudding user response:
Is this what you need?
import io
import pandas as pd
data = """Date,ID
2015-12-30,2658789.0
2017-04-12,2658789.0
2014-06-23,2658789.0
2015-12-30,2658789.0
2016-01-06,2658789.0
2012-10-15,2658731.0
2011-04-14,2660120.0
2015-09-28,2660120.0
2017-02-10,2660120.0
2016-12-12,2660649.0
2017-05-03,2660649.0
2016-05-25,2660649.0
2015-09-28,2660649.0
2014-08-11,2660649.0
2016-01-22,2660649.0
"""
table = pd.read_csv(io.StringIO(data))
table.Date = pd.to_datetime(table.Date)
table = table.sort_values(by=['Date', 'ID'], ignore_index=True)
df = table.groupby('ID').Date.unique().reset_index()
df['avg'] = df.Date.map(lambda x: sum([x[-1] - d for d in x]) / len(x))
print(df.loc[:, ['ID', 'avg']])
ID avg
0 2658731.0 0 days 00:00:00
1 2658789.0 488 days 18:00:00
2 2660120.0 876 days 16:00:00
3 2660649.0 421 days 20:00:00
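If the `map` with a Python lambda ever becomes a bottleneck, the same per-ID averages (under this answer's reading: each unique date's distance from the latest date, zero included) can be computed without any Python-level loop by broadcasting the per-ID maximum with `transform`. A sketch, reusing the same data:

```python
import io
import pandas as pd

data = """Date,ID
2015-12-30,2658789.0
2017-04-12,2658789.0
2014-06-23,2658789.0
2015-12-30,2658789.0
2016-01-06,2658789.0
2012-10-15,2658731.0
2011-04-14,2660120.0
2015-09-28,2660120.0
2017-02-10,2660120.0
2016-12-12,2660649.0
2017-05-03,2660649.0
2016-05-25,2660649.0
2015-09-28,2660649.0
2014-08-11,2660649.0
2016-01-22,2660649.0
"""
table = pd.read_csv(io.StringIO(data))
table['Date'] = pd.to_datetime(table['Date'])

# Drop duplicate (ID, Date) pairs, then broadcast the per-ID max via transform
u = table.drop_duplicates(['ID', 'Date'])
latest = u.groupby('ID')['Date'].transform('max')
avg = (latest - u['Date']).groupby(u['ID']).mean()
print(avg)
```

This should reproduce the figures in the table above (e.g. 488 days 18:00:00 for ID 2658789.0), since it performs the same computation as the lambda, just fully vectorized.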
CodePudding user response:
import pandas as pd
data = {'Date': {0: '2015-12-30', 1: '2017-04-12', 2: '2014-06-23', 3: '2015-12-30', 4: '2016-01-06', 5: '2012-10-15', 6: '2011-04-14', 7: '2015-09-28', 8: '2017-02-10', 9: '2016-12-12', 10: '2017-05-03', 11: '2016-05-25', 12: '2015-09-28', 13: '2014-08-11', 14: '2016-01-22'}, 'ID': {0: 2658789.0, 1: 2658789.0, 2: 2658789.0, 3: 2658789.0, 4: 2658789.0, 5: 2658731.0, 6: 2660120.0, 7: 2660120.0, 8: 2660120.0, 9: 2660649.0, 10: 2660649.0, 11: 2660649.0, 12: 2660649.0, 13: 2660649.0, 14: 2660649.0}}
df = pd.DataFrame.from_dict(data)
df.Date = pd.to_datetime(df.Date)
(
df
.set_index('ID')
.get('Date')
.sort_values()
.groupby(level=0)
.apply(lambda s: (s.iloc[1:] - s.iloc[0]).drop_duplicates().mean())
.fillna(pd.Timedelta(0))
)
Output:
ID
2658731.0 0 days 00:00:00
2658789.0 713 days 16:00:00
2660120.0 1878 days 12:00:00
2660649.0 689 days 00:00:00
Name: Date, dtype: timedelta64[ns]
Let's check it manually for ID = 2658789.0 to make sure we are on the right track:
df.query('ID==2658789')['Date'].drop_duplicates().sort_values()
2 2014-06-23
0 2015-12-30
4 2016-01-06
1 2017-04-12
So we have to calculate the mean of three differences:
(
  (2015-12-30 - 2014-06-23)
+ (2016-01-06 - 2014-06-23)
+ (2017-04-12 - 2014-06-23)
) / 3
= (555 days + 562 days + 1024 days) / 3
= 713.6(6) days (0.66... days = 24 * 0.66... hours = 16 hours)
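The same manual check can be done with pandas timedeltas directly (a quick sketch):

```python
import pandas as pd

first = pd.Timestamp('2014-06-23')
later = pd.to_datetime(['2015-12-30', '2016-01-06', '2017-04-12'])
diffs = later - first          # TimedeltaIndex: 555, 562 and 1024 days
print(diffs.mean())            # 713 days 16:00:00
```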