I have a DataFrame named "table" which looks like this:
Date ID
0 2015-12-30 2658789.0
1 2017-04-12 2658789.0
2 2014-06-23 2658789.0
3 2015-12-30 2658789.0
4 2016-01-06 2658789.0
5 2012-10-15 2658731.0
6 2011-04-14 2660120.0
7 2015-09-28 2660120.0
8 2017-02-10 2660120.0
9 2016-12-12 2660649.0
10 2017-05-03 2660649.0
11 2016-05-25 2660649.0
12 2015-09-28 2660649.0
13 2014-08-11 2660649.0
14 2016-01-22 2660649.0
print(table.to_dict())
{'Date': {0: '2015-12-30', 1: '2017-04-12', 2: '2014-06-23', 3: '2015-12-30', 4: '2016-01-06', 5: '2012-10-15', 6: '2011-04-14', 7: '2015-09-28', 8: '2017-02-10', 9: '2016-12-12', 10: '2017-05-03', 11: '2016-05-25', 12: '2015-09-28', 13: '2014-08-11', 14: '2016-01-22'}, 'ID': {0: 2658789.0, 1: 2658789.0, 2: 2658789.0, 3: 2658789.0, 4: 2658789.0, 5: 2658731.0, 6: 2660120.0, 7: 2660120.0, 8: 2660120.0, 9: 2660649.0, 10: 2660649.0, 11: 2660649.0, 12: 2660649.0, 13: 2660649.0, 14: 2660649.0}}
The dtypes of the two columns in table:
table.dtypes
Date object
ID float64
dtype: object
I need to compute the average of the follow-up times for each ID.
For each ID, follow-up times are defined as the time period (in days) between
the latest date and the first date
the second-latest date and the first date
the third-latest date and the first date
etc.
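To make the definition concrete, here is a small sketch using the unique dates of ID 2658789.0 from the table above. Note this is just my reading of the definition (differences taken against the first date, first date itself excluded, duplicates dropped); other readings are possible:

```python
import pandas as pd

# Unique, sorted dates for ID 2658789.0 from the sample data
dates = pd.to_datetime(['2014-06-23', '2015-12-30', '2016-01-06', '2017-04-12'])

first = dates.min()
# Follow-up times: every later date minus the first date, in days
followups = [(d - first).days for d in dates[1:]]
avg = sum(followups) / len(followups)
print(followups, avg)
```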
I wrote a script for a simpler situation (see below), in which the follow-up time is just the time period between the earliest and the latest date for each ID. The problem with my script is that it iterates over rows, so it takes a long time to run.
import pandas as pd
from datetime import datetime

# Group and get distinct Dates by ID
table = table['Date'].groupby(table['ID']).unique().reset_index()

# Get the latest and earliest Dates of each ID
result = []
for index in range(len(table)):
    # Parse each unique date string for this ID
    list_of_dates = [datetime.strptime(date, "%Y-%m-%d") for date in table.iloc[index, 1]]
    n = table.iloc[index, 0]
    x = max(list_of_dates)
    y = min(list_of_dates)
    result.append([n, x, y])
df = pd.DataFrame(result, columns=['ID', 'LatestDate', 'EarliestDate'])
print(df)

# Compute follow-up time in days
df.loc[:, 'LatestDate':'EarliestDate'] = df.loc[:, 'LatestDate':'EarliestDate'].astype('datetime64[ns]')
df['FollowUpPeriod'] = (df['LatestDate'] - df['EarliestDate']).dt.days
df['FollowUpPeriod'].mean()
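As a side note on the original (simpler) script: the row loop can be replaced by a single groupby aggregation, which avoids Python-level iteration entirely. A sketch of that idea with the sample data (named aggregation requires pandas >= 0.25):

```python
import pandas as pd

data = {'Date': ['2015-12-30', '2017-04-12', '2014-06-23', '2015-12-30', '2016-01-06',
                 '2012-10-15', '2011-04-14', '2015-09-28', '2017-02-10', '2016-12-12',
                 '2017-05-03', '2016-05-25', '2015-09-28', '2014-08-11', '2016-01-22'],
        'ID': [2658789.0] * 5 + [2658731.0] + [2660120.0] * 3 + [2660649.0] * 6}
table = pd.DataFrame(data)
table['Date'] = pd.to_datetime(table['Date'])

# Min/max per ID in one vectorized pass instead of a row loop
df = table.groupby('ID')['Date'].agg(EarliestDate='min', LatestDate='max').reset_index()
df['FollowUpPeriod'] = (df['LatestDate'] - df['EarliestDate']).dt.days
print(df['FollowUpPeriod'].mean())
```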
Could anyone help me handle the new situation in a more efficient way?
Any help would be greatly appreciated!
CodePudding user response:
Is this what you need?
import io
import pandas as pd
data = """Date,ID
2015-12-30,2658789.0
2017-04-12,2658789.0
2014-06-23,2658789.0
2015-12-30,2658789.0
2016-01-06,2658789.0
2012-10-15,2658731.0
2011-04-14,2660120.0
2015-09-28,2660120.0
2017-02-10,2660120.0
2016-12-12,2660649.0
2017-05-03,2660649.0
2016-05-25,2660649.0
2015-09-28,2660649.0
2014-08-11,2660649.0
2016-01-22,2660649.0
"""
table = pd.read_csv(io.StringIO(data))
table.Date = pd.to_datetime(table.Date)
table = table.sort_values(by=['Date', 'ID'], ignore_index=True)
df = table.groupby('ID').Date.unique().reset_index()
df['avg'] = df.Date.map(lambda x: sum([x[-1] - d for d in x]) / len(x))
print(df.loc[:, ['ID', 'avg']])
ID avg
0 2658731.0 0 days 00:00:00
1 2658789.0 488 days 18:00:00
2 2660120.0 876 days 16:00:00
3 2660649.0 421 days 20:00:00
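If the `map` with a Python lambda ever becomes a bottleneck, the same per-ID averages (under this answer's reading: each unique date's distance from the latest date, zero included) can be computed without any Python-level loop by broadcasting the per-ID maximum with `transform`. A sketch, reusing the same data:

```python
import io
import pandas as pd

data = """Date,ID
2015-12-30,2658789.0
2017-04-12,2658789.0
2014-06-23,2658789.0
2015-12-30,2658789.0
2016-01-06,2658789.0
2012-10-15,2658731.0
2011-04-14,2660120.0
2015-09-28,2660120.0
2017-02-10,2660120.0
2016-12-12,2660649.0
2017-05-03,2660649.0
2016-05-25,2660649.0
2015-09-28,2660649.0
2014-08-11,2660649.0
2016-01-22,2660649.0
"""
table = pd.read_csv(io.StringIO(data))
table['Date'] = pd.to_datetime(table['Date'])

# Drop duplicate (ID, Date) pairs, then broadcast the per-ID max via transform
u = table.drop_duplicates(['ID', 'Date'])
latest = u.groupby('ID')['Date'].transform('max')
avg = (latest - u['Date']).groupby(u['ID']).mean()
print(avg)
```

This should reproduce the figures in the table above (e.g. 488 days 18:00:00 for ID 2658789.0), since it performs the same computation as the lambda, just fully vectorized.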
CodePudding user response:
import pandas as pd
data = {'Date': {0: '2015-12-30', 1: '2017-04-12', 2: '2014-06-23', 3: '2015-12-30', 4: '2016-01-06', 5: '2012-10-15', 6: '2011-04-14', 7: '2015-09-28', 8: '2017-02-10', 9: '2016-12-12', 10: '2017-05-03', 11: '2016-05-25', 12: '2015-09-28', 13: '2014-08-11', 14: '2016-01-22'}, 'ID': {0: 2658789.0, 1: 2658789.0, 2: 2658789.0, 3: 2658789.0, 4: 2658789.0, 5: 2658731.0, 6: 2660120.0, 7: 2660120.0, 8: 2660120.0, 9: 2660649.0, 10: 2660649.0, 11: 2660649.0, 12: 2660649.0, 13: 2660649.0, 14: 2660649.0}}
df = pd.DataFrame.from_dict(data)
df.Date = pd.to_datetime(df.Date)
(
df
.set_index('ID')
.get('Date')
.sort_values()
.groupby(level=0)
.apply(lambda s: (s.iloc[1:] - s.iloc[0]).drop_duplicates().mean())
.fillna(pd.Timedelta(0))
)
Output:
ID
2658731.0 0 days 00:00:00
2658789.0 713 days 16:00:00
2660120.0 1878 days 12:00:00
2660649.0 689 days 00:00:00
Name: Date, dtype: timedelta64[ns]
Let's check it manually for ID = 2658789.0 to make sure we are on the right track:
df.query('ID==2658789')['Date'].drop_duplicates().sort_values()
2 2014-06-23
0 2015-12-30
4 2016-01-06
1 2017-04-12
So we have to calculate the mean of three differences:
(
  (2015-12-30 - 2014-06-23)
+ (2016-01-06 - 2014-06-23)
+ (2017-04-12 - 2014-06-23)
) / 3
= (555 days + 562 days + 1024 days) / 3
= 713.6(6) days (0.66... days = 24 * 0.66... hours = 16 hours)
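The same manual check can be done with pandas timedeltas directly (a quick sketch):

```python
import pandas as pd

first = pd.Timestamp('2014-06-23')
later = pd.to_datetime(['2015-12-30', '2016-01-06', '2017-04-12'])
diffs = later - first          # TimedeltaIndex: 555, 562 and 1024 days
print(diffs.mean())            # 713 days 16:00:00
```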