This is a simple one, but it is stumping me. I have a data frame of day-level observations for several individuals, but not every individual is observed on every day. I need to insert rows for the days on which an individual is not observed (with Count set to NaN or 0), between the dates where they are present.
The relevant data look as follows:
ID date StartDate EndDate Count Cov1 Cov2 Cov3
A 05/05/2005 04/04/05 06/06/2006 3 1 F 1
A 06/05/2005 04/04/05 06/06/2006 5 1 F 1
A 07/05/2005 04/04/05 06/06/2006 2 1 F 1
A 10/05/2005 04/04/05 06/06/2006 7 1 F 1
B 05/05/2005 04/04/05 06/06/2006 6 0 M 2
B 07/05/2005 04/04/05 06/06/2006 1 0 M 2
C 01/05/2005 04/04/05 06/06/2006 3 1 F 1
C 03/05/2005 04/04/05 06/06/2006 7 1 F 1
However, I need it to look like this:
ID date StartDate EndDate Count Cov1 Cov2 Cov3
A 05/05/2005 04/04/05 06/06/2006 3 1 F 1
A 06/05/2005 04/04/05 06/06/2006 5 1 F 1
A 07/05/2005 04/04/05 06/06/2006 2 1 F 1
A 08/05/2005 04/04/05 06/06/2006 0 1 F 1
A 09/05/2005 04/04/05 06/06/2006 0 1 F 1
A 10/05/2005 04/04/05 06/06/2006 7 1 F 1
B 05/05/2005 04/04/05 06/06/2006 6 0 M 2
B 06/05/2005 04/04/05 06/06/2006 0 0 M 2
B 07/05/2005 04/04/05 06/06/2006 1 0 M 2
C 01/05/2005 04/04/05 06/06/2006 3 1 F 1
C 02/05/2005 04/04/05 06/06/2006 0 1 F 1
C 03/05/2005 04/04/05 06/06/2006 7 1 F 1
So, the data should be expanded with the days not present in the original but falling between the start and end dates. The Count variable must not be copied into the new rows, while all other covariates should be.
CodePudding user response:
Create a DatetimeIndex from the date column first, then use a custom lambda function with DataFrame.asfreq per ID group, remove the first level of the resulting MultiIndex, convert the index back to a date column, and finally use Series.dt.strftime to restore the original DD/MM/YYYY format:
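As a quick aside before the code: a minimal sketch, with made-up values, of what asfreq('D', method='ffill') does on a Series with a daily gap (not part of the answer's data):

```python
import pandas as pd

# a Series with a gap between 2005-05-07 and 2005-05-10
s = pd.Series([3, 5, 2, 7],
              index=pd.to_datetime(['2005-05-05', '2005-05-06',
                                    '2005-05-07', '2005-05-10']))

# asfreq inserts the missing calendar days;
# method='ffill' copies the last observed value forward into them
out = s.asfreq('D', method='ffill')
print(out)
```

The two inserted days (05-08 and 05-09) carry the forward-filled value 2; resetting such rows to 0 afterwards is exactly what the merge step below does.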
First, check for rows duplicated by the ID and date columns:
print (df[df.duplicated(['ID','date'], keep=False)])
ID date StartDate EndDate Count Cov1 Cov2 Cov3
0 A 05/05/2005 04/04/05 06/06/2006 3 1 F 1
1 A 05/05/2005 04/04/05 06/06/2006 3 1 F 1
If possible, remove duplicates:
# drop exact ID/date duplicates, then parse dates as day-first
df = df.drop_duplicates(['ID','date'])
df['date'] = pd.to_datetime(df['date'], dayfirst=True)

# per ID, reindex to a daily frequency, forward-filling every column
df1 = (df.set_index('date').groupby('ID')
         .apply(lambda x: x.asfreq('D', method='ffill'))
         .droplevel(0)
         .reset_index())
print (df1)
date ID StartDate EndDate Count Cov1 Cov2 Cov3
0 2005-05-05 A 04/04/05 06/06/2006 3 1 F 1
1 2005-05-06 A 04/04/05 06/06/2006 5 1 F 1
2 2005-05-07 A 04/04/05 06/06/2006 2 1 F 1
3 2005-05-08 A 04/04/05 06/06/2006 2 1 F 1
4 2005-05-09 A 04/04/05 06/06/2006 2 1 F 1
5 2005-05-10 A 04/04/05 06/06/2006 7 1 F 1
6 2005-05-05 B 04/04/05 06/06/2006 6 0 M 2
7 2005-05-06 B 04/04/05 06/06/2006 6 0 M 2
8 2005-05-07 B 04/04/05 06/06/2006 1 0 M 2
9 2005-05-01 C 04/04/05 06/06/2006 3 1 F 1
10 2005-05-02 C 04/04/05 06/06/2006 3 1 F 1
11 2005-05-03 C 04/04/05 06/06/2006 7 1 F 1
The index name should now be None:
print (df1.index.name)
None
If in your real data it is ID instead, remove it first with:
df1 = df1.rename_axis(None)
# rows created by asfreq do not exist in the original df, so a left merge
# with indicator=True flags them as 'left_only'; reset their Count to 0
m = df1.merge(df, indicator=True, how='left')['_merge'].eq('left_only')
df1.loc[m, 'Count'] = 0
# restore the original DD/MM/YYYY format
df1['date'] = df1['date'].dt.strftime('%d/%m/%Y')
print (df1)
date ID StartDate EndDate Count Cov1 Cov2 Cov3
0 05/05/2005 A 04/04/05 06/06/2006 3 1 F 1
1 06/05/2005 A 04/04/05 06/06/2006 5 1 F 1
2 07/05/2005 A 04/04/05 06/06/2006 2 1 F 1
3 08/05/2005 A 04/04/05 06/06/2006 0 1 F 1
4 09/05/2005 A 04/04/05 06/06/2006 0 1 F 1
5 10/05/2005 A 04/04/05 06/06/2006 7 1 F 1
6 05/05/2005 B 04/04/05 06/06/2006 6 0 M 2
7 06/05/2005 B 04/04/05 06/06/2006 0 0 M 2
8 07/05/2005 B 04/04/05 06/06/2006 1 0 M 2
9 01/05/2005 C 04/04/05 06/06/2006 3 1 F 1
10 02/05/2005 C 04/04/05 06/06/2006 0 1 F 1
11 03/05/2005 C 04/04/05 06/06/2006 7 1 F 1
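For reference, the steps above can be stitched into one copy-pasteable sketch. The frame below is rebuilt by hand from the question (covariates trimmed to Cov1 for brevity), and the per-group expansion uses an explicit loop with reindex rather than groupby.apply, to sidestep pandas-version differences in how apply treats grouping columns; the indicator merge and date formatting are the answer's own steps:

```python
import pandas as pd

# hand-rebuilt sample from the question (covariates trimmed to Cov1)
df = pd.DataFrame({
    'ID':    ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C'],
    'date':  ['05/05/2005', '06/05/2005', '07/05/2005', '10/05/2005',
              '05/05/2005', '07/05/2005', '01/05/2005', '03/05/2005'],
    'Count': [3, 5, 2, 7, 6, 1, 3, 7],
    'Cov1':  [1, 1, 1, 1, 0, 0, 1, 1],
})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)

def expand(g):
    # reindex one ID's rows onto a full daily range, forward-filling gaps
    g = g.set_index('date')
    full = pd.date_range(g.index.min(), g.index.max(), freq='D')
    return g.reindex(full, method='ffill').rename_axis('date').reset_index()

df1 = pd.concat([expand(g) for _, g in df.groupby('ID')], ignore_index=True)

# rows absent from the original are flagged by the indicator merge;
# their forward-filled Count is reset to 0
m = df1.merge(df, indicator=True, how='left')['_merge'].eq('left_only')
df1.loc[m, 'Count'] = 0
df1['date'] = df1['date'].dt.strftime('%d/%m/%Y')
print(df1)
```

This reproduces the 12-row output shown in the answer (minus the trimmed covariate columns).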
CodePudding user response:
One option is with complete from pyjanitor, which abstracts the process of exposing missing rows:
# pip install pyjanitor
import pandas as pd
import janitor
df = pd.read_clipboard() # ignore this, just me reading in the data
# converting the date to date time,
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
# create an anonymous function to generate all dates
# between the min and max date for each combination
# of ID, StartDate and EndDate
new_dates = {'date':lambda d: pd.date_range(d.min(), d.max(), freq='D')}
(df
.complete(new_dates,
by = ['ID','StartDate','EndDate'],
fill_value = {'Count':0})
.ffill(downcast='infer')
)
ID date StartDate EndDate Count Cov1 Cov2 Cov3
0 A 2005-05-05 04/04/05 06/06/2006 3 1 F 1
1 A 2005-05-06 04/04/05 06/06/2006 5 1 F 1
2 A 2005-05-07 04/04/05 06/06/2006 2 1 F 1
3 A 2005-05-08 04/04/05 06/06/2006 0 1 F 1
4 A 2005-05-09 04/04/05 06/06/2006 0 1 F 1
5 A 2005-05-10 04/04/05 06/06/2006 7 1 F 1
6 B 2005-05-05 04/04/05 06/06/2006 6 0 M 2
7 B 2005-05-06 04/04/05 06/06/2006 0 0 M 2
8 B 2005-05-07 04/04/05 06/06/2006 1 0 M 2
9 C 2005-05-01 04/04/05 06/06/2006 3 1 F 1
10 C 2005-05-02 04/04/05 06/06/2006 0 1 F 1
11 C 2005-05-03 04/04/05 06/06/2006 7 1 F 1
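To make the "abstracts the process" claim concrete, here is roughly the plain-pandas process that complete wraps, as a hedged sketch on a hand-rebuilt subset of the question's data (IDs B and C only, covariates trimmed to Cov1): build every day in each ID's observed range, left-merge the observations back in, then fill:

```python
import pandas as pd

# hand-rebuilt subset of the question's data
df = pd.DataFrame({
    'ID':    ['B', 'B', 'C', 'C'],
    'date':  pd.to_datetime(['05/05/2005', '07/05/2005',
                             '01/05/2005', '03/05/2005'], dayfirst=True),
    'Count': [6, 1, 3, 7],
    'Cov1':  [0, 0, 1, 1],
})

# every day between each ID's first and last observation
full = (df.groupby('ID')['date']
          .agg(['min', 'max'])
          .apply(lambda r: pd.date_range(r['min'], r['max'], freq='D'), axis=1)
          .rename('date')
          .explode()
          .reset_index())
full['date'] = pd.to_datetime(full['date'])  # explode leaves object dtype

# left-merge the observations back in: unobserved days get NaN
out = full.merge(df, on=['ID', 'date'], how='left')
out['Count'] = out['Count'].fillna(0).astype(int)   # like fill_value={'Count': 0}
out['Cov1'] = out.groupby('ID')['Cov1'].ffill()     # like the trailing .ffill()
print(out)
```

The explode/merge pair does the row exposure; the two fill lines mirror complete's fill_value argument and the chained ffill in the answer.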