Home > Back-end >  Pandas: How to expand a dataframe between dates and add NaNs to new rows
Pandas: How to expand a dataframe between dates and add NaNs to new rows

Time:07-19

This is a simple one but it is stumping me. I have a data frame consisting of day-level observations by individuals. However, not all individuals are observed on the same day. I need to create rows including the days in which individuals are not observed (i.e., count NaN or 0), between the dates where they are present.

The relevant data looks as follows

ID  date        StartDate  EndDate    Count    Cov1  Cov2 Cov3
A   05/05/2005    04/04/05   06/06/2006   3      1     F    1   
A   06/05/2005    04/04/05   06/06/2006   5      1     F    1
A   07/05/2005    04/04/05   06/06/2006   2      1     F    1
A   10/05/2005    04/04/05   06/06/2006   7      1     F    1
B   05/05/2005    04/04/05   06/06/2006   6      0     M    2   
B   07/05/2005    04/04/05   06/06/2006   1      0     M    2
C   01/05/2005    04/04/05   06/06/2006   3      1     F    1
C   03/05/2005    04/04/05   06/06/2006   7      1     F    1

However, I need it to look like this:

ID  date        StartDate  EndDate    Count    Cov1  Cov2 Cov3
A   05/05/2005    04/04/05   06/06/2006   3      1     F    1   
A   06/05/2005    04/04/05   06/06/2006   5      1     F    1
A   07/05/2005    04/04/05   06/06/2006   2      1     F    1
A   08/05/2005    04/04/05   06/06/2006   0      1     F    1
A   09/05/2005    04/04/05   06/06/2006   0      1     F    1
A   10/05/2005    04/04/05   06/06/2006   7      1     F    1
B   05/05/2005    04/04/05   06/06/2006   6      0     M    2   
B   06/05/2005    04/04/05   06/06/2006   0      0     M    2   
B   07/05/2005    04/04/05   06/06/2006   1      0     M    2
C   01/05/2005    04/04/05   06/06/2006   3      1     F    1
C   02/05/2005    04/04/05   06/06/2006   0      1     F    1
C   03/05/2005    04/04/05   06/06/2006   7      1     F    1

So, the data should expand by days not counted in the original but between the start and end dates. However, the count variable must not copy over, while all other covariates have to copy.

CodePudding user response:

Create DatetimeIndex by column date first, so possible use custom lambda function with DataFrame.asfreq, remove first level of MultiIndex and convert index to column date, last use Series.dt.strftime for original format DD/MM/YYYY:

First is possible test duplicated rows by columns ID, date:

print (df[df.duplicated(['ID','date'], keep=False)])
  ID        date StartDate     EndDate  Count  Cov1 Cov2  Cov3
0  A  05/05/2005  04/04/05  06/06/2006      3     1    F     1
1  A  05/05/2005  04/04/05  06/06/2006      3     1    F     1

If possible, remove duplicates:

df = df.drop_duplicates(['ID','date'])

df['date'] = pd.to_datetime(df['date'], dayfirst=True)

df1 = (df.set_index('date').groupby('ID')
        .apply(lambda x: x.asfreq('D', method='ffill'))
        .droplevel(0)
        .reset_index())

print (df1)
         date ID StartDate     EndDate  Count  Cov1 Cov2  Cov3
0  2005-05-05  A  04/04/05  06/06/2006      3     1    F     1
1  2005-05-06  A  04/04/05  06/06/2006      5     1    F     1
2  2005-05-07  A  04/04/05  06/06/2006      2     1    F     1
3  2005-05-08  A  04/04/05  06/06/2006      2     1    F     1
4  2005-05-09  A  04/04/05  06/06/2006      2     1    F     1
5  2005-05-10  A  04/04/05  06/06/2006      7     1    F     1
6  2005-05-05  B  04/04/05  06/06/2006      6     0    M     2
7  2005-05-06  B  04/04/05  06/06/2006      6     0    M     2
8  2005-05-07  B  04/04/05  06/06/2006      1     0    M     2
9  2005-05-01  C  04/04/05  06/06/2006      3     1    F     1
10 2005-05-02  C  04/04/05  06/06/2006      3     1    F     1
11 2005-05-03  C  04/04/05  06/06/2006      7     1    F     1

print (df1.index.name)
None

If possible in real data it is ID use:

df1 = df1.rename_axis(None)

m = df1.merge(df, indicator=True, how='left')['_merge'].eq('left_only')
df1.loc[m, 'Count'] = 0

df1['date'] = df1['date'].dt.strftime('%d/%m/%Y')
print (df1)
          date ID StartDate     EndDate  Count  Cov1 Cov2  Cov3
0   05/05/2005  A  04/04/05  06/06/2006      3     1    F     1
1   06/05/2005  A  04/04/05  06/06/2006      5     1    F     1
2   07/05/2005  A  04/04/05  06/06/2006      2     1    F     1
3   08/05/2005  A  04/04/05  06/06/2006      0     1    F     1
4   09/05/2005  A  04/04/05  06/06/2006      0     1    F     1
5   10/05/2005  A  04/04/05  06/06/2006      7     1    F     1
6   05/05/2005  B  04/04/05  06/06/2006      6     0    M     2
7   06/05/2005  B  04/04/05  06/06/2006      0     0    M     2
8   07/05/2005  B  04/04/05  06/06/2006      1     0    M     2
9   01/05/2005  C  04/04/05  06/06/2006      3     1    F     1
10  02/05/2005  C  04/04/05  06/06/2006      0     1    F     1
11  03/05/2005  C  04/04/05  06/06/2006      7     1    F     1

CodePudding user response:

One option is with complete from pyjanitor, which abstracts the process of exposing missing rows:

# pip install pyjanitor
import pandas as pd
import janitor

df = pd.read_clipboard() # ignore this, just me reading in the data

# converting the date to date time, 
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
# create an anonymous function to generate all dates 
# between the max and min date for each combination 
# of ID, start date and end date
new_dates = {'date':lambda d: pd.date_range(d.min(), d.max(), freq='D')}

(df
.complete(new_dates, 
          by = ['ID','StartDate','EndDate'], 
          fill_value = {'Count':0})
.ffill(downcast='infer')
)
   ID       date StartDate     EndDate  Count  Cov1 Cov2  Cov3
0   A 2005-05-05  04/04/05  06/06/2006      3     1    F     1
1   A 2005-05-06  04/04/05  06/06/2006      5     1    F     1
2   A 2005-05-07  04/04/05  06/06/2006      2     1    F     1
3   A 2005-05-08  04/04/05  06/06/2006      0     1    F     1
4   A 2005-05-09  04/04/05  06/06/2006      0     1    F     1
5   A 2005-05-10  04/04/05  06/06/2006      7     1    F     1
6   B 2005-05-05  04/04/05  06/06/2006      6     0    M     2
7   B 2005-05-06  04/04/05  06/06/2006      0     0    M     2
8   B 2005-05-07  04/04/05  06/06/2006      1     0    M     2
9   C 2005-05-01  04/04/05  06/06/2006      3     1    F     1
10  C 2005-05-02  04/04/05  06/06/2006      0     1    F     1
11  C 2005-05-03  04/04/05  06/06/2006      7     1    F     1
  • Related