The data:

userid data_date
A      2019-02-03
A      2019-04-05
B      2019-03-04
B      2019-12-31

Goal:
import pandas as pd
import janitor  # pip install pyjanitor

# df is the dataframe shown above
new_dates = {"data_date": pd.date_range('2019-01-01', '2019-12-31', freq="D")}
d1 = df.complete(new_dates, by=['userid'])
The result:

userid data_date
A      2019-01-01
……     ……
A      2019-04-05
B      2019-01-01
……     ……
B      2019-12-31
My expected result (each user's dates should run only between that user's own first and last date):

userid data_date
A      2019-02-03
……     ……
A      2019-04-05
B      2019-03-04
……     ……
B      2019-12-31
CodePudding user response:
Since you are building the dates per group (via `by`), you can use a lambda expression to get the min and max per group, so each user's range is bounded by that user's own dates rather than the full-year range passed in the question. I will reuse @MarceloAndrioni's data:
# pip install pyjanitor
import pandas as pd
import janitor

df = pd.DataFrame(data=[['A', '2019-02-03'],
                        ['A', '2019-04-05'],
                        ['B', '2019-03-04'],
                        ['B', '2019-12-31']],
                  columns=['userid', 'data_date'])
df.data_date = pd.to_datetime(df.data_date)

# The lambda is evaluated once per userid group, so each user's
# range runs from that user's own min date to its own max date.
new_dates = {'data_date': lambda df: pd.date_range(df.min(), df.max(), freq='D')}
out = df.complete(new_dates, by='userid')
out.head(10)

  userid  data_date
0      A 2019-02-03
1      A 2019-02-04
2      A 2019-02-05
3      A 2019-02-06
4      A 2019-02-07
5      A 2019-02-08
6      A 2019-02-09
7      A 2019-02-10
8      A 2019-02-11
9      A 2019-02-12
out.tail(10)

    userid  data_date
355      B 2019-12-22
356      B 2019-12-23
357      B 2019-12-24
358      B 2019-12-25
359      B 2019-12-26
360      B 2019-12-27
361      B 2019-12-28
362      B 2019-12-29
363      B 2019-12-30
364      B 2019-12-31
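As a quick sanity check (a sketch, not part of the original answer), you can aggregate the completed frame per user and confirm that each group only spans its own dates:

# Per-user min/max/count of `out` from the code above.
# Based on the sample data, A should cover 2019-02-03..2019-04-05 (62 days)
# and B should cover 2019-03-04..2019-12-31 (303 days), 365 rows in total.
out.groupby('userid')['data_date'].agg(['min', 'max', 'count'])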
You can have a look at the pyjanitor documentation for more examples.
CodePudding user response:
Maybe something like this, where you create a new dataframe per user spanning that user's min and max dates and then concatenate them:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[['A', '2019-02-03'],
                        ['A', '2019-04-05'],
                        ['B', '2019-03-04'],
                        ['B', '2019-12-31']],
                  columns=['userid', 'data_date'])

dfs = []
for userid in df['userid'].unique():
    # Daily range between this user's first and last date
    series = pd.date_range(start=df[df['userid'] == userid]['data_date'].min(),
                           end=df[df['userid'] == userid]['data_date'].max(),
                           freq='D')
    df2 = pd.DataFrame(data=zip(np.full(series.shape, userid), series),
                       columns=['userid', 'data_date'])
    dfs.append(df2)

df = pd.concat(dfs)
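For reference, here is a more compact plain-pandas sketch of the same idea (not part of the original answer; the name `filled` is just illustrative). It rebuilds the sample data, computes one date range per user, and explodes it into one row per day:

import pandas as pd

df = pd.DataFrame(data=[['A', '2019-02-03'],
                        ['A', '2019-04-05'],
                        ['B', '2019-03-04'],
                        ['B', '2019-12-31']],
                  columns=['userid', 'data_date'])
df['data_date'] = pd.to_datetime(df['data_date'])

# One list of daily dates per user, then one row per date.
filled = (
    df.groupby('userid')['data_date']
      .agg(['min', 'max'])
      .apply(lambda r: pd.date_range(r['min'], r['max'], freq='D').tolist(), axis=1)
      .rename('data_date')
      .explode()
      .reset_index()
)
filled['data_date'] = pd.to_datetime(filled['data_date'])  # explode returns object dtype

This should produce the same rows as the loop-based version: one row per user per day between that user's first and last date.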