Home > front end >  how to fill date range between min and max date for each group?
how to fill date range between min and max date for each group?

Time:05-13

data

userid  data_date 
A        2019-02-03
A        2019-04-05
B        2019-03-04
B        2019-12-31

Goal

import pandas as pd
import janitor 
new_dates = {"data_date": pd.date_range('2019-01-01', '2019-12-31', freq="D")}

d1=df.complete(new_dates,by=['userid'])

The result

userid  data_date 
A        2019-01-01
……       ……
A        2019-04-05
B        2019-01-01
……       ……
B        2019-12-31

My expected result

userid  data_date 
A        2019-02-03
……       ……
A        2019-04-05
B        2019-03-04
……       ……
B        2019-12-31

CodePudding user response:

Since you are building it per group, you can use a lambda expression to get the min and max per group - I will reuse @MarceloAndrioni's data:

# pip install pyjanitor
import pandas as pd
import janitor

df = pd.DataFrame(data=[['A', '2019-02-03'],
                        ['A', '2019-04-05'],
                        ['B', '2019-03-04'],
                        ['B', '2019-12-31']], columns=['userid', 'data_date'])


df.data_date = pd.to_datetime(df.data_date)

new_dates = {'data_date' : lambda df: pd.date_range(df.min(), df.max(), freq='D')}

out = df.complete(new_dates, by = 'userid')

out.head(10)


  userid  data_date
0      A 2019-02-03
1      A 2019-02-04
2      A 2019-02-05
3      A 2019-02-06
4      A 2019-02-07
5      A 2019-02-08
6      A 2019-02-09
7      A 2019-02-10
8      A 2019-02-11
9      A 2019-02-12
out.tail(10)

    userid  data_date
355      B 2019-12-22
356      B 2019-12-23
357      B 2019-12-24
358      B 2019-12-25
359      B 2019-12-26
360      B 2019-12-27
361      B 2019-12-28
362      B 2019-12-29
363      B 2019-12-30
364      B 2019-12-31

You can have a look here for more examples.

CodePudding user response:

Maybe something like this where you create new dataframes using the minmax date and then concatenate them.

import numpy as np
import pandas as pd

df = pd.DataFrame(data=[['A', '2019-02-03'],
                        ['A', '2019-04-05'],
                        ['B', '2019-03-04'],
                        ['B', '2019-12-31']], columns=['userid', 'data_date'])

dfs = []
for userid in df['userid'].unique():

    series = pd.date_range(start=df[df['userid'] == userid]['data_date'].min(),
                           end=df[df['userid'] == userid]['data_date'].max(),
                           freq='D')

    df2 = pd.DataFrame(data=zip(np.full(series.shape, userid), series),
                       columns=['userid', 'data_date'])
    dfs.append(df2)

df = pd.concat(dfs)
  • Related