Pandas create rows based on interval between two dates - CodePudding

I am trying to expand a dataframe containing a number of columns by creating rows based on the interval between two date columns.

For this I currently use a method that builds a Cartesian product. It works well on small datasets but is very inefficient on large ones.

This method will be applied to a DataFrame of roughly 2 million rows by 50 columns, spanning multiple years from the minimum to the maximum date. The resulting dataset will have about 3 million rows, so a more efficient approach is required.

I have not succeeded in finding an alternative method which is less resource intensive. What would be the best approach for this?

My current method is shown here:

from datetime import date
import pandas as pd

raw_data = {'id': ['aa0', 'aa1', 'aa2', 'aa3'],
            'number': [1, 2, 2, 1],
            'color': ['blue', 'red', 'yellow', 'green'],
            'date_start': [date(2022, 1, 1), date(2022, 1, 1), date(2022, 1, 7), date(2022, 1, 12)],
            'date_end': [date(2022, 1, 2), date(2022, 1, 4), date(2022, 1, 9), date(2022, 1, 14)]}


df = pd.DataFrame(raw_data)

This gives a four-row dataframe with one row per id.

Next, I create a frame containing every date between the minimum and maximum date of the set:

df_d = pd.DataFrame({'date': pd.date_range(df['date_start'].min(), df['date_end'].max() + pd.Timedelta('1d'), freq='1d')})

This produces a single-column frame containing every date in that range.

Finally, I cross merge the original set with the date set and filter the resulting rows on the start and end date of each row:

df_total = pd.merge(df, df_d, how='cross')
df = df_total[(df_total['date_start'] < df_total['date']) & (df_total['date_end'] >= df_total['date'])]

This leads to the final dataframe, which is exactly what is needed.
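
To see why the cross merge scales poorly, a rough sketch of the row counts involved may help. It rebuilds the sample data from above (only the columns needed here) and compares the rows the cross merge materialises before filtering with the rows that survive the filter. The variable names are illustrative and not part of the original code, and the dates are converted to datetime64 as an assumption to keep the arithmetic vectorised.

```python
from datetime import date
import pandas as pd

df = pd.DataFrame({
    'id': ['aa0', 'aa1', 'aa2', 'aa3'],
    'date_start': [date(2022, 1, 1), date(2022, 1, 1), date(2022, 1, 7), date(2022, 1, 12)],
    'date_end': [date(2022, 1, 2), date(2022, 1, 4), date(2022, 1, 9), date(2022, 1, 14)],
})

# Assumption: work on datetime64 columns rather than datetime.date objects.
start = pd.to_datetime(df['date_start'])
end = pd.to_datetime(df['date_end'])

# Length of the calendar built by pd.date_range(min, max + 1 day),
# which includes both endpoints.
n_calendar_days = (end.max() - start.min()).days + 2

# Rows materialised by the cross merge before any filtering happens.
cross_rows = len(df) * n_calendar_days

# Rows that survive the filter: one per day in (date_start, date_end].
kept_rows = int((end - start).dt.days.sum())

print(cross_rows, kept_rows)
```

On the sample data this is 60 intermediate rows for 8 kept rows. With about 2 million rows and a calendar spanning multiple years, the intermediate frame would be on the order of billions of rows, which is where the method breaks down.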

CodePudding user response：

Efficient Solution

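# Number of days between start and end for each row
d = df['date_end'].sub(df['date_start']).dt.days
# Repeat each row once per elapsed day
df1 = df.reindex(df.index.repeat(d))
# Sequential counter (1, 2, ...) within each original row
i = df1.groupby(level=0).cumcount() + 1

# Offset the start date by the counter to get each day in the interval
df1['date'] = df1['date_start'] + pd.to_timedelta(i, unit='d')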

How does it work?

Subtract the start date from the end date to get the number of days elapsed, then reindex the dataframe by repeating each index label that many times. Next, group df1 by index and use cumcount to create a sequential counter within each original row. Finally, convert this counter to a timedelta series and add it to date_start to get the result.

Result

    id  number   color  date_start    date_end        date
0  aa0       1    blue  2022-01-01  2022-01-02  2022-01-02
1  aa1       2     red  2022-01-01  2022-01-04  2022-01-02
1  aa1       2     red  2022-01-01  2022-01-04  2022-01-03
1  aa1       2     red  2022-01-01  2022-01-04  2022-01-04
2  aa2       2  yellow  2022-01-07  2022-01-09  2022-01-08
2  aa2       2  yellow  2022-01-07  2022-01-09  2022-01-09
3  aa3       1   green  2022-01-12  2022-01-14  2022-01-13
3  aa3       1   green  2022-01-12  2022-01-14  2022-01-14
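
For anyone who wants to run the steps described above end to end, here is a minimal self-contained sketch. It makes one assumption that is not in the answer: the date columns are first converted with pd.to_datetime so that the subtraction and the .dt accessor operate on datetime64 values rather than datetime.date objects. The name expanded is only illustrative.

```python
from datetime import date
import pandas as pd

df = pd.DataFrame({
    'id': ['aa0', 'aa1', 'aa2', 'aa3'],
    'number': [1, 2, 2, 1],
    'color': ['blue', 'red', 'yellow', 'green'],
    'date_start': [date(2022, 1, 1), date(2022, 1, 1), date(2022, 1, 7), date(2022, 1, 12)],
    'date_end': [date(2022, 1, 2), date(2022, 1, 4), date(2022, 1, 9), date(2022, 1, 14)],
})

# Assumption: convert datetime.date objects to datetime64 up front.
df['date_start'] = pd.to_datetime(df['date_start'])
df['date_end'] = pd.to_datetime(df['date_end'])

# Days elapsed per row; this is how many times each row is repeated.
d = (df['date_end'] - df['date_start']).dt.days

# Repeat each index label d times (the index must be unique for reindex).
expanded = df.reindex(df.index.repeat(d))

# Counter 1..d within each original row.
i = expanded.groupby(level=0).cumcount() + 1

# Each generated date is date_start plus the counter, in days.
expanded['date'] = expanded['date_start'] + pd.to_timedelta(i, unit='D')

print(expanded)
```

Note that a row whose date_end equals its date_start is repeated zero times and therefore disappears from the output, which matches the strict date_start < date filter in the question.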

CodePudding user response：

I don't know if this is an improvement, but here pd.date_range is only created from the start and end date of each row. The resulting lists are then exploded and joined back to the original df.

from datetime import date
import pandas as pd

raw_data = {'id': ['aa0', 'aa1', 'aa2', 'aa3'],
            'number': [1, 2, 2, 1],
            'color': ['blue', 'red', 'yellow', 'green'],
            'date_start': [date(2022, 1, 1), date(2022, 1, 1), date(2022, 1, 7), date(2022, 1, 12)],
            'date_end': [date(2022, 1, 2), date(2022, 1, 4), date(2022, 1, 9), date(2022, 1, 14)]}


df = pd.DataFrame(raw_data)

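# Build the dates in (date_start, date_end] for each row, then explode to one date per row
s = df.apply(lambda x: pd.date_range(x['date_start'], x['date_end'], freq='1d', inclusive='right').date, axis=1).explode()
# Join the exploded dates back onto the original columns by index
df.join(s.rename('date'))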
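
Whether this is an improvement can be checked with a rough timing sketch like the one below. Everything in it is an assumption made for illustration: the synthetic data (5,000 rows with intervals of 1 to 5 days), the helper names expand_repeat and expand_apply, and the use of time.perf_counter. It only compares the two approaches from this thread and is not a rigorous benchmark.

```python
import time
import numpy as np
import pandas as pd

# Hypothetical synthetic data: random start dates in 2022, intervals of 1 to 5 days.
rng = np.random.default_rng(0)
n = 5_000
start = pd.Timestamp('2022-01-01') + pd.to_timedelta(rng.integers(0, 365, n), unit='D')
length = pd.to_timedelta(rng.integers(1, 6, n), unit='D')
big = pd.DataFrame({'id': np.arange(n), 'date_start': start, 'date_end': start + length})

def expand_repeat(df):
    # Repeat each row once per elapsed day, then offset date_start by a counter.
    d = (df['date_end'] - df['date_start']).dt.days
    out = df.reindex(df.index.repeat(d))
    i = out.groupby(level=0).cumcount() + 1
    out['date'] = out['date_start'] + pd.to_timedelta(i, unit='D')
    return out

def expand_apply(df):
    # Build one date_range per row, explode it, and join it back by index.
    s = df.apply(
        lambda x: pd.date_range(x['date_start'], x['date_end'],
                                freq='D', inclusive='right').date,
        axis=1,
    ).explode()
    return df.join(s.rename('date'))

t0 = time.perf_counter()
a = expand_repeat(big)
t_repeat = time.perf_counter() - t0

t0 = time.perf_counter()
b = expand_apply(big)
t_apply = time.perf_counter() - t0

print(f'repeat/cumcount: {t_repeat:.3f}s, apply/explode: {t_apply:.3f}s, rows: {len(a)}')
```

The row-wise apply calls pd.date_range once per row in Python, so it would be expected to fall behind the repeat-based version as the row count grows, but the actual numbers depend on the machine and the pandas version.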