Distribute out data on month using date_range - improve solution-CodePudding

I have dataframe (df below) with two date columns, DATE_IN and DATE_OUT. I want to distribute this data on month (df_target). df_target has two columns, PERIOD consisting of all encompassing dates in df, and 'n', set to zero.

The idea is simple. If one ID has a a date range of 12 months, say between july 2015 and july 2016. Then I want to add 1 to each corresponding month in df_target.

import pandas as pd
from datetime import datetime
from datetime import timedelta
import numpy as np

df=pd.DataFrame({'ID': range(0,10),
              'DATE_IN':['2015-05-23', '2016-08-13', '2014-06-13', '2014-03-19', '2013-01-13',
                        '2014-03-13', '2014-04-27', '2014-02-13', '2015-03-03', '2016-03-13'],
              'DATE_OUT':['2015-08-12', '2017-09-30', '2017-05-17', '2015-12-24', '2015-02-15',
                        '2017-03-19', '2016-02-20', '2015-01-10', '2015-09-21', '2016-04-23'],
              'CODE':[10,10,10,10,10,10,10,10,10,10]
                          })

df['DATE_IN'] = pd.to_datetime(df.DATE_IN)
df['DATE_OUT'] = pd.to_datetime(df.DATE_OUT)

target=pd.DataFrame({'PERIOD':pd.date_range(df.DATE_IN.min(), df.DATE_OUT.max(), freq='M'),
             'n':int(0)})

My attempt below is a for loop. Inside the loop I create a dataframe of one column, a date range. Then I use that month to access df_target at the corresponding month and add 1. This works. But performance is important to me in this case. I want to execute this, or an improved version of this, on a dataset with 8-9 million rows. Then perhaps a for loop will be slow?

Any ideas on how to improve this solutions?


for z in range(0, len(df)):
    
    tmp=pd.DataFrame({'PERIOD':pd.date_range(df.DATE_IN[z], df.DATE_OUT[z], freq='M')})
    
    for i in range(0, len(tmp)):
        
        info=target[target['PERIOD']==tmp.PERIOD[i]].n   1
        target.loc[info.index[0], 'n'] = (target.loc[info.index[0], 'n']   1)

CodePudding user response：

You can use pd.date_range to expand dates between DATE_IN and DATE_OUT columns then explode it. Finally, use count unique values for each month.

target = (
    df.apply(lambda x: pd.date_range(x['DATE_IN'], x['DATE_OUT'], freq='M'), axis=1)
      .explode().value_counts().sort_index().rename('n').rename_axis('PERIOD').reset_index()
)
print(target)

# Output
       PERIOD  n
0  2013-01-31  1
1  2013-02-28  1
2  2013-03-31  1
3  2013-04-30  1
4  2013-05-31  1
5  2013-06-30  1
...
52 2017-05-31  1
53 2017-06-30  1
54 2017-07-31  1
55 2017-08-31  1
56 2017-09-30  1