I have dataframe (df below) with two date columns, DATE_IN and DATE_OUT. I want to distribute this data on month (df_target). df_target has two columns, PERIOD consisting of all encompassing dates in df, and 'n', set to zero.
The idea is simple. If one ID has a a date range of 12 months, say between july 2015 and july 2016. Then I want to add 1 to each corresponding month in df_target.
import pandas as pd
from datetime import datetime
from datetime import timedelta
import numpy as np
df=pd.DataFrame({'ID': range(0,10),
'DATE_IN':['2015-05-23', '2016-08-13', '2014-06-13', '2014-03-19', '2013-01-13',
'2014-03-13', '2014-04-27', '2014-02-13', '2015-03-03', '2016-03-13'],
'DATE_OUT':['2015-08-12', '2017-09-30', '2017-05-17', '2015-12-24', '2015-02-15',
'2017-03-19', '2016-02-20', '2015-01-10', '2015-09-21', '2016-04-23'],
'CODE':[10,10,10,10,10,10,10,10,10,10]
})
df['DATE_IN'] = pd.to_datetime(df.DATE_IN)
df['DATE_OUT'] = pd.to_datetime(df.DATE_OUT)
target=pd.DataFrame({'PERIOD':pd.date_range(df.DATE_IN.min(), df.DATE_OUT.max(), freq='M'),
'n':int(0)})
My attempt below is a for loop. Inside the loop I create a dataframe of one column, a date range. Then I use that month to access df_target at the corresponding month and add 1. This works. But performance is important to me in this case. I want to execute this, or an improved version of this, on a dataset with 8-9 million rows. Then perhaps a for loop will be slow?
Any ideas on how to improve this solutions?
for z in range(0, len(df)):
tmp=pd.DataFrame({'PERIOD':pd.date_range(df.DATE_IN[z], df.DATE_OUT[z], freq='M')})
for i in range(0, len(tmp)):
info=target[target['PERIOD']==tmp.PERIOD[i]].n 1
target.loc[info.index[0], 'n'] = (target.loc[info.index[0], 'n'] 1)
CodePudding user response:
You can use pd.date_range
to expand dates between DATE_IN
and DATE_OUT
columns then explode it. Finally, use count unique values for each month.
target = (
df.apply(lambda x: pd.date_range(x['DATE_IN'], x['DATE_OUT'], freq='M'), axis=1)
.explode().value_counts().sort_index().rename('n').rename_axis('PERIOD').reset_index()
)
print(target)
# Output
PERIOD n
0 2013-01-31 1
1 2013-02-28 1
2 2013-03-31 1
3 2013-04-30 1
4 2013-05-31 1
5 2013-06-30 1
...
52 2017-05-31 1
53 2017-06-30 1
54 2017-07-31 1
55 2017-08-31 1
56 2017-09-30 1