I would like to apply a sequence of 3 month offsets to a series of dates
Example input
0 2001-01-03
1 2000-07-08
2 2004-12-30
dtype: datetime64[ns]
Example output
0 1 2 3 4
0 2001-01-31 2001-04-30 2001-07-31 2001-10-31 2002-01-31
1 2000-07-31 2000-10-31 2001-01-31 2001-04-30 2001-07-31
2 2004-12-31 2005-03-31 2005-06-30 2005-09-30 2005-12-31
How can I go about doing this in pandas?
Some attemps:
start_dates = pd.Series(pd.to_datetime(["2001-01-03", "2000-07-08", "2004-12-30"]))
# works but uses apply
start_dates.apply(lambda date: pd.Series(pd.date_range(start=date, periods=5, freq='3M'), index=np.arange(5)))
# works but uses apply
start_dates.apply(lambda date: date pd.Series(pd.tseries.offsets.DateOffset(months=3)*np.arange(5), index=np.arange(5)))
# UFuncTypeError: ufunc 'add' cannot use operands with types dtype('<M8[ns]') and dtype('O')
start_dates.values[:, None] pd.tseries.offsets.DateOffset(months=3)*np.arange(5)
Is there a better way?
CodePudding user response:
Use your code with concat
pd.concat([start_dates]*5,axis=1).add(pd.tseries.offsets.DateOffset(months=3)*np.arange(5))
CodePudding user response:
Idea is convert datetimes to months by datetime64[M]
, then use numpy broadcasting for add months timedeltas and for last value of month add next month with subtract one day:
months = 3
maximal = 5
arr = (start_dates.to_numpy().astype('datetime64[M]')[:, None]
np.array(np.arange(0, months * (maximal-1) 1 ,months), dtype='timedelta64[M]')
np.array([1], dtype='timedelta64[M]') -
np.array([1], dtype='timedelta64[D]'))
df = pd.DataFrame(arr)
print (df)
0 1 2 3 4
0 2001-01-31 2001-04-30 2001-07-31 2001-10-31 2002-01-31
1 2000-07-31 2000-10-31 2001-01-31 2001-04-30 2001-07-31
2 2004-12-31 2005-03-31 2005-06-30 2005-09-30 2005-12-31
Performance:
start_dates = pd.Series(pd.to_datetime(["2001-01-03", "2000-07-08", "2004-12-30"] * 10000))
arr = (start_dates.to_numpy().astype('datetime64[M]')[:, None]
np.array(np.arange(0,13,3), dtype='timedelta64[M]')
np.array([1], dtype='timedelta64[M]') -
np.array([1], dtype='timedelta64[D]'))
df = pd.DataFrame(arr)
print (df)
In [146]: %timeit pd.DataFrame(arr)
6.08 ms ± 361 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [147]: %timeit pd.concat([start_dates]*5,axis=1).add(pd.tseries.offsets.DateOffset(months=3)*np.arange(5))
PerformanceWarning: Adding/subtracting object-dtype array to DatetimeArray not vectorized
warnings.warn(
4.12 s ± 188 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)