Home > front end >  Broadcasting date offset addition in pandas
Broadcasting date offset addition in pandas

Time:04-27

I would like to apply a sequence of 3 month offsets to a series of dates

Example input

0   2001-01-03
1   2000-07-08
2   2004-12-30
dtype: datetime64[ns]

Example output

           0          1          2          3          4
0 2001-01-31 2001-04-30 2001-07-31 2001-10-31 2002-01-31
1 2000-07-31 2000-10-31 2001-01-31 2001-04-30 2001-07-31
2 2004-12-31 2005-03-31 2005-06-30 2005-09-30 2005-12-31

How can I go about doing this in pandas?

Some attemps:

start_dates = pd.Series(pd.to_datetime(["2001-01-03", "2000-07-08", "2004-12-30"]))

# works but uses apply
start_dates.apply(lambda date: pd.Series(pd.date_range(start=date, periods=5, freq='3M'), index=np.arange(5)))

# works but uses apply
start_dates.apply(lambda date: date pd.Series(pd.tseries.offsets.DateOffset(months=3)*np.arange(5), index=np.arange(5)))

# UFuncTypeError: ufunc 'add' cannot use operands with types dtype('<M8[ns]') and dtype('O')
start_dates.values[:, None]   pd.tseries.offsets.DateOffset(months=3)*np.arange(5)

Is there a better way?

CodePudding user response:

Use your code with concat

pd.concat([start_dates]*5,axis=1).add(pd.tseries.offsets.DateOffset(months=3)*np.arange(5))

CodePudding user response:

Idea is convert datetimes to months by datetime64[M], then use numpy broadcasting for add months timedeltas and for last value of month add next month with subtract one day:

months = 3
maximal = 5
arr = (start_dates.to_numpy().astype('datetime64[M]')[:, None]   
       np.array(np.arange(0, months * (maximal-1)   1 ,months), dtype='timedelta64[M]')  
       np.array([1], dtype='timedelta64[M]') -
       np.array([1], dtype='timedelta64[D]'))
df = pd.DataFrame(arr)
print (df)
           0          1          2          3          4
0 2001-01-31 2001-04-30 2001-07-31 2001-10-31 2002-01-31
1 2000-07-31 2000-10-31 2001-01-31 2001-04-30 2001-07-31
2 2004-12-31 2005-03-31 2005-06-30 2005-09-30 2005-12-31

Performance:

start_dates = pd.Series(pd.to_datetime(["2001-01-03", "2000-07-08", "2004-12-30"] * 10000))

arr = (start_dates.to_numpy().astype('datetime64[M]')[:, None]   
       np.array(np.arange(0,13,3), dtype='timedelta64[M]')  
       np.array([1], dtype='timedelta64[M]') -
       np.array([1], dtype='timedelta64[D]'))
df = pd.DataFrame(arr)
print (df)


In [146]: %timeit pd.DataFrame(arr)
6.08 ms ± 361 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [147]: %timeit pd.concat([start_dates]*5,axis=1).add(pd.tseries.offsets.DateOffset(months=3)*np.arange(5))
PerformanceWarning: Adding/subtracting object-dtype array to DatetimeArray not vectorized
  warnings.warn(
4.12 s ± 188 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • Related