Generate date ranges using group by apply in pandas


I want to imitate Prophet's make_future_dataframe() functionality for multiple groups in a pandas DataFrame.

If I wanted to create a date range as a separate column, I could do:

import pandas as pd
my_dataframe['prediction_range'] = pd.date_range(start=my_dataframe['date_column'].min(),
                                                 periods=48,
                                                 freq='M')

However my dataframe has the following structure:

id feature1 feature2 date_column
1     0         4.3    2022-01-01
2     0         3.3    2022-01-01
3     0         2.2    2022-01-01
4     1034      1.11   2022-01-01
5     1090      0.98   2022-01-01
6     1078      0      2022-01-01

I wanted to do the following:

def generate_date_range(date_column, data):
    dates = pd.date_range(start=data[date_column].unique()[0],
                          periods=48,
                          freq='M')
    return dates

And then:

my_dataframe = my_dataframe.groupby('id').apply(generate_date_range('date_column', my_dataframe))

But I am getting the following:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda/envs/scoring_env/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1377, in apply
    func = com.is_builtin_func(func)
  File "/anaconda/envs/scoring_env/lib/python3.9/site-packages/pandas/core/common.py", line 615, in is_builtin_func
    return _builtin_table.get(arg, arg)
TypeError: unhashable type: 'DatetimeIndex'

I am not sure if I am approaching the problem in the right way. I have also done this with a MultiIndex:

multi_index = pd.MultiIndex.from_product([pd.Index(my_dataframe['id'].unique()), dates], names=('customer', 'prediction_date'))

And then reindexing and filling the NaNs, but I am not able to understand why the apply version does not work.
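For reference, the TypeError seems to come from calling generate_date_range(...) eagerly, so apply receives the resulting DatetimeIndex rather than a callable. A minimal sketch of the callable form (assuming one date per id, and using a month-start frequency) would be:

```python
import pandas as pd

# minimal frame standing in for my_dataframe (one row per id here)
df = pd.DataFrame({
    'id': [1, 2],
    'date_column': pd.to_datetime(['2022-01-01', '2022-01-01']),
})

def generate_date_range(group):
    # apply hands each group of rows in; returning a Series lets
    # pandas stitch the per-id results back together
    return pd.Series(pd.date_range(start=group['date_column'].iloc[0],
                                   periods=48, freq='MS'))

# pass the function itself, not the result of calling it
ranges = df.groupby('id').apply(generate_date_range)
```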

The desired output is:

    id feature1 feature2 date_column prediction_date
    1     0         4.3    2022-01-01 2022-03-01
    1     0         4.3    2022-01-01 2022-04-01
    1     0         4.3    2022-01-01 2022-05-01
    1     0         4.3    2022-01-01 2022-06-01
--- Up to 48 periods --
    2     0         3.3    2022-01-01 2022-03-01
    2     0         3.3    2022-01-01 2022-04-01
    2     0         3.3    2022-01-01 2022-05-01
    2     0         3.3    2022-01-01 2022-06-01

BR E

CodePudding user response:

Try a list comprehension over your groupby object: reindex each group to its date range, then forward fill the id.

df['date_column'] = pd.to_datetime(df['date_column'])
df = df.set_index('date_column')

new_df = pd.concat([g.reindex(pd.date_range(g.index.min(), periods=48, freq='MS'))
                    for _, g in df.groupby('id')])

new_df['id'] = new_df['id'].ffill().astype(int)

            id  feature1  feature2
2022-01-01   1       0.0       4.3
2022-02-01   1       NaN       NaN
2022-03-01   1       NaN       NaN
2022-04-01   1       NaN       NaN
2022-05-01   1       NaN       NaN
...         ..       ...       ...
2025-08-01   6       NaN       NaN
2025-09-01   6       NaN       NaN
2025-10-01   6       NaN       NaN
2025-11-01   6       NaN       NaN
2025-12-01   6       NaN       NaN
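Since the desired output carries the feature columns down each group as well, the same reindex can be followed by a plain forward fill: each group's first reindexed row is the original (non-NaN) record, so the fill never leaks across group boundaries. A sketch with a two-id example frame:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'feature1': [0, 0],
    'feature2': [4.3, 3.3],
    'date_column': pd.to_datetime(['2022-01-01', '2022-01-01']),
}).set_index('date_column')

# 48 month-start rows per id, starting at each group's first date
new_df = pd.concat([
    g.reindex(pd.date_range(g.index.min(), periods=48, freq='MS'))
    for _, g in df.groupby('id')
])

# each group starts with its original non-NaN record, so ffill
# propagates features within a group without crossing into the next
new_df = new_df.ffill()
new_df['id'] = new_df['id'].astype(int)
```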

Update

If there is only one record for each ID, we can do the following. If there is more than one record per ID, we will need to keep only the min date for each ID, perform the task below, and merge everything back together.

# make sure your date is datetime
df['date_column'] = pd.to_datetime(df['date_column'])

# use index.repeat for the number of months you want
# in this case we will offset the min date for 48 months
new_df = df.reindex(df.index.repeat(48)).reset_index(drop=True)

# groupby the id, cumcount and set the type so we can offset
new_df['date_column'] = new_df['date_column'].values.astype('datetime64[M]') \
    + new_df.groupby('id')['date_column'].cumcount().values.astype('timedelta64[M]')



     id  feature1  feature2 date_column
0     1         0       4.3  2022-01-01
1     1         0       4.3  2022-02-01
2     1         0       4.3  2022-03-01
3     1         0       4.3  2022-04-01
4     1         0       4.3  2022-05-01
..   ..       ...       ...         ...
283   6      1078       0.0  2025-08-01
284   6      1078       0.0  2025-09-01
285   6      1078       0.0  2025-10-01
286   6      1078       0.0  2025-11-01
287   6      1078       0.0  2025-12-01
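The multi-record case mentioned above can be sketched as follows: keep the min date per id, expand it into 48 periods with the same datetime64[M]/timedelta64[M] offset trick, then merge the generated range back onto every original row (the prediction_date column name is an assumption matching the desired output):

```python
import pandas as pd

# example with two records for id 1
df = pd.DataFrame({
    'id': [1, 1, 2],
    'feature1': [0, 5, 0],
    'feature2': [4.3, 4.4, 3.3],
    'date_column': pd.to_datetime(['2022-01-01', '2022-02-01', '2022-01-01']),
})

# keep only the earliest date per id, then repeat it 48 times
starts = df.groupby('id', as_index=False)['date_column'].min()
starts = starts.loc[starts.index.repeat(48)].reset_index(drop=True)

# offset each repeated min date by 0..47 months
starts['prediction_date'] = (
    starts['date_column'].values.astype('datetime64[M]')
    + starts.groupby('id').cumcount().values.astype('timedelta64[M]')
)

# merge the generated range back onto every original row for that id
result = df.merge(starts[['id', 'prediction_date']], on='id')
```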