Home > Blockchain >  Selecting random periods by date
Selecting random periods by date

Time:10-31

For each id in my dataframe, I want to sample rows between a random date range period (can be 1, 2, 4 weeks) that is inside the min and max dates of that given id. For example, for given id I have data between 2022-09-01 and 2022-09-28 I want to sample random 7 days which would give me all records between a random start and end date like 2022-09-06 and 2022-09-12.

One idea I have is just randomly sampling a start_date and then adding my desired interval on top of it to find end_date, check if it is a viable end_date if not repeat the first start_date sampling if it is viable then filter the data based on that range, but with this approach I am missing out on a lot of combinations and feel like there has to be a better way for this.

Example df:

data = [{'account_id': 1, 'date': Timestamp('2022-09-01 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-02 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-03 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-04 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-05 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-06 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-07 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-08 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-09 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-10 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-11 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-12 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-13 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-14 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-15 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-16 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-17 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-18 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-19 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-20 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-21 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-22 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-23 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-24 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-25 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-26 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-27 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-28 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-01 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-02 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-03 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-04 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-05 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-06 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-07 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-08 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-09 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-10 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-11 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-12 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-13 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-14 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-15 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-16 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-17 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-18 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-19 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-20 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-21 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-22 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-23 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-24 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-25 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-26 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-27 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-28 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-29 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-30 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-10-01 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-10-02 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-10-03 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-10-04 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-10-05 00:00:00')}]

df = pd.DataFrame(data)

Output example:

output_data = [{'account_id': 1, 'date': Timestamp('2022-09-06 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-07 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-08 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-09 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-10 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-11 00:00:00')},
 {'account_id': 1, 'date': Timestamp('2022-09-12 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-15 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-16 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-17 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-18 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-19 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-20 00:00:00')},
 {'account_id': 2, 'date': Timestamp('2022-09-21 00:00:00')}]

output = pd.DataFrame(output_data)

CodePudding user response:

IIUC, here is one way you can do it:

Basically you groupby account_id, then create a pd.date_range of the group, starting at its minimum, ending at its maximum minus the interval of days (or anything else. For more information, read the documentation of pandas.tseries.offsets.DateOffset) you chose, because you can't get random 7 increasing days if you start 2 days before the groups max. pd.date_range returns a list with all possible dates you can start. From this list you pick one date using random.choice and pass it to another pd.date_range as start, together with period which is again the number of intervals you chose.

import pandas as pd
import random

num_of_days = 7 # or whatever interval you choose
out = (
    df
    .groupby('account_id')
    .apply(lambda grp: 
           pd.date_range(
               start = random.choice(
                   [*pd.date_range(
                       grp['date'].min(), 
                       grp['date'].max() - pd.DateOffset(days=num_of_days)
                   )
                   ]
               ), 
               periods = num_of_days)
          )
    .explode()
    .reset_index(name='random_consecutive_days')
)
print(out)
    account_id random_consecutive_days
0            1              2022-09-13
1            1              2022-09-14
2            1              2022-09-15
3            1              2022-09-16
4            1              2022-09-17
5            1              2022-09-18
6            1              2022-09-19
7            2              2022-09-20
8            2              2022-09-21
9            2              2022-09-22
10           2              2022-09-23
11           2              2022-09-24
12           2              2022-09-25
13           2              2022-09-26
  • Related