For each id in my dataframe, I want to sample rows between a random date range period (can be 1, 2, 4 weeks) that is inside the min and max dates of that given id. For example, for given id I have data between 2022-09-01
and 2022-09-28
I want to sample random 7 days which would give me all records between a random start and end date like 2022-09-06
and 2022-09-12
.
One idea I have is just randomly sampling a start_date and then adding my desired interval on top of it to find end_date, check if it is a viable end_date if not repeat the first start_date sampling if it is viable then filter the data based on that range, but with this approach I am missing out on a lot of combinations and feel like there has to be a better way for this.
Example df:
data = [{'account_id': 1, 'date': Timestamp('2022-09-01 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-02 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-03 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-04 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-05 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-06 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-07 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-08 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-09 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-10 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-11 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-12 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-13 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-14 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-15 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-16 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-17 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-18 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-19 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-20 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-21 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-22 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-23 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-24 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-25 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-26 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-27 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-28 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-01 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-02 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-03 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-04 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-05 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-06 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-07 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-08 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-09 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-10 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-11 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-12 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-13 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-14 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-15 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-16 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-17 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-18 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-19 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-20 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-21 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-22 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-23 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-24 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-25 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-26 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-27 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-28 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-29 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-30 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-10-01 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-10-02 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-10-03 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-10-04 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-10-05 00:00:00')}]
df = pd.DataFrame(data)
Output example:
output_data = [{'account_id': 1, 'date': Timestamp('2022-09-06 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-07 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-08 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-09 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-10 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-11 00:00:00')},
{'account_id': 1, 'date': Timestamp('2022-09-12 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-15 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-16 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-17 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-18 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-19 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-20 00:00:00')},
{'account_id': 2, 'date': Timestamp('2022-09-21 00:00:00')}]
output = pd.DataFrame(output_data)
CodePudding user response:
IIUC, here is one way you can do it:
Basically you groupby account_id
, then create a pd.date_range
of the group, starting at its minimum, ending at its maximum minus the interval of days (or anything else. For more information, read the documentation of pandas.tseries.offsets.DateOffset
) you chose, because you can't get random 7 increasing days if you start 2 days before the groups max.
pd.date_range
returns a list with all possible dates you can start. From this list you pick one date using random.choice
and pass it to another pd.date_range
as start, together with period which is again the number of intervals you chose.
import pandas as pd
import random
num_of_days = 7 # or whatever interval you choose
out = (
df
.groupby('account_id')
.apply(lambda grp:
pd.date_range(
start = random.choice(
[*pd.date_range(
grp['date'].min(),
grp['date'].max() - pd.DateOffset(days=num_of_days)
)
]
),
periods = num_of_days)
)
.explode()
.reset_index(name='random_consecutive_days')
)
print(out)
account_id random_consecutive_days
0 1 2022-09-13
1 1 2022-09-14
2 1 2022-09-15
3 1 2022-09-16
4 1 2022-09-17
5 1 2022-09-18
6 1 2022-09-19
7 2 2022-09-20
8 2 2022-09-21
9 2 2022-09-22
10 2 2022-09-23
11 2 2022-09-24
12 2 2022-09-25
13 2 2022-09-26