Grouping by date range (timedelta) with Pandas

This question has been asked before, but I want to expand on it. Because I do not have enough experience points to comment on the original question, I am reposting the link below, followed by my comments:

Grouping by date range with pandas

I believe the asker of that question wants to group items that fall within a specified timedelta of each other (3 days in the question). However, the answers, including the accepted one, group items into fixed 3-day frequency bins using Grouper. This happens to suit the asker because he only wants to group at most two items together, but what happens if this extends to three, four, five or more items?

Continuing the asker's example data (which very closely relates to my own problem):

user_id     date       val
1           1-1-17     1
2           1-1-17     1
3           1-1-17     1
1           1-1-17     1
1           1-2-17     1
2           1-2-17     1
2           1-10-17    1
3           2-1-17     1
3           2-2-17     1
3           2-3-17     2
3           2-4-17     3
3           2-5-17     1

If the rows were grouped by user_id and by dates within +/- 3 days of each other, summing val per group should look like:

user_id     date       sum(val)
1           1-2-17     3
2           1-2-17     2
2           1-10-17    1
3           1-1-17     1
3           2-1-17     8

I'm not sure the last date will actually show as 2-1-17, but the idea is to group all dates within a 3-day timedelta of each other together.
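To make the difference concrete, here is a minimal sketch of what fixed 3-day bins do to the sample data above. The DataFrame construction and the month-day-year date format are my assumptions about how the table would be loaded; the variable name `binned` is only illustrative.

```python
import pandas as pd

# Sample data from the table above; dates are assumed to be month-day-year.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "date": ["1-1-17", "1-1-17", "1-1-17", "1-1-17", "1-2-17", "1-2-17",
             "1-10-17", "2-1-17", "2-2-17", "2-3-17", "2-4-17", "2-5-17"],
    "val": [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 1],
})
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%y")

# Fixed 3-day bins: the bin edges come from the calendar,
# not from the gaps between neighbouring rows.
binned = (
    df.groupby(["user_id", pd.Grouper(key="date", freq="3D")])["val"]
      .sum()
      .reset_index()
)
print(binned)
```

With this data the five February rows of user 3 land in two different bins (sums of 2 and 6) instead of one group with a sum of 8, which is the behaviour I would like to avoid.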

Is this possible in an elegant way using Grouper, resample or other Pandas or Python date functions?

CodePudding user response:

You can use a groupby with a custom group:

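import pandas as pd

# convert to datetime (dates are month-day-year)
s = pd.to_datetime(df['date'], dayfirst=False)
# label runs of dates, per user, whose consecutive gaps are at most 3 days
# (each row is compared with the previous row of the same user,
#  so rows should be in date order within each user_id)
# group_keys=False keeps the original row index so the labels align with df
group = (s.groupby(df['user_id'], group_keys=False)
          .apply(lambda d: d.diff().abs().gt(pd.Timedelta(days=3)).cumsum())
         )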

# group by ID and new group and aggregate
out = (df.groupby(['user_id', group], as_index=False)
         .agg({'date': 'last', 'val': 'sum'})
      )

output:

   user_id     date  val
0        1   1-2-17    3
1        2   1-2-17    2
2        2  1-10-17    1
3        3   1-1-17    1
4        3   2-5-17    8
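For anyone who wants to run this end to end, here is a self-contained sketch of the same idea that does not rely on `apply`. It is a variation, not the exact code above: the column names `when` and `block`, the explicit sort, and the month-day-year format string are my own assumptions.

```python
import pandas as pd

# Sample data from the question, kept as strings for display.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "date": ["1-1-17", "1-1-17", "1-1-17", "1-1-17", "1-2-17", "1-2-17",
             "1-10-17", "2-1-17", "2-2-17", "2-3-17", "2-4-17", "2-5-17"],
    "val": [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 1],
})
# Parsed copy of the dates (hypothetical helper column).
df["when"] = pd.to_datetime(df["date"], format="%m-%d-%y")

# Sort so that diff() compares chronologically adjacent rows per user.
df = df.sort_values(["user_id", "when"])

# A new block starts wherever the gap to the previous row exceeds 3 days.
gap = df.groupby("user_id")["when"].diff()
new_block = gap > pd.Timedelta(days=3)  # first row per user is NaT, which compares False
df["block"] = new_block.astype(int).groupby(df["user_id"]).cumsum()

# Aggregate per user and block; 'block' stays in the result as a regular column.
out = (
    df.groupby(["user_id", "block"], as_index=False)
      .agg(date=("date", "last"), val=("val", "sum"))
)
print(out)
```

Storing the block label as a real column keeps the final groupby on plain column names, which makes it easier to inspect which rows ended up together.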

intermediates (sorted by user_id for clarity):

    user_id     date  val   datetime    diff     abs  >3days  cumsum
0         1   1-1-17    1 2017-01-01     NaT     NaT   False       0
3         1   1-1-17    1 2017-01-01  0 days  0 days   False       0
4         1   1-2-17    1 2017-01-02  1 days  1 days   False       0
1         2   1-1-17    1 2017-01-01     NaT     NaT   False       0
5         2   1-2-17    1 2017-01-02  1 days  1 days   False       0
6         2  1-10-17    1 2017-01-10  8 days  8 days    True       1
2         3   1-1-17    1 2017-01-01     NaT     NaT   False       0
7         3   2-1-17    1 2017-02-01 31 days 31 days    True       1
8         3   2-2-17    1 2017-02-02  1 days  1 days   False       1
9         3   2-3-17    2 2017-02-03  1 days  1 days   False       1
10        3   2-4-17    3 2017-02-04  1 days  1 days   False       1
11        3   2-5-17    1 2017-02-05  1 days  1 days   False       1
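One thing the intermediates make visible is that the grouping chains: a row joins the current group when it is within 3 days of the previous row, not of the first row in the group. The sketch below rebuilds the intermediate columns and measures how long each group ends up; the names `steps`, `block`, `start`, `end` and `span` are illustrative assumptions, not part of the answer's code.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "date": ["1-1-17", "1-1-17", "1-1-17", "1-1-17", "1-2-17", "1-2-17",
             "1-10-17", "2-1-17", "2-2-17", "2-3-17", "2-4-17", "2-5-17"],
    "val": [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 1],
})
df["when"] = pd.to_datetime(df["date"], format="%m-%d-%y")

# Same intermediate steps as in the table, one column per step.
steps = df.sort_values(["user_id", "when"]).copy()
steps["diff"] = steps.groupby("user_id")["when"].diff()
steps["gt_3days"] = steps["diff"] > pd.Timedelta(days=3)
steps["block"] = steps["gt_3days"].astype(int).groupby(steps["user_id"]).cumsum()

# Total length of each group: last date minus first date.
spans = steps.groupby(["user_id", "block"])["when"].agg(start="min", end="max")
spans["span"] = spans["end"] - spans["start"]
print(spans)
```

For user 3 the second group runs from 2017-02-01 to 2017-02-05, a span of 4 days, even though no two neighbouring rows are more than 1 day apart. If every row must be within 3 days of the first row of its group, a different rule would be needed.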