I have customer data, and campaign data recording each time we have contacted a customer. We don't contact every customer in every campaign, so the last-contacted (touched) date varies per customer. How can I achieve a groupby count, but between two dates that vary for each cust_id?
import pandas as pd
import io
tempCusts=u"""cust_id, lastBookedDate
1, 10-02-2022
2, 20-04-2022
3, 25-07-2022
4, 10-06-2022
5, 10-05-2022
6, 10-08-2022
7, 01-01-2021
8, 02-06-2022
9, 11-12-2021
10, 10-05-2022
"""
tempCamps=u"""cust_id,campaign_id,campaignMonth,campaignYear,touch,campaignDate,campaignNum
1,CN2204,4,2022,1,01-04-2022,1
2,CN2204,4,2022,1,01-04-2022,1
3,CN2204,4,2022,1,01-04-2022,1
4,CN2204,4,2022,1,01-04-2022,1
5,CN2204,4,2022,1,01-04-2022,1
6,CN2204,4,2022,1,01-04-2022,1
7,CN2204,4,2022,1,01-04-2022,1
8,CN2204,4,2022,1,01-04-2022,1
9,CN2204,4,2022,1,01-04-2022,1
10,CN2204,4,2022,1,01-04-2022,1
1,CN2205,5,2022,1,01-05-2022,2
2,CN2205,5,2022,1,01-05-2022,2
3,CN2205,5,2022,1,01-05-2022,2
4,CN2205,5,2022,1,01-05-2022,2
5,CN2205,5,2022,1,01-05-2022,2
6,CN2206,6,2022,1,01-06-2022,3
7,CN2206,6,2022,1,01-06-2022,3
8,CN2206,6,2022,1,01-06-2022,3
9,CN2206,6,2022,1,01-06-2022,3
10,CN2206,6,2022,1,01-06-2022,3"""
# NB: parse_dates=True only tries to parse the index, so the date columns are still read as strings here
campaignDets = pd.read_csv(io.StringIO(tempCamps), parse_dates=True)
customerDets = pd.read_csv(io.StringIO(tempCusts), parse_dates=True)
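(An aside: the tempCusts header has a space after the comma, so the column comes out named ' lastBookedDate' with a leading space, which the answer below has to account for. Passing read_csv's skipinitialspace flag would strip those stray spaces at load time:

customerDets = pd.read_csv(io.StringIO(tempCusts), skipinitialspace=True)  # columns named without the leading spaces
)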
The campaign details (campaignDets) hold every customer who was part of a campaign; some (most) appear in multiple campaigns as they continue to be contacted, so cust_id is duplicated across campaigns but never within one. The customer details (customerDets) show if/when each customer last had an appointment. For example:
- cust_id 1: lastBooked 10-02-2022, so touchCount since then == 2
- cust_id 2: lastBooked 20-04-2022, so touchCount since then == 1 ...
This is what I'm attempting to achieve:
desired=u"""cust_id,lastBookedDate, touchesSinceBooked
1,10-02-2022,2
2,20-04-2022,1
3,25-07-2022,0
4,10-06-2022,0
5,10-05-2022,0
6,10-08-2022,0
7,01-01-2021,2
8,02-06-2022,0
9,11-12-2021,2
10,10-05-2022,1
"""
desiredDf = pd.read_csv(io.StringIO(desired), parse_dates=True)
>>> desiredDf
cust_id lastBookedDate touchesSinceBooked
0 1 10-02-2022 2
1 2 20-04-2022 1
2 3 25-07-2022 0
3 4 10-06-2022 0
4 5 10-05-2022 0
5 6 10-08-2022 0
6 7 01-01-2021 2
7 8 02-06-2022 0
8 9 11-12-2021 2
9 10 10-05-2022 1
I've attempted to adapt the guidance given on not-dissimilar problems, but those either rely on a fixed date to group on, or haven't worked within the constraints here (unless I'm missing something). I haven't yet been able to cross-relate previous questions, and I'm sure something this simple can't require some awful groupby split per user into a list of dataframes, pulling them back out and looping through a max() of each user_id's campaignDate. Surely not. Can I apply pd.merge_asof within this?
The examples along the same lines that I've taken advice from:
- https://stackoverflow.com/questions/44010314/count-number-of-rows-groupby-within-a-groupby-between-two-dates-in-pandas-datafr
- https://stackoverflow.com/questions/31772863/count-number-of-rows-between-two-dates-by-id-in-a-pandas-groupby-dataframe/31773404
Constraints? None. I'm happy to use any available library and/or helper columns. Neither dataframe is especially large (customerDets is ~120k rows, campaignDets ~600k), and I have time, so optimised approaches, though welcome, are secondary to an actual solution.
CodePudding user response:
First, convert the date columns to datetime:
customerDets['lastBookedDate'] = pd.to_datetime(customerDets[' lastBookedDate'], dayfirst=True)  # note the leading space: the CSV header is "cust_id, lastBookedDate"
campaignDets['campaignDate'] = pd.to_datetime(campaignDets['campaignDate'], dayfirst=True)
Then, keep only the rows where the campaign date is later than that customer's last booked date:
# look up each row's lastBookedDate by mapping cust_id against customerDets
lastBooked = campaignDets['cust_id'].map(customerDets.set_index('cust_id')['lastBookedDate'])
df = campaignDets[campaignDets['campaignDate'] > lastBooked]
Finally, sum the remaining touches per customer and map the result back to add your new column:
customerDets['touchesSinceBooked'] = customerDets['cust_id'].map(df.groupby('cust_id')['touch'].sum()).fillna(0)
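Note that customers with no post-booking touches come back as NaN from the map, which makes the whole column float (hence the 2.0 values below); if you want the integer counts shown in the desired output, cast afterwards:

customerDets['touchesSinceBooked'] = customerDets['touchesSinceBooked'].astype(int)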
You'll get:
cust_id lastBookedDate touchesSinceBooked
0 1 10-02-2022 2.0
1 2 20-04-2022 1.0
2 3 25-07-2022 0.0
3 4 10-06-2022 0.0
4 5 10-05-2022 0.0
5 6 10-08-2022 0.0
6 7 01-01-2021 2.0
7 8 02-06-2022 0.0
8 9 11-12-2021 2.0
9 10 10-05-2022 1.0
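As for the pd.merge_asof idea raised in the question: it can work, though with only one booking date per customer it is arguably heavier than the map above. A minimal sketch, assuming the datetime conversions from the first step have already been applied (allow_exact_matches=False reproduces the strict > comparison):

# merge_asof requires both frames to be sorted on their date keys
merged = pd.merge_asof(
    campaignDets.sort_values('campaignDate'),
    customerDets.sort_values('lastBookedDate'),
    left_on='campaignDate',
    right_on='lastBookedDate',
    by='cust_id',
    direction='backward',       # latest booking at or before each touch
    allow_exact_matches=False,  # strict '>', matching the filter above
)
# touches with no earlier booking get NaT and are dropped before counting
touches = merged.dropna(subset=['lastBookedDate']).groupby('cust_id')['touch'].sum()
customerDets['touchesSinceBooked'] = customerDets['cust_id'].map(touches).fillna(0).astype(int)

merge_asof buys you nothing extra here because each customer has a single lastBookedDate, but it generalises nicely if you ever track multiple bookings per customer and want touches since the most recent one.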