I have customer data, and campaign data recording each time we have contacted a customer. We don't contact every customer in every campaign, so the last-contacted (touched) date varies per customer. How can I achieve a groupby count, but between two dates that vary for each cust_id?
import pandas as pd
import io
tempCusts=u"""cust_id, lastBookedDate
1, 10-02-2022
2, 20-04-2022
3, 25-07-2022
4, 10-06-2022
5, 10-05-2022
6, 10-08-2022
7, 01-01-2021
8, 02-06-2022
9, 11-12-2021
10, 10-05-2022
"""
tempCamps=u"""cust_id,campaign_id,campaignMonth,campaignYear,touch,campaignDate,campaignNum
1,CN2204,4,2022,1,01-04-2022,1
2,CN2204,4,2022,1,01-04-2022,1
3,CN2204,4,2022,1,01-04-2022,1
4,CN2204,4,2022,1,01-04-2022,1
5,CN2204,4,2022,1,01-04-2022,1
6,CN2204,4,2022,1,01-04-2022,1
7,CN2204,4,2022,1,01-04-2022,1
8,CN2204,4,2022,1,01-04-2022,1
9,CN2204,4,2022,1,01-04-2022,1
10,CN2204,4,2022,1,01-04-2022,1
1,CN2205,5,2022,1,01-05-2022,2
2,CN2205,5,2022,1,01-05-2022,2
3,CN2205,5,2022,1,01-05-2022,2
4,CN2205,5,2022,1,01-05-2022,2
5,CN2205,5,2022,1,01-05-2022,2
6,CN2206,6,2022,1,01-06-2022,3
7,CN2206,6,2022,1,01-06-2022,3
8,CN2206,6,2022,1,01-06-2022,3
9,CN2206,6,2022,1,01-06-2022,3
10,CN2206,6,2022,1,01-06-2022,3"""
# NB: parse_dates=True only tries to parse the index, so the date columns are still read as strings here
campaignDets = pd.read_csv(io.StringIO(tempCamps), parse_dates=True)
customerDets = pd.read_csv(io.StringIO(tempCusts), parse_dates=True)
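(An aside: the tempCusts header has a space after the comma, so the column comes out named ' lastBookedDate' with a leading space, which the answer below has to account for. Passing read_csv's skipinitialspace flag would strip those stray spaces at load time:

customerDets = pd.read_csv(io.StringIO(tempCusts), skipinitialspace=True)  # columns named without the leading spaces
)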
The campaign details (campaignDets) hold every customer who was part of a campaign; some (most) appear in multiple campaigns as they continue to be contacted, so cust_id is duplicated across campaigns but never within one. The customer details (customerDets) show if/when each customer last had an appointment. For example:
- cust_id 1: lastBooked 10-02-2022, so touchCount since then == 2
- cust_id 2: lastBooked 20-04-2022, so touchCount since then == 1 ...
This is what I'm attempting to achieve:
desired=u"""cust_id,lastBookedDate, touchesSinceBooked
1,10-02-2022,2
2,20-04-2022,1
3,25-07-2022,0
4,10-06-2022,0
5,10-05-2022,0
6,10-08-2022,0
7,01-01-2021,2
8,02-06-2022,0
9,11-12-2021,2
10,10-05-2022,1
"""
desiredDf = pd.read_csv(io.StringIO(desired), parse_dates=True)
>>> desiredDf
cust_id lastBookedDate touchesSinceBooked
0 1 10-02-2022 2
1 2 20-04-2022 1
2 3 25-07-2022 0
3 4 10-06-2022 0
4 5 10-05-2022 0
5 6 10-08-2022 0
6 7 01-01-2021 2
7 8 02-06-2022 0
8 9 11-12-2021 2
9 10 10-05-2022 1
I've attempted to adapt the guidance given on not-dissimilar problems, but those either rely on a fixed date to group on, or haven't worked within the constraints here (unless I'm missing something). I haven't yet been able to cross-relate previous questions, and I'm sure something this simple can't require some awful groupby split per user into a list of dataframes, pulling them back out and looping through a max() of each user_id's campaignDate. Surely not. Can I apply pd.merge_asof within this?
The examples along the same lines that I've taken advice from:
- https://stackoverflow.com/questions/44010314/count-number-of-rows-groupby-within-a-groupby-between-two-dates-in-pandas-datafr
- https://stackoverflow.com/questions/31772863/count-number-of-rows-between-two-dates-by-id-in-a-pandas-groupby-dataframe/31773404
Constraints? None. I'm happy to use any available library and/or helper columns. Neither dataframe is especially large (customerDets is ~120k rows, campaignDets ~600k), and I have time, so optimised approaches, though welcome, are secondary to an actual solution.
CodePudding user response:
First, convert the date columns to datetime:
customerDets['lastBookedDate'] = pd.to_datetime(customerDets[' lastBookedDate'], dayfirst=True)  # note the leading space: the CSV header is "cust_id, lastBookedDate"
campaignDets['campaignDate'] = pd.to_datetime(campaignDets['campaignDate'], dayfirst=True)
Then, keep only the rows where the campaign date is later than that customer's last booked date:
# look up each row's lastBookedDate by mapping cust_id against customerDets
lastBooked = campaignDets['cust_id'].map(customerDets.set_index('cust_id')['lastBookedDate'])
df = campaignDets[campaignDets['campaignDate'] > lastBooked]
Finally, sum the remaining touches per customer and map the result back to add your new column:
customerDets['touchesSinceBooked'] = customerDets['cust_id'].map(df.groupby('cust_id')['touch'].sum()).fillna(0)
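Note that customers with no post-booking touches come back as NaN from the map, which makes the whole column float (hence the 2.0 values below); if you want the integer counts shown in the desired output, cast afterwards:

customerDets['touchesSinceBooked'] = customerDets['touchesSinceBooked'].astype(int)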
You'll get:
cust_id lastBookedDate touchesSinceBooked
0 1 10-02-2022 2.0
1 2 20-04-2022 1.0
2 3 25-07-2022 0.0
3 4 10-06-2022 0.0
4 5 10-05-2022 0.0
5 6 10-08-2022 0.0
6 7 01-01-2021 2.0
7 8 02-06-2022 0.0
8 9 11-12-2021 2.0
9 10 10-05-2022 1.0
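As for the pd.merge_asof idea raised in the question: it can work, though with only one booking date per customer it is arguably heavier than the map above. A minimal sketch, assuming the datetime conversions from the first step have already been applied (allow_exact_matches=False reproduces the strict > comparison):

# merge_asof requires both frames to be sorted on their date keys
merged = pd.merge_asof(
    campaignDets.sort_values('campaignDate'),
    customerDets.sort_values('lastBookedDate'),
    left_on='campaignDate',
    right_on='lastBookedDate',
    by='cust_id',
    direction='backward',       # latest booking at or before each touch
    allow_exact_matches=False,  # strict '>', matching the filter above
)
# touches with no earlier booking get NaT and are dropped before counting
touches = merged.dropna(subset=['lastBookedDate']).groupby('cust_id')['touch'].sum()
customerDets['touchesSinceBooked'] = customerDets['cust_id'].map(touches).fillna(0).astype(int)

merge_asof buys you nothing extra here because each customer has a single lastBookedDate, but it generalises nicely if you ever track multiple bookings per customer and want touches since the most recent one.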