Home > Net >  grouping customer by start and end date
grouping customer by start and end date

Time:03-15

I have df:

Product_Name  Customer    Start_Date  End_Date    Period
      LP    abc           29-Oct-16 3-Nov-16    Monthly
      LP    abc           2-Nov-16  8-Jan-17    Monthly
      LP    abc           5-Dec-16  31-Aug-17   Monthly
      LP    abc           7-Jan-17  16-Jun-17   Monthly
      LP    abc           1-Sep-17  30-Mar-18   Monthly
      LP    abc           17-Oct-17 31-Dec-17   Monthly
      LP    abc           20-Oct-18 31-Aug-19   Monthly
      LP    abc            1-Sep-19 27-Oct-20   Monthly
      LP    abc           28-Oct-20 31-Aug-21   Monthly
      LP    abc            1-Sep-21 9-Mar-22    Monthly
      LP    xyz            7-Sep-15 31-Aug-16   Monthly
      LP    xyz            1-Sep-16 14-Oct-16   Monthly
      LP    xyz           20-Oct-16 31-Aug-17   Monthly
      LP    xyz            1-Sep-17 31-Aug-18   Monthly
      LP    xyz           20-Oct-17 31-Aug-18   Annual
      LP    xyz            1-Sep-18 22-Oct-18   Annual
      LP    xyz            1-Sep-18 31-Aug-19   Monthly
      LP    xyz            1-Sep-19 27-Oct-20   Monthly
      LP    xyz           28-Oct-20 31-Aug-21   Monthly
      LP    xyz            1-Sep-21 9-Mar-22    Monthly

I want to take consecutive dates (as in one date before, one date after or at the same date from the end date) and club the dates together according to the Customer and Period Columns. For example, for customer abc, the earliest start date is 29-Oct-16 and continues till 16-Jun-17, so we need this as one transaction in the output file. Then next transaction in output should be start date: 5-Dec-16 and end date:30-Mar-18 and so on.

The expected output would look something like:

Product_Name    Customer    Start_Date  End_Date    Period
      LP        abc         29-Oct-16   16-Jun-17   Monthly
      LP        abc         5-Dec-16    30-Mar-18   Monthly
      LP        abc         17-Oct-17   31-Dec-17   Monthly
      LP        abc         20-Oct-18   9-Mar-22    Monthly
      LP        xyz         7-Sep-15    14-Oct-16   Monthly
      LP        xyz         20-Oct-16   9-Mar-22    Monthly
      LP        xyz         20-Feb-17   22-Oct-18   Annual

I have tried with the following approch:

 df = Enhanced_Dataset.sort_values(['customer','Start_Date','Product_Name','Period'])
 day_diff = (df['Start_Date'] - df['End_Date'].groupby([df['customer'], df['Product_name'], df['Period']]).shift()).dt.days

 group_no = (day_diff.isna() | day_diff.gt(1)|day_diff.lt(-1)).cumsum()
 df_out = (df.groupby(['customer', group_no,'Product_Name','Period'], dropna=False, as_index=False)
        .agg({'customer': 'first',
              'Start_Date': 'first',
              'End_Date': lambda x: x.iloc[-1],
            }))

But getting the wrong result for example for customer abc as:

Product_Name    Period  customer    Start_date  End_Date
LP              Monthly abc     2016-10-29  2017-01-08  
LP              Monthly abc     2016-12-05  2017-08-31  
LP              Monthly abc     2017-01-07  2017-06-16  
LP              Monthly abc     2017-09-01  2018-03-30  
LP              Monthly abc     2017-10-17  2017-12-31  
LP              Monthly abc     2018-10-20  2022-03-09

in the above result it can be observe that first row and third row is continuous and then second row and fourth row is continuous. Can anyone help me to find what is wrong and how can i improve the code to get the expected out.

CodePudding user response:

What I understand you are trying to do is to take all pairs of transactions within a ['Customer', 'Product_Name', 'Period'] group and if the pairs satisfy the constraint that the End_Date of one member is within 1 day of Start_Date of the other member - we treat this pair of transactions as the same group.

In other words, you want to build a "graph" of transactions which you can do using networkx

The outline of the code is as follows -

  1. Group by ['Customer', 'Product_Name', 'Period']
  2. Add each transaction within the group as a node to the Graph
  3. Within each group construct all possible pairs (itertools.combinations)
  4. Pairs that satisfy the 1-day constraint make edges with each other
  5. Get the nx.connected_components
  6. Group by the component
import networkx as nx
from datetime import timedelta
from itertools import combinations
from collections import namedtuple
df['Start_Date'] = pd.to_datetime(df['Start_Date'])
df['End_Date'] = pd.to_datetime(df['End_Date'])
df_grouped = df.groupby(['Customer', 'Product_Name', 'Period'])
trans_rec = namedtuple('trans_rec', ['idx', 'start_date', 'end_date'])
for group in df_grouped.groups:
    _transactions_df = df_grouped.get_group(group)
    _transactions = zip(_transactions_df.index, _transactions_df['Start_Date'], _transactions_df['End_Date'])
    G = nx.Graph()
    for _ in _transactions:
        G.add_node(trans_rec(*_))
    node_pairs = combinations(G.nodes, 2)
    for node_pair in node_pairs:
        node_1, node_2 = node_pair
        if timedelta(days=-1) <= node_2.end_date - node_1.start_date <= timedelta(days=1):
            G.add_edge(node_1, node_2)
        if timedelta(days=-1) <= node_1.end_date - node_2.start_date <= timedelta(days=1):
            G.add_edge(node_1, node_2)
    index_group_mapping = dict()
    for ix, cc in enumerate(nx.connected_components(G)):
        for node in cc:
            index_group_mapping[node.idx] = '_'.join(group)   '_'   str(ix)
    _transactions_df['index_group'] = _transactions_df.index.map(index_group_mapping)
    agg_transactions = _transactions_df.groupby(['index_group']).agg({'Start_Date': 'min', 'End_Date': 'max'})
        agg_transactions = agg_transactions.reset_index()
        base_cols = agg_transactions.columns
        agg_transactions = agg_transactions.join(agg_transactions[
'index_group'].str.split('_', expand=True))
        new_cols = ['Customer', 'Product_Name', 'Period', 'group_num']
        agg_transactions.columns = list(base_cols)   list(new_cols)
        agg_transactions = agg_transactions[['Start_Date', 'End_Date', 'Customer', 'Product_Name', 'Period']]
        print(agg_transactions)


#   Start_Date   End_Date Customer Product_Name   Period
# 0 2016-10-29 2017-06-16      abc           LP  Monthly
# 1 2016-12-05 2018-03-30      abc           LP  Monthly
# 2 2017-10-17 2017-12-31      abc           LP  Monthly
# 3 2018-10-20 2022-03-09      abc           LP  Monthly
#   Start_Date   End_Date Customer Product_Name  Period
# 0 2017-10-20 2018-10-22      xyz           LP  Annual
#   Start_Date   End_Date Customer Product_Name   Period
# 0 2015-09-07 2016-10-14      xyz           LP  Monthly
# 1 2016-10-20 2022-03-09      xyz           LP  Monthly
  • Related