I have df:
Product_Name Customer Start_Date End_Date Period
LP abc 29-Oct-16 3-Nov-16 Monthly
LP abc 2-Nov-16 8-Jan-17 Monthly
LP abc 5-Dec-16 31-Aug-17 Monthly
LP abc 7-Jan-17 16-Jun-17 Monthly
LP abc 1-Sep-17 30-Mar-18 Monthly
LP abc 17-Oct-17 31-Dec-17 Monthly
LP abc 20-Oct-18 31-Aug-19 Monthly
LP abc 1-Sep-19 27-Oct-20 Monthly
LP abc 28-Oct-20 31-Aug-21 Monthly
LP abc 1-Sep-21 9-Mar-22 Monthly
LP xyz 7-Sep-15 31-Aug-16 Monthly
LP xyz 1-Sep-16 14-Oct-16 Monthly
LP xyz 20-Oct-16 31-Aug-17 Monthly
LP xyz 1-Sep-17 31-Aug-18 Monthly
LP xyz 20-Oct-17 31-Aug-18 Annual
LP xyz 1-Sep-18 22-Oct-18 Annual
LP xyz 1-Sep-18 31-Aug-19 Monthly
LP xyz 1-Sep-19 27-Oct-20 Monthly
LP xyz 28-Oct-20 31-Aug-21 Monthly
LP xyz 1-Sep-21 9-Mar-22 Monthly
I want to take consecutive dates (as in one date before, one date after or at the same date from the end date) and club the dates together according to the Customer and Period Columns. For example, for customer abc, the earliest start date is 29-Oct-16 and continues till 16-Jun-17, so we need this as one transaction in the output file. Then next transaction in output should be start date: 5-Dec-16 and end date:30-Mar-18 and so on.
The expected output would look something like:
Product_Name Customer Start_Date End_Date Period
LP abc 29-Oct-16 16-Jun-17 Monthly
LP abc 5-Dec-16 30-Mar-18 Monthly
LP abc 17-Oct-17 31-Dec-17 Monthly
LP abc 20-Oct-18 9-Mar-22 Monthly
LP xyz 7-Sep-15 14-Oct-16 Monthly
LP xyz 20-Oct-16 9-Mar-22 Monthly
LP xyz 20-Feb-17 22-Oct-18 Annual
I have tried with the following approch:
df = Enhanced_Dataset.sort_values(['customer','Start_Date','Product_Name','Period'])
day_diff = (df['Start_Date'] - df['End_Date'].groupby([df['customer'], df['Product_name'], df['Period']]).shift()).dt.days
group_no = (day_diff.isna() | day_diff.gt(1)|day_diff.lt(-1)).cumsum()
df_out = (df.groupby(['customer', group_no,'Product_Name','Period'], dropna=False, as_index=False)
.agg({'customer': 'first',
'Start_Date': 'first',
'End_Date': lambda x: x.iloc[-1],
}))
But getting the wrong result for example for customer abc as:
Product_Name Period customer Start_date End_Date
LP Monthly abc 2016-10-29 2017-01-08
LP Monthly abc 2016-12-05 2017-08-31
LP Monthly abc 2017-01-07 2017-06-16
LP Monthly abc 2017-09-01 2018-03-30
LP Monthly abc 2017-10-17 2017-12-31
LP Monthly abc 2018-10-20 2022-03-09
in the above result it can be observe that first row and third row is continuous and then second row and fourth row is continuous. Can anyone help me to find what is wrong and how can i improve the code to get the expected out.
CodePudding user response:
What I understand you are trying to do is to take all pairs of transactions within a ['Customer', 'Product_Name', 'Period']
group and if the pairs satisfy the constraint that the End_Date
of one member is within 1 day of Start_Date
of the other member - we treat this pair of transactions as the same group.
In other words, you want to build a "graph" of transactions which you can do using networkx
The outline of the code is as follows -
- Group by
['Customer', 'Product_Name', 'Period']
- Add each transaction within the group as a node to the Graph
- Within each group construct all possible pairs (
itertools.combinations
) - Pairs that satisfy the 1-day constraint make edges with each other
- Get the nx.connected_components
- Group by the component
import networkx as nx
from datetime import timedelta
from itertools import combinations
from collections import namedtuple
df['Start_Date'] = pd.to_datetime(df['Start_Date'])
df['End_Date'] = pd.to_datetime(df['End_Date'])
df_grouped = df.groupby(['Customer', 'Product_Name', 'Period'])
trans_rec = namedtuple('trans_rec', ['idx', 'start_date', 'end_date'])
for group in df_grouped.groups:
_transactions_df = df_grouped.get_group(group)
_transactions = zip(_transactions_df.index, _transactions_df['Start_Date'], _transactions_df['End_Date'])
G = nx.Graph()
for _ in _transactions:
G.add_node(trans_rec(*_))
node_pairs = combinations(G.nodes, 2)
for node_pair in node_pairs:
node_1, node_2 = node_pair
if timedelta(days=-1) <= node_2.end_date - node_1.start_date <= timedelta(days=1):
G.add_edge(node_1, node_2)
if timedelta(days=-1) <= node_1.end_date - node_2.start_date <= timedelta(days=1):
G.add_edge(node_1, node_2)
index_group_mapping = dict()
for ix, cc in enumerate(nx.connected_components(G)):
for node in cc:
index_group_mapping[node.idx] = '_'.join(group) '_' str(ix)
_transactions_df['index_group'] = _transactions_df.index.map(index_group_mapping)
agg_transactions = _transactions_df.groupby(['index_group']).agg({'Start_Date': 'min', 'End_Date': 'max'})
agg_transactions = agg_transactions.reset_index()
base_cols = agg_transactions.columns
agg_transactions = agg_transactions.join(agg_transactions[
'index_group'].str.split('_', expand=True))
new_cols = ['Customer', 'Product_Name', 'Period', 'group_num']
agg_transactions.columns = list(base_cols) list(new_cols)
agg_transactions = agg_transactions[['Start_Date', 'End_Date', 'Customer', 'Product_Name', 'Period']]
print(agg_transactions)
# Start_Date End_Date Customer Product_Name Period
# 0 2016-10-29 2017-06-16 abc LP Monthly
# 1 2016-12-05 2018-03-30 abc LP Monthly
# 2 2017-10-17 2017-12-31 abc LP Monthly
# 3 2018-10-20 2022-03-09 abc LP Monthly
# Start_Date End_Date Customer Product_Name Period
# 0 2017-10-20 2018-10-22 xyz LP Annual
# Start_Date End_Date Customer Product_Name Period
# 0 2015-09-07 2016-10-14 xyz LP Monthly
# 1 2016-10-20 2022-03-09 xyz LP Monthly