Problem
I need to write some python code to augment the visit_df table subject to date ranges in the customer_df table. Essentially, the program needs to gather every visit for a given custmer that has an VISIT_DTM in the date range (between START_DTM and END_DTM in the customer_df table) and then produce a result table where the attributes are gathered into a new cell for each row in visit_df.
I do have a solution but it is very slow due to a large dataset and the code is inefficient nested for loops. See code below. I suspect there is some combination of Python/Pandas/numpy functions that might be much faster and easier to comprehend.
Is there a better approach to this problem?
Here are the input tables:
visit_df
|CUST_ID|VISIT_ID|VISIT_DTM|
|-------|--------|----------|
|1|1|2009-08-28 00:00:00.000000|
|1|2|2009-08-31 00:00:00.000000|
|20|11|2009-08-27 00:00:00.000000|
|20|21|2009-08-31 00:00:00.000000|
customer_df
|CUST_ID|ATTR|START_DTM|END_DTM|
|-------|-------|------|------|
|1|'RED'|2008-08-27 00:00:00.000000|2008-08-28 00:00:00.000000|
|1|'BLUE'|2008-08-28 00:00:00.000000|2010-08-28 00:00:00.000000|
|1|'BLACK'|2009-08-29 00:00:00.000000|2010-08-28 00:00:00.000000|
|20|'YELLOW'|2008-08-26 00:00:00.000000|2009-08-28 00:00:00.000000|
|20|'BLUE'|2008-08-28 00:00:00.000000|2010-08-28 00:00:00.000000|
|20|BLACK|2009-08-29 00:00:00.000000|2010-09-03 00:00:00.000000|
Here is the desired result
result named augmented visit_df in the slow code
|CUST_ID|VISIT_ID|VISIT_DTM|ATTR_STR|ATTR_ARRAY|
|-------|--------|----------|-------|----------|
|1|1|2009-08-28 00:00:00.000000|'BLUE'|['BLUE']|
|1|2|2009-08-31 00:00:00.000000|'BLUE-BLACK'|['BLUE',BLACK']|
|20|11|2009-08-27 00:00:00.000000|'YELLOW-BLUE'|['YELLOW','BLUE']
|20|21|2009-08-31 00:00:00.000000|'BLUE-BLACK'|['BLUE','BLACK']
Very Slow Code
# Import pandas library
import pandas as pd
# Data
visit_data = [[1, 1, '2009-08-28 00:00:00.000000'],
[1, 2, '2009-08-31 00:00:00.000000'],
[20, 11, '2009-08-27 00:00:00.000000'],
[20, 21, '2009-08-31 00:00:00.000000']]
customer_attr = [[1,'RED', '2008-08-27 00:00:00.000000','2008-08-28 00:00:00.000000'],
[1,'BLUE', '2008-08-28 00:00:00.000000','2010-08-28 00:00:00.000000'],
[1, 'BLACK', '2009-08-29 00:00:00.000000', '2010-08-28 00:00:00.000000'],
[20,'YELLOW', '2008-08-26 00:00:00.000000','2009-08-28 00:00:00.000000'],
[20,'BLUE', '2008-08-28 00:00:00.000000','2010-08-28 00:00:00.000000'],
[20, 'BLACK', '2009-08-29 00:00:00.000000', '2010-09-03 00:00:00.000000'],
]
# Create the pandas DataFrames
visit_df = pd.DataFrame(visit_data, columns=['CUST_ID', 'VISIT_ID', 'VISIT_DTM'])
customer_df = pd.DataFrame(customer_attr, columns=['CUST_ID', 'ATTR', 'START_DTM', 'END_DTM'])
# Very slow code to augment the visit_df to produce a result df
visit_df['ATTR_STR'] = ''
visit_df['ATTR_ARRAY'] = ''
for index, visit_row in visit_df.iterrows():
attr_array = []
for index2, cust_row in customer_df.iterrows():
if (visit_row['CUST_ID'] == cust_row['CUST_ID']):
if cust_row['START_DTM'] <= visit_row['VISIT_DTM'] <= cust_row['END_DTM']:
attr_array.append(str(cust_row['ATTR']))
visit_df.at[index, 'ATTR_STR'] = '-'.join(attr_array)
visit_df.at[index, 'ATTR_ARRAY'] = attr_array
CodePudding user response:
The data you have is described by mathematical step functions, and staircase has been built upon pandas and numpy for this purpose.
It would be good to work with pandas' Timestamp
so, in addition to the setup code you provided, run the following
visit_df["VISIT_DTM"] = pd.to_datetime(visit_df["VISIT_DTM"])
customer_df["START_DTM"] = pd.to_datetime(customer_df["START_DTM"])
customer_df["END_DTM"] = pd.to_datetime(customer_df["END_DTM"])
solution
We'll create a step function for each customer and colour (CUST_ID and ATTR). A step function is represented by the staircase.Stairs class. This class is to staircase as Series is to pandas. To do this we group the dataframe on these variables and pass the sub-dataframes to the Stairs constructor.
import staircase as sc
stepfunctions = customer_df.groupby(["CUST_ID", "ATTR"]).apply(sc.Stairs, "START_DTM", "END_DTM")
Our stepfunctions
variable looks like this. It is a series, with a multi-index, and values are Stairs objects.
CUST_ID ATTR
1 BLACK <staircase.Stairs, id=2516845521032>
BLUE <staircase.Stairs, id=2516845756680>
RED <staircase.Stairs, id=2516845833800>
20 BLACK <staircase.Stairs, id=2516843896648>
BLUE <staircase.Stairs, id=2516845647752>
YELLOW <staircase.Stairs, id=2516845788808>
dtype: object
Next we will create a function which, given a customer id, extracts the visiting times from visit_df
and samples the stepfunctions for that customer, returning a dataframe.
def check_customer_visit(customer):
visit_times = visit_df.query(f"CUST_ID == {customer}")["VISIT_DTM"]
return sc.sample(stepfunctions[customer], visit_times).melt(ignore_index=False).assign(CUST_ID = customer)
Running this with customer = 1
gives
ATTR VISIT_DTM value CUST_ID
BLACK 2009-08-28 0 1
BLUE 2009-08-28 1 1
RED 2009-08-28 0 1
BLACK 2009-08-31 1 1
BLUE 2009-08-31 1 1
RED 2009-08-31 0 1
A value
of 0 means the attribute didn't apply during that visit.
So do this for each customer and concatenate the dataframes
data = pd.concat([check_customer_visit(customer) for customer in visit_df["CUST_ID"].unique()])
Then we can reset the index, filter rows where the attributes apply and drop the value
column
data.reset_index().query("value == 1").drop(columns="value")
This gives you
ATTR VISIT_DTM CUST_ID
1 BLUE 2009-08-28 1
3 BLACK 2009-08-31 1
4 BLUE 2009-08-31 1
7 BLUE 2009-08-27 20
8 YELLOW 2009-08-27 20
9 BLACK 2009-08-31 20
10 BLUE 2009-08-31 20
This is in tidy data format. If you want to get it into the format you submitted then you can groupby customer and visit time and apply list
to the ATTR
column.