Home > Blockchain >  How to efficiently join dataframes based on dates within intervals
How to efficiently join dataframes based on dates within intervals

Time:10-25

Problem

I need to write some python code to augment the visit_df table subject to date ranges in the customer_df table. Essentially, the program needs to gather every visit for a given custmer that has an VISIT_DTM in the date range (between START_DTM and END_DTM in the customer_df table) and then produce a result table where the attributes are gathered into a new cell for each row in visit_df.

I do have a solution but it is very slow due to a large dataset and the code is inefficient nested for loops. See code below. I suspect there is some combination of Python/Pandas/numpy functions that might be much faster and easier to comprehend.

Is there a better approach to this problem?

Here are the input tables:

visit_df

|CUST_ID|VISIT_ID|VISIT_DTM|
|-------|--------|----------|
|1|1|2009-08-28 00:00:00.000000|
|1|2|2009-08-31 00:00:00.000000|
|20|11|2009-08-27 00:00:00.000000|
|20|21|2009-08-31 00:00:00.000000|

customer_df

|CUST_ID|ATTR|START_DTM|END_DTM|
|-------|-------|------|------|
|1|'RED'|2008-08-27 00:00:00.000000|2008-08-28 00:00:00.000000|
|1|'BLUE'|2008-08-28 00:00:00.000000|2010-08-28 00:00:00.000000|
|1|'BLACK'|2009-08-29 00:00:00.000000|2010-08-28 00:00:00.000000|
|20|'YELLOW'|2008-08-26 00:00:00.000000|2009-08-28 00:00:00.000000|
|20|'BLUE'|2008-08-28 00:00:00.000000|2010-08-28 00:00:00.000000|
|20|BLACK|2009-08-29 00:00:00.000000|2010-09-03 00:00:00.000000|

Here is the desired result

result named augmented visit_df in the slow code

|CUST_ID|VISIT_ID|VISIT_DTM|ATTR_STR|ATTR_ARRAY|
|-------|--------|----------|-------|----------|
|1|1|2009-08-28 00:00:00.000000|'BLUE'|['BLUE']|
|1|2|2009-08-31 00:00:00.000000|'BLUE-BLACK'|['BLUE',BLACK']|
|20|11|2009-08-27 00:00:00.000000|'YELLOW-BLUE'|['YELLOW','BLUE']
|20|21|2009-08-31 00:00:00.000000|'BLUE-BLACK'|['BLUE','BLACK']

Very Slow Code

    # Import pandas library
    import pandas as pd
    # Data
    visit_data = [[1, 1, '2009-08-28 00:00:00.000000'],
            [1, 2, '2009-08-31 00:00:00.000000'],
            [20, 11, '2009-08-27 00:00:00.000000'],
            [20, 21, '2009-08-31 00:00:00.000000']] 
    customer_attr = [[1,'RED', '2008-08-27 00:00:00.000000','2008-08-28 00:00:00.000000'],
                    [1,'BLUE', '2008-08-28 00:00:00.000000','2010-08-28 00:00:00.000000'],
                    [1, 'BLACK', '2009-08-29 00:00:00.000000', '2010-08-28 00:00:00.000000'],
                    [20,'YELLOW', '2008-08-26 00:00:00.000000','2009-08-28 00:00:00.000000'],
                    [20,'BLUE', '2008-08-28 00:00:00.000000','2010-08-28 00:00:00.000000'],
                    [20, 'BLACK', '2009-08-29 00:00:00.000000', '2010-09-03 00:00:00.000000'],
                    ]
    # Create the pandas DataFrames
    visit_df = pd.DataFrame(visit_data, columns=['CUST_ID', 'VISIT_ID', 'VISIT_DTM'])
    customer_df = pd.DataFrame(customer_attr, columns=['CUST_ID', 'ATTR', 'START_DTM', 'END_DTM'])
    
    
    # Very slow code to augment the visit_df to produce a result df
    visit_df['ATTR_STR'] = ''
    visit_df['ATTR_ARRAY'] = ''
    
    for index, visit_row in visit_df.iterrows():
        attr_array = []
        for index2, cust_row in customer_df.iterrows():
            if (visit_row['CUST_ID'] == cust_row['CUST_ID']):
                if cust_row['START_DTM'] <= visit_row['VISIT_DTM'] <= cust_row['END_DTM']:
                    attr_array.append(str(cust_row['ATTR']))
    
    
            visit_df.at[index, 'ATTR_STR'] = '-'.join(attr_array)
            visit_df.at[index, 'ATTR_ARRAY'] = attr_array

CodePudding user response:

The data you have is described by mathematical step functions, and staircase has been built upon pandas and numpy for this purpose.

It would be good to work with pandas' Timestamp so, in addition to the setup code you provided, run the following

visit_df["VISIT_DTM"] = pd.to_datetime(visit_df["VISIT_DTM"])
customer_df["START_DTM"] = pd.to_datetime(customer_df["START_DTM"])
customer_df["END_DTM"] = pd.to_datetime(customer_df["END_DTM"])

solution

We'll create a step function for each customer and colour (CUST_ID and ATTR). A step function is represented by the staircase.Stairs class. This class is to staircase as Series is to pandas. To do this we group the dataframe on these variables and pass the sub-dataframes to the Stairs constructor.

import staircase as sc

stepfunctions = customer_df.groupby(["CUST_ID", "ATTR"]).apply(sc.Stairs, "START_DTM", "END_DTM")

Our stepfunctions variable looks like this. It is a series, with a multi-index, and values are Stairs objects.

CUST_ID  ATTR  
1        BLACK     <staircase.Stairs, id=2516845521032>
         BLUE      <staircase.Stairs, id=2516845756680>
         RED       <staircase.Stairs, id=2516845833800>
20       BLACK     <staircase.Stairs, id=2516843896648>
         BLUE      <staircase.Stairs, id=2516845647752>
         YELLOW    <staircase.Stairs, id=2516845788808>
dtype: object

Next we will create a function which, given a customer id, extracts the visiting times from visit_df and samples the stepfunctions for that customer, returning a dataframe.

def check_customer_visit(customer):
    visit_times = visit_df.query(f"CUST_ID == {customer}")["VISIT_DTM"]
    return sc.sample(stepfunctions[customer], visit_times).melt(ignore_index=False).assign(CUST_ID = customer)

Running this with customer = 1 gives

ATTR   VISIT_DTM  value  CUST_ID                        
BLACK 2009-08-28      0        1
BLUE  2009-08-28      1        1
RED   2009-08-28      0        1
BLACK 2009-08-31      1        1
BLUE  2009-08-31      1        1
RED   2009-08-31      0        1

A value of 0 means the attribute didn't apply during that visit.

So do this for each customer and concatenate the dataframes

data = pd.concat([check_customer_visit(customer) for customer in visit_df["CUST_ID"].unique()])

Then we can reset the index, filter rows where the attributes apply and drop the value column

data.reset_index().query("value == 1").drop(columns="value")

This gives you

      ATTR  VISIT_DTM  CUST_ID
1     BLUE 2009-08-28        1
3    BLACK 2009-08-31        1
4     BLUE 2009-08-31        1
7     BLUE 2009-08-27       20
8   YELLOW 2009-08-27       20
9    BLACK 2009-08-31       20
10    BLUE 2009-08-31       20

This is in tidy data format. If you want to get it into the format you submitted then you can groupby customer and visit time and apply list to the ATTR column.

  • Related