Group ids by 2 date interval columns and 2 other columns


I have the following dataframe:

ID  Fruit   Price  Location  Start_Date  End_Date
01  Orange  12     ABC       01-03-2015  01-05-2015
01  Orange  9.5    ABC       01-03-2015  01-05-2015
02  Apple   10     PQR       04-09-2019  04-11-2019
06  Orange  11     ABC       01-04-2015  01-06-2015
05  Peach   15     XYZ       07-11-2021  07-13-2021
08  Apple   10.5   PQR       04-09-2019  04-11-2019
10  Apple   10     LMN       04-10-2019  04-12-2019
03  Peach   14.5   XYZ       07-11-2020  07-13-2020
11  Peach   12.5   ABC       01-04-2015  01-05-2015
12  Peach   12.5   ABC       01-03-2015  01-05-2015

I want to form groups of IDs that share the same Location and Fruit and whose date ranges are close. The date interval condition is that we only group IDs whose Start_Dates and End_Dates are no more than 3 days apart. E.g. ID 06's Start_Date is 01-04-2015 and its End_Date is 01-06-2015, while ID 01's Start_Date is 01-03-2015 and its End_Date is 01-05-2015. Their Start_Dates and End_Dates are only 1 day apart, so the merge is acceptable (i.e. these two IDs can be grouped together, provided the other variables, Location and Fruit, also match).
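
In code, the condition for IDs 01 and 06 would look something like this (dates are in MM-DD-YYYY format):

import pandas as pd

# Gap between the two Start_Dates and between the two End_Dates (IDs 01 and 06)
start_gap = abs(pd.Timestamp('2015-01-04') - pd.Timestamp('2015-01-03'))
end_gap = abs(pd.Timestamp('2015-01-06') - pd.Timestamp('2015-01-05'))

# Both gaps must be at most 3 days for the two IDs to be mergeable
print(start_gap <= pd.Timedelta(days=3) and end_gap <= pd.Timedelta(days=3))  # True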

Also, I only want to output groups with more than one unique ID.

My output should be (the start date and end date is merged):

ID  Fruit   Price  Location  Start_Date  End_Date
01  Orange  12     ABC       01-03-2015  01-06-2015
01  Orange  9.5
06  Orange  11
11  Peach   12.5
12  Peach   12.5
02  Apple   10     PQR       04-09-2019  04-11-2019
08  Apple   10.5

IDs 05 and 03 get filtered out because each ends up as a single record (they don't meet the date interval condition with each other, since their dates are a year apart). ID 10 gets filtered out because it's from a different location.

I have no idea how to merge intervals for two such date columns. I have tried a few techniques to test out the grouping (without the date merge).

My latest attempt uses pd.Grouper:

output = df.groupby(
    [pd.Grouper(key='Start_Date', freq='D'),
     pd.Grouper(key='End_Date', freq='D'),
     pd.Grouper(key='Location'),
     pd.Grouper(key='Fruit'),
     'ID']
).agg(unique_emp=('ID', 'nunique'))

Need help getting the output. Thank you!!

CodePudding user response:

Here is a slow/non-vectorized approach where we "manually" walk through the sorted date values and assign them to bins, incrementing to the next bin whenever the gap is too large. A helper function builds the bin labels, which are then passed to groupby alongside the other keys. (Edited so that the ID column is the index.)

from datetime import timedelta
import pandas as pd

#Setup
df = pd.DataFrame(
    columns = ['ID', 'Fruit', 'Price', 'Location', 'Start_Date', 'End_Date'],
    data = [
        [1, 'Orange', 12.0, 'ABC', '01-03-2015', '01-05-2015'],
        [1, 'Orange', 9.5, 'ABC', '01-03-2015', '01-05-2015'],
        [2, 'Apple', 10.0, 'PQR', '04-09-2019', '04-11-2019'],
        [6, 'Orange', 11.0, 'ABC', '01-04-2015', '01-06-2015'],
        [5, 'Peach', 15.0, 'XYZ', '07-11-2021', '07-13-2021'],
        [8, 'Apple', 10.5, 'PQR', '04-09-2019', '04-11-2019'],
        [10, 'Apple', 10.0, 'LMN', '04-10-2019', '04-12-2019'],
        [3, 'Peach', 14.5, 'XYZ', '07-11-2020', '07-13-2020'],
        [11, 'Peach', 12.5, 'ABC', '01-04-2015', '01-05-2015'],
        [12, 'Peach', 12.5, 'ABC', '01-03-2015', '01-05-2015'],
    ]
)

df['Start_Date'] = pd.to_datetime(df['Start_Date'])
df['End_Date'] = pd.to_datetime(df['End_Date'])
df = df.set_index('ID')

#Function to bin the dates
def create_date_bin_series(dates, max_span=timedelta(days=3)):
    
    orig_order = zip(dates,range(len(dates)))
    sorted_order = sorted(orig_order)
    
    curr_bin = 1
    curr_date = min(dates)
    
    date_bins = []
    for date,i in sorted_order:
        if date-curr_date > max_span:
            curr_bin += 1

        curr_date = date
        date_bins.append((curr_bin,i))
    
    #sort the date_bins to match the original order
    date_bins = [v for v,_ in sorted(date_bins, key = lambda x: x[1])]
    return date_bins
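
#Quick sanity check of the binning logic (an illustrative addition): a gap of
#more than 3 days between consecutive sorted dates opens a new bin, e.g.
#create_date_bin_series(pd.to_datetime(pd.Series(['2015-01-03', '2015-01-08', '2015-01-04'])))
#returns [1, 2, 1], because 2015-01-08 is more than 3 days after 2015-01-04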
        
#Apply function to group each date into a bin with other dates within 3 days of it
start_bins = create_date_bin_series(df['Start_Date'])
end_bins = create_date_bin_series(df['End_Date'])

#Group by new columns
df['fruit_group'] = df.groupby(['Fruit','Location',start_bins,end_bins]).ngroup()

#Print the table sorted by these new groups
print(df.sort_values('fruit_group'))

#you can use the new fruit_group column to filter and agg etc
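
#For example, one way (an illustrative addition) to keep only the groups
#that contain more than one unique ID, as the question requires:
id_counts = df.reset_index().groupby('fruit_group')['ID'].nunique()
df_filtered = df[df['fruit_group'].isin(id_counts[id_counts > 1].index)]
print(df_filtered.sort_values('fruit_group'))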

Output

     Fruit  Price Location Start_Date   End_Date  fruit_group
ID
10   Apple   10.0      LMN 2019-04-10 2019-04-12            0
2    Apple   10.0      PQR 2019-04-09 2019-04-11            1
8    Apple   10.5      PQR 2019-04-09 2019-04-11            1
1   Orange   12.0      ABC 2015-01-03 2015-01-05            2
1   Orange    9.5      ABC 2015-01-03 2015-01-05            2
6   Orange   11.0      ABC 2015-01-04 2015-01-06            2
11   Peach   12.5      ABC 2015-01-04 2015-01-05            3
12   Peach   12.5      ABC 2015-01-03 2015-01-05            3
3    Peach   14.5      XYZ 2020-07-11 2020-07-13            4
5    Peach   15.0      XYZ 2021-07-11 2021-07-13            5

CodePudding user response:

This is essentially a gaps-and-islands problem. If you sort your dataframe by Fruit, Location and Start_Date, you can create islands (i.e. fruit groups) as follows:

  • If the current row's Fruit or Location is not the same as the previous row's, start a new island
  • If the current row's End_Date is more than 3 days after the island's Start_Date, start a new island

The code:

import pandas as pd

for col in ["Start_Date", "End_Date"]:
    df[col] = pd.to_datetime(df[col])

# This algorithm requires a sorted dataframe
df = df.sort_values(["Fruit", "Location", "Start_Date"])

# Assign each row to an island
i = 0
islands = []
last_fruit, last_location, last_start = None, None, df["Start_Date"].iloc[0]

for _, (fruit, location, start, end) in df[["Fruit", "Location", "Start_Date", "End_Date"]].iterrows():
    if (fruit != last_fruit) or (location != last_location) or (end - last_start > pd.Timedelta(days=3)):
        i += 1
        last_fruit, last_location, last_start = fruit, location, start
    else:
        last_fruit, last_location = fruit, location
    islands.append(i)

df["Island"] = islands

# Filter for islands having more than one row
idx = pd.Series(islands).value_counts().loc[lambda c: c > 1].index
df[df["Island"].isin(idx)]