I can't use external libraries other than pandas or numpy.
If I had a dataframe like
from datetime import datetime
import pandas as pd
d = pd.DataFrame({'purchase_date': [datetime(2022, 1, 5), datetime(2022, 12, 10)],
                  'shop_visits': [[datetime(2022, 1, 4), datetime(2022, 1, 6), datetime(2022, 1, 7)],
                                  [datetime(2022, 5, 30), datetime(2022, 9, 30), datetime(2022, 12, 10)]]})
purchase_date | shop_visits
---|---
2022-01-05 | [2022-01-04, 2022-01-06, 2022-01-07]
2022-12-10 | [2022-05-30, 2022-09-30, 2022-12-10]
I would need a function that returns 2022-01-06 for the first row and 2022-12-10 for the second, because each is the closest date in the list on or after purchase_date.
I already have a function with apply that takes the min with my own lambda key, but it's taking too long and I want to vectorize it:
d.apply(lambda x: min([v for v in x['shop_visits'] if v >= x['purchase_date']], key=lambda v: v - x['purchase_date']), axis=1)
The thing is that I am not sure if this can be done using numpy.
CodePudding user response:
Here is one way to accomplish it:
Explode shop_visits into a df2, keep the rows where purchase_date is the same as or earlier than shop_visits, and finally do a groupby to choose the first result.
df2 = d.explode('shop_visits')
df2 = df2[df2['purchase_date'] <= df2['shop_visits']]
df2[df2.groupby(['purchase_date']).cumcount().eq(0)]
OR
df2 = d.explode('shop_visits')
df2[df2['purchase_date'] <= df2['shop_visits']].groupby('purchase_date').first().reset_index()
purchase_date shop_visits
0 2022-01-05 2022-01-06
1 2022-12-10 2022-12-10
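Both variants take the first qualifying visit in its listed order, which is the nearest one only if each shop_visits list is already sorted ascending (true for the sample data). If that ordering isn't guaranteed, a sort on the exploded frame makes the groupby robust; a minimal sketch (with the same implicit assumption as above that purchase_date uniquely identifies a row):
df2 = d.explode('shop_visits')
df2 = df2[df2['purchase_date'] <= df2['shop_visits']]
# sort so that first() per group really is the earliest qualifying visit
df2.sort_values('shop_visits').groupby('purchase_date').first().reset_index()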
CodePudding user response:
You can try merge_asof:
left = d.drop(columns=["shop_visits"]).reset_index()  # keep the original row as "index"
right = d["shop_visits"].explode().sort_values().to_frame().reset_index()
pd.merge_asof(
    left,
    right,
    by="index",               # only match visits from the same original row
    left_on="purchase_date",
    right_on="shop_visits",
    direction="forward",      # nearest shop_visit on or after purchase_date
).drop(columns=["index"])
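One caveat: merge_asof requires the left frame to be sorted by the left_on key, so if purchase_date isn't already ascending, sort first and restore the row order afterwards via the saved "index" column; a sketch:
left = left.sort_values("purchase_date")  # merge_asof raises if the key isn't sorted
res = pd.merge_asof(left, right, by="index", left_on="purchase_date",
                    right_on="shop_visits", direction="forward")
res.sort_values("index").drop(columns=["index"])  # back to the original row order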
Here's a faster version without merge_asof:
tmp = (
# Extract purchase_date and shop_visits
d[["purchase_date", "shop_visits"]]
# Explode shop_visits
.explode("shop_visits")
# Reset index so we can align using this index later
.reset_index()
# Since we only care about visits on or after the purchase_date,
# filter out the rest
.query("purchase_date <= shop_visits")
# Sort by the original index and shop visit date
.sort_values(["index", "shop_visits"])
# For each index, keep only the first visit
.drop_duplicates("index")
)
# Set the original index back on tmp so that we can
# easily align it with the original dataframe
d["first_visit_after_purchase"] = tmp.set_index("index")[["shop_visits"]]
# Note that some first_visit_after_purchase may be NaN due
# to our query condition
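A quick sanity check on the example frame (expected values taken from the question itself):
print(d["first_visit_after_purchase"])
# expected: 2022-01-06 for the first row, 2022-12-10 for the second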
Took 7 seconds on my ancient Mac when d has 1M rows and 10 visits per row. The processing time grows linearly with the total number of shop visits (len(d["shop_visits"].explode())).
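As for the numpy part of the question: the same flatten-filter-reduce idea works in plain numpy with np.repeat, a boolean mask, and np.minimum.at. A rough sketch (the column name first_visit_np is made up here, and I haven't benchmarked it against the pandas versions):
import numpy as np

lengths = np.array([len(v) for v in d["shop_visits"]])
row_idx = np.repeat(np.arange(len(d)), lengths)  # original row of each visit
# work in int64 nanoseconds so a scalar sentinel can mark "no visit found"
visits = np.concatenate(d["shop_visits"].to_list()).astype("datetime64[ns]").view("int64")
purchases = d["purchase_date"].to_numpy().view("int64")[row_idx]
mask = visits >= purchases  # keep visits on or after the purchase
out = np.full(len(d), np.iinfo(np.int64).max)  # sentinel: no qualifying visit
np.minimum.at(out, row_idx[mask], visits[mask])  # per-row minimum of valid visits
out[out == np.iinfo(np.int64).max] = np.iinfo(np.int64).min  # int64 min reads back as NaT
d["first_visit_np"] = out.view("datetime64[ns]")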