I can't use external libraries other than pandas or numpy.
If I had a dataframe like
from datetime import datetime
import pandas as pd
d = pd.DataFrame({'purchase_date': [datetime(2022, 1, 5), datetime(2022, 12, 10)],
                  'shop_visits': [[datetime(2022, 1, 4), datetime(2022, 1, 6), datetime(2022, 1, 7)],
                                  [datetime(2022, 5, 30), datetime(2022, 9, 30), datetime(2022, 12, 10)]]})
purchase_date | shop_visits
---|---
2022-01-05 | [2022-01-04, 2022-01-06, 2022-01-07]
2022-12-10 | [2022-05-30, 2022-09-30, 2022-12-10]
I would need a function that returns 2022-01-06 for the first row and 2022-12-10 for the second, because each is the closest date in the list on or after purchase_date.
I already have a function with apply that takes the min with my own lambda key, but it's taking too long and I want to vectorize it:
d.apply(lambda x: min([v for v in x['shop_visits'] if v >= x['purchase_date']], key=lambda v: v - x['purchase_date']), axis=1)
The thing is that I am not sure if this can be done using numpy.
CodePudding user response:
Here is one way to accomplish it:
Explode shop_visits into a df2, keep the rows where purchase_date is the same as or earlier than shop_visits, and finally do a groupby to choose the first result.
df2 = d.explode('shop_visits')
df2 = df2[df2['purchase_date'] <= df2['shop_visits']]
df2[df2.groupby(['purchase_date']).cumcount().eq(0)]
OR
df2 = d.explode('shop_visits')
df2[df2['purchase_date'] <= df2['shop_visits']].groupby('purchase_date').first().reset_index()
purchase_date shop_visits
0 2022-01-05 2022-01-06
1 2022-12-10 2022-12-10
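Both variants take the first qualifying visit in its listed order, which is the nearest one only if each shop_visits list is already sorted ascending (true for the sample data). If that ordering isn't guaranteed, a sort on the exploded frame makes the groupby robust; a minimal sketch (with the same implicit assumption as above that purchase_date uniquely identifies a row):
df2 = d.explode('shop_visits')
df2 = df2[df2['purchase_date'] <= df2['shop_visits']]
# sort so that first() per group really is the earliest qualifying visit
df2.sort_values('shop_visits').groupby('purchase_date').first().reset_index()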
CodePudding user response:
You can try merge_asof:
left = d.drop(columns=["shop_visits"]).reset_index()  # keep the original row as "index"
right = d["shop_visits"].explode().sort_values().to_frame().reset_index()
pd.merge_asof(
    left,
    right,
    by="index",               # only match visits from the same original row
    left_on="purchase_date",
    right_on="shop_visits",
    direction="forward",      # nearest shop_visit on or after purchase_date
).drop(columns=["index"])
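One caveat: merge_asof requires the left frame to be sorted by the left_on key, so if purchase_date isn't already ascending, sort first and restore the row order afterwards via the saved "index" column; a sketch:
left = left.sort_values("purchase_date")  # merge_asof raises if the key isn't sorted
res = pd.merge_asof(left, right, by="index", left_on="purchase_date",
                    right_on="shop_visits", direction="forward")
res.sort_values("index").drop(columns=["index"])  # back to the original row order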
Here's a faster version without merge_asof:
tmp = (
# Extract purchase_date and shop_visits
d[["purchase_date", "shop_visits"]]
# Explode shop_visits
.explode("shop_visits")
# Reset index so we can align using this index later
.reset_index()
# Since we only care about visits on or after the purchase_date,
# filter out the rest
.query("purchase_date <= shop_visits")
# Sort by the original index and shop visit date
.sort_values(["index", "shop_visits"])
# For each index, keep only the first visit
.drop_duplicates("index")
)
# Set the original index back on tmp so that we can
# easily align it with the original dataframe
d["first_visit_after_purchase"] = tmp.set_index("index")[["shop_visits"]]
# Note that some first_visit_after_purchase may be NaN due
# to our query condition
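A quick sanity check on the example frame (expected values taken from the question itself):
print(d["first_visit_after_purchase"])
# expected: 2022-01-06 for the first row, 2022-12-10 for the second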
Took 7 seconds on my ancient Mac when d has 1M rows and 10 visits per row. The processing time grows linearly with the total number of shop visits (len(d["shop_visits"].explode())).
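As for the numpy part of the question: the same flatten-filter-reduce idea works in plain numpy with np.repeat, a boolean mask, and np.minimum.at. A rough sketch (the column name first_visit_np is made up here, and I haven't benchmarked it against the pandas versions):
import numpy as np

lengths = np.array([len(v) for v in d["shop_visits"]])
row_idx = np.repeat(np.arange(len(d)), lengths)  # original row of each visit
# work in int64 nanoseconds so a scalar sentinel can mark "no visit found"
visits = np.concatenate(d["shop_visits"].to_list()).astype("datetime64[ns]").view("int64")
purchases = d["purchase_date"].to_numpy().view("int64")[row_idx]
mask = visits >= purchases  # keep visits on or after the purchase
out = np.full(len(d), np.iinfo(np.int64).max)  # sentinel: no qualifying visit
np.minimum.at(out, row_idx[mask], visits[mask])  # per-row minimum of valid visits
out[out == np.iinfo(np.int64).max] = np.iinfo(np.int64).min  # int64 min reads back as NaT
d["first_visit_np"] = out.view("datetime64[ns]")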