thanks in advance for help. I have two dataframes as given below. I need to create column category in sold frame based on information in size frame. It should check siz of product within Min and Max sizes for this product and return group. Is it possible to do it in pandas? not SQL. I think merge and join method will not work here.
size=pd.DataFrame({"Min Size":[30,41,40],
"Max Size":[40, 60, 50],
"Category":['small', 'big', "medium"],
"Product":['Apple', 'Apple', "Peach"]})
sold=pd.DataFrame({"Purchase_date":["20/01/2020", "18/02/2020", "01/06/2020"],
"Size":[35, 45, 42],
"Category":["small","big","medium"],
"Product":['Apple', 'Peach', "Apple"]})
CodePudding user response:
Joining condition in pandas must be exact match. It doesn't have the BETWEEN ... AND ...
clause like in SQL.
You can use numpy broadcast to compare every row in sold
to every row in size
and filter for a match:
# Converting everything to numpy for comparison
sold_product = sold["Product"].to_numpy()[:, None]
sold_size = sold["Size"].to_numpy()[:, None]
product, min_size, max_size = size[["Product", "Min Size", "Max Size"]].T.to_numpy()
# Compare every row in `sold` to every row in `size`.
# `mask` is a len(sold) * len(size) matrix whose value
# indicate if row i in `sold` matches row j in `size`
mask = (sold_product == product) & (min_size <= sold_size) & (sold_size <= max_size)
# For each row in `sold`, find the first row in `size` that
# is True / non-zero
idx, join_key = mask.nonzero()
sold.loc[idx, "join_key"] = join_key
# Result
sold.merge(
size[["Category"]],
how="left",
left_on="join_key",
right_index=True,
suffixes=("_Expected", "_Actual"),
)