Edit: Updated with expected output and explanations
I have a table 'Table1' as:
Fruit   Site       Feature1  Feature2  Feature3
Apple   Sydney     A1        B1        C1
Banana  Sydney     A1        B1        C1
Cherry  Sydney     A1        B2        C1
Durian  Melbourne  A1        B1        C2
Grape   Melbourne  A2        B2        C2
Another table, 'Table2' as:
Order  Site       Feature1  Feature2  Feature3
XX     Sydney     A2        B1        C1
XY     Sydney     A1        B1        C1
XZ     Sydney     A1        B1        C2
YY     Melbourne  A1        B1        C1
YZ     Melbourne  A1        B1        C1
ZZ     Melbourne  A2        B1        C1
Table2 has more rows than Table1. Features and Orders have text values.
I am trying to do the following:
For each fruit, and at the same site, find the order with the highest number of matches across Feature1, Feature2 and Feature3. Here, a high number of matches is (for example) the same values for all three features; a lower match is the same values for only two features.
So, starting with the first fruit, search for the order (at the same site) with the most matches, then go to the next fruit, and only consider the orders that are left after the first match.
Expected Output
Fruit   Order  Matches
Apple   XY     3
Banana  XX     2
Cherry  XZ     1
Durian  YY     2
Grape   ZZ     1
Apple has Order XY because it has the most matches. With Order XY already matched, the best match for Banana is Order XX. Even though Orders XX and XZ have the same number of matches with Banana, the earlier order is preferred. Likewise, Durian was matched to Order YY and not Order YZ, even though they have the same number of matches.
If a fruit has no matches, it won't be included in the output.
My plan was to use merge on Site, so that we see all orders for a fruit at the same site, and then select the order with the highest number of matches, something like the sketch below. However, I don't know how to make sure that an already-matched order does not show up again for the rest of the fruits.
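A rough sketch of that plan (the dataframe names df1 and df2 and the exact code are only an illustration; this finds the best order per fruit but does nothing to stop an order from being reused):
import pandas as pd

# Table1 and Table2 from above, loaded as dataframes.
df1 = pd.DataFrame(
    [['Apple', 'Sydney', 'A1', 'B1', 'C1'],
     ['Banana', 'Sydney', 'A1', 'B1', 'C1'],
     ['Cherry', 'Sydney', 'A1', 'B2', 'C1'],
     ['Durian', 'Melbourne', 'A1', 'B1', 'C2'],
     ['Grape', 'Melbourne', 'A2', 'B2', 'C2']],
    columns=['Fruit', 'Site', 'Feature1', 'Feature2', 'Feature3'])
df2 = pd.DataFrame(
    [['XX', 'Sydney', 'A2', 'B1', 'C1'],
     ['XY', 'Sydney', 'A1', 'B1', 'C1'],
     ['XZ', 'Sydney', 'A1', 'B1', 'C2'],
     ['YY', 'Melbourne', 'A1', 'B1', 'C1'],
     ['YZ', 'Melbourne', 'A1', 'B1', 'C1'],
     ['ZZ', 'Melbourne', 'A2', 'B1', 'C1']],
    columns=['Order', 'Site', 'Feature1', 'Feature2', 'Feature3'])

# Pair each fruit with every order at the same site.
pairs = pd.merge(df1, df2, on='Site')

# Count how many of the three features agree for each fruit/order pair.
feature_cols = ['Feature1', 'Feature2', 'Feature3']
pairs['Matches'] = sum((pairs[c + '_x'] == pairs[c + '_y']).astype(int) for c in feature_cols)

# Best candidate per fruit; nothing here stops an order from being reused.
best = (pairs.sort_values('Matches', ascending=False)
             .drop_duplicates('Fruit')[['Fruit', 'Order', 'Matches']])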
How can I solve this?
CodePudding user response:
This is a perfect use case for melt as a starting point before merging your two dataframes. melt flattens your feature columns (Feature1, Feature2, Feature3) into long format. After merging on Site and variable, you have two columns to compare: value_x (features from df1) and value_y (features from df2). With query, keep only the rows where these two columns are equal. Then use value_counts on the (Fruit, Order) columns and reshape the result with rename and reset_index. Finally, drop_duplicates on the Fruit column keeps the first count per fruit, which is the highest one because the Matches column is already sorted in descending order.
You can execute this one-liner step by step to see how the dataframe is transformed:
out = (
    pd.merge(df1.melt(['Fruit', 'Site']),       # long format: one row per fruit/feature
             df2.melt(['Order', 'Site']),       # long format: one row per order/feature
             on=['Site', 'variable'])           # align rows with the same site and feature name
      .query('value_x == value_y')              # keep only matching feature values
      .value_counts(['Fruit', 'Order'])         # count matched features per (Fruit, Order), sorted descending
      .rename('Matches')
      .reset_index()
      .drop_duplicates('Fruit')                 # keep the first (highest-count) order for each fruit
)
Final output:
>>> out
     Fruit Order  Matches
0    Apple    XY        3
1   Banana    XY        3
6   Cherry    XY        2
7   Durian    YY        2
12   Grape    ZZ        1
Note: check my result carefully, because it is not equal to your expected output; this one-liner does not remove an order from the candidate pool once it has been matched to a fruit.
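If you need the sequential behaviour described in the question (each order matched at most once, fruits processed in their table order, and the earlier order preferred on ties), one option is a small greedy loop on top of the same match counts. This is only a sketch under those assumptions, reusing df1/df2 from the question; it is not part of the one-liner above:
import pandas as pd

# df1 / df2 are assumed to hold Table1 and Table2 from the question.
# Match counts for every (Fruit, Order) pair at the same site, as above.
matches = (
    pd.merge(df1.melt(['Fruit', 'Site']),
             df2.melt(['Order', 'Site']),
             on=['Site', 'variable'])
      .query('value_x == value_y')
      .groupby(['Fruit', 'Order'])
      .size()
      .rename('Matches')
      .reset_index()
)

order_pos = {o: i for i, o in enumerate(df2['Order'])}   # so the earlier order wins ties

used = set()    # orders already assigned to a fruit
rows = []
for fruit in df1['Fruit']:                               # fruits in their table order
    cand = matches[(matches['Fruit'] == fruit) & ~matches['Order'].isin(used)]
    if cand.empty:                                       # a fruit with no remaining match is skipped
        continue
    best = (cand.assign(pos=cand['Order'].map(order_pos))
                .sort_values(['Matches', 'pos'], ascending=[False, True])
                .iloc[0])
    used.add(best['Order'])
    rows.append({'Fruit': fruit, 'Order': best['Order'], 'Matches': best['Matches']})

out = pd.DataFrame(rows)
On the sample data this should reproduce the expected output (Apple/XY, Banana/XX, Cherry/XZ, Durian/YY, Grape/ZZ). Note that it is a greedy pass in fruit order, not a globally optimal assignment.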
CodePudding user response:
One way you can do this is to group by "Fruit" and "Site" by setting them as the index, and then find the number of unique values in each row:
>>> import pandas as pd
>>> df = pd.DataFrame(
... {
... "Fruit": ["Apple", "Banana", "Grape"],
... "Site": ["Sydney", "Melbourne", "Melbourne"],
... "Feature1": [1, 2, 3],
... "Feature2": [1, 1, 1],
... "Feature3": [1, 3, 3],
... }
... )
>>>
>>> df = df.set_index(['Fruit', 'Site'])
>>>
>>> df.loc[df.nunique(axis=1) == 1, 'Order'] = 1
>>> df.loc[df.nunique(axis=1) > 1, 'Order'] = 2
>>>
>>> df
                  Feature1  Feature2  Feature3  Order
Fruit  Site
Apple  Sydney            1         1         1    1.0
Banana Melbourne         2         1         3    2.0
Grape  Melbourne         3         1         3    2.0
>>>
>>> # If you just want the counts
>>>
>>> df['Order2'] = df.nunique(axis=1)
>>> df
                  Feature1  Feature2  Feature3  Order  Order2
Fruit  Site
Apple  Sydney            1         1         1    1.0       1
Banana Melbourne         2         1         3    2.0       3
Grape  Melbourne         3         1         3    2.0       3
Update:
You changed the question after I provided this answer. Your question is completely different now.