Home > Software design >  Find rows with similar values in another dataframe
Find rows with similar values in another dataframe

Time:11-28

Edit: Updated with expected output and explanations

I have a table 'Table1' as:

Fruit  Site      Feature1  Feature2  Feature3
Apple  Sydney    A1        B1        C1 
Banana Sydney    A1        B1        C1 
Cherry Sydney    A1        B2        C1 
Durian Melbourne A1        B1        C2 
Grape  Melbourne A2        B2        C2 

Another table, 'Table2' as:

Order  Site      Feature1  Feature2  Feature3
XX     Sydney    A2        B1        C1  
XY     Sydney    A1        B1        C1 
XZ     Sydney    A1        B1        C2 
YY     Melbourne A1        B1        C1 
YZ     Melbourne A1        B1        C1 
ZZ     Melbourne A2        B1        C1 

Table2 has more rows than Table1. Features and Orders have text values.

I am trying to do the following:

For each fruit and at the same site, find the order with the most number of matches across Feature1,Feature2,Feature3. Here, a high number matches is (for example) the same values for all three features. A lower match is if there is only same values for only two features.

So starting with the first fruit, search for order (at the same site) which has most number of matches, then go to next fruit, and only look for orders which are left after the first match.

Expected Output

Fruit      Order  Matches    
Apple      XY     3
Banana     XX     2
Cherry     XZ     1
Durian     YY     2
Grape      ZZ     1 

Apple has Order XY because it has the most number of matches. With Order XY already matched, the best match for Banana is Order XX. Even though Order XX and XZ have same number of matches with Banana, the earlier order is preferred. For example, Durian was matched to order YY and not order YZ, even though they have same number of matches.

If a fruit has no matches, it won't be included in the output.

My plan was to use merge on site, so we see all orders for a fruit at the same site, and then select the one with highest matches. However, I don't know how I will make sure that the matched order does not show up for the rest of the fruits.

How can I solve this?

CodePudding user response:

This is a perfect use case for melt as starting point before merge your two dataframes. melt flat your value columns (FeatureX). After merging, you have two columns values_x (features from df1) and values_y (features from df2) you need to compare.

Now, with query, keep rows where this 2 columns are equals. Then, use value_counts on (Fruit, Order) columns then reformat the dataframe with rename and reset_index. Finally, drop_duplicates on Fruit column to keep the first count, the highest value because the Matches column is already sorted.

You can execute this one-line step by step to see the transformation of the dataframe:

out = pd.merge(df1.melt(['Fruit', 'Site']),
               df2.melt(['Order', 'Site']),
               on=['Site', 'variable']) \
        .query('value_x == value_y') \
        .value_counts(['Fruit', 'Order']) \
        .rename('Matches') \
        .reset_index() \
        .drop_duplicates('Fruit')

Final output:

>>> out
     Fruit Order  Matches
0    Apple    XY        3
1   Banana    XY        3
6   Cherry    XY        2
7   Durian    YY        2
12   Grape    ZZ        1

Note: check carefully my result because it's not equal to your output.

CodePudding user response:

One way you can do this is to group by "Fruit" and "Site" by setting them as the index and then finding the number of unique values in each row

>>> import pandas as pd

>>> df = pd.DataFrame(
...    {
...        "Fruit": ["Apple", "Banana", "Grape"],
...        "Site": ["Sydney", "Melbourne", "Melbourne"],
...        "Feature1": [1, 2, 3],
...        "Feature2": [1, 1, 1],
...        "Feature3": [1, 3, 3],
...    }
... )
>>>
>>> df = df.set_index(['Fruit', 'Site'])
>>>
>>> df.loc[df.nunique(axis=1) == 1, 'Order'] = 1
>>> df.loc[df.nunique(axis=1) > 1, 'Order'] = 2
>>>
>>> df
                  Feature1  Feature2  Feature3  Order
Fruit  Site
Apple  Sydney            1         1         1    1.0
Banana Melbourne         2         1         3    2.0
Grape  Melbourne         3         1         3    2.0
>>>
>>> # If you just want the counts
>>>
>>> df['Order2'] = df.nunique(axis=1)
>>> df
                  Feature1  Feature2  Feature3  Order  Order2
Fruit  Site
Apple  Sydney            1         1         1    1.0       1
Banana Melbourne         2         1         3    2.0       3
Grape  Melbourne         3         1         3    2.0       3

Update:

You changed the question after I provided this answer. Your question is completely different now.

  • Related