Home > database >  Compare two dataframe column values and join with condition in python?
Compare two dataframe column values and join with condition in python?

Time:03-21

I need to join the below dataframe based on some condition.

df1:
    Id             Value
    [101,102,103]   10001
    [101,102,104]   10000
    [101,102,105]   10002
    [101,107,105]   10003

df2:

    Id                 Product_Name
    [101,102,103,104]  Shoe
    [101,102,109,104]  jeans
    [101,105,102,108]  make-up
    [101,105,106,118]  shirt

df_output

Id             Value    Product_Name
[101,102,103]   10001   Shoe           -- Every value present in df2.Id list
[101,102,104]   10000   Jeans          -- Every value present in df2.Id list
[101,102,105]   10002   Make-Up        -- Every value present in df2.Id list
[101,107,105]   10003   NaN            -- Not Every Value matches in df2.Id list.

I need to join two dataframe df1, df2 based on Id column but every element should be in df.Id list that's when we consider it a match.

Python Code:

import pandas as pd

df_output = pd.merge(df1, df2, on='Id', how='left')

CodePudding user response:

While this isn't a highly efficient solution, you can use some sets to solve this problem.

matches = df1["Id"].apply(set) <= df2["Id"].apply(set)

out = df1.copy()
out.loc[matches, df2.columns.difference(["Id"])] = df2

print(out)
                Id  Value Product_Name
0  [101, 102, 103]  10001         Shoe
1  [101, 102, 104]  10000        jeans
2  [101, 102, 105]  10002      make-up
3  [101, 107, 105]  10003          NaN

In the above snippet:

  1. matches = df1["Id"].apply(set) <= df2["Id"].apply(set) returns a boolean Series that is True where the contents of each row in df1['Id'] is in the corresponding row in df2['Id'], and False otherwise
  2. Instead of performing an actual merge we can simply align the 2 DataFrames on the aforementioned boolean Series

CodePudding user response:

IIUC, you could concatenate on axis=1; then check if df1.Id is a subset of df2.Id for each row and use where to mask "Product_Name" depending on check results.

out = pd.concat([df1, df2],  axis=1)
out['Product_Name'] = out['Product_Name'].where(out['Id'].apply(lambda x: set(x.iat[0]).issubset(x.iat[1]), axis=1))
out = out.loc[:, ~out.columns.duplicated()]

Output:

                Id  Value Product_Name
0  [101, 102, 103]  10001         Shoe
1  [101, 102, 104]  10000        jeans
2  [101, 102, 105]  10002      make-up
3  [101, 107, 105]  10003          NaN
  • Related