How to find out which keys are involved in the join-CodePudding

df1 contains products characteristics in different columns. In the second dataframe, the characteristics of the products are located in one column. df2 are joined (left) to df1 using three keys as many times as there are columns of characteristics in the df1. 1.How to add a column to the second dataframe indicating whether the row is involved in the join? 2.How to make join more elegant))

d1 = {"Product": ["product1", "product1", "product2", "product1", "product2", "product1", "product3"], "Store": ["Store1", "Store2", "Store1", "Store1", "Store1", "Store2", "Store2"], "Attribute1": pd.Series(["red", "green", "green", "blue"], index=[0, 2, 5, 6]), "Attribute2": pd.Series(["red", "green", "red"], index=[1, 3, 4])}
df1 = pd.DataFrame(data=d1)

d2 = {"Product": ["product1", "product1", "product2", "product1", "product2", "product1", "product3", "product3", "product2"], "Store": ["Store1", "Store2", "Store1", "Store1", "Store1", "Store2", "Store2", "Store1", "Store2"], "Attribute": ["red", "red", "green", "green", "red", "green", "blue", "blue", "red"], "Package": ["type1", "type2", "type3", "type4", "type5", "type6", "type2", "type4", "type2"]}
df2 = pd.DataFrame(data=d2)

df3 = pd.merge(df1, df2, how = 'left', left_on = ['Product', 'Store', 'Attribute1'], right_on = ['Product', 'Store', 'Attribute'])
df3 = df3.drop(['Attribute'], axis=1)

df3 = pd.merge(df3, df2, how = 'left', left_on = ['Product', 'Store', 'Attribute2'], right_on = ['Product', 'Store', 'Attribute'])
df3 = df3.drop(['Attribute'], axis=1)

df3["Package"] = df3["Package_x"].map(str, na_action=None)   df3["Package_y"].map(str, na_action=None)
df3 = df3.drop(['Package_x'], axis=1)
df3 = df3.drop(['Package_y'], axis=1)
df3["Package"] = df3["Package"].str.replace("nan","")

CodePudding user response：

IIUC, use indicator parameter:

df3 = pd.merge(df1, df2, how = 'left', left_on = ['Product', 'Store', 'Attribute1'], right_on = ['Product', 'Store', 'Attribute'], indicator=True)
df3 = df3.drop(['Attribute'], axis=1)
df3

Output:

    Product   Store Attribute1 Attribute2     _merge Package
0  product1  Store1        red        NaN       both   type1
1  product1  Store2        NaN        red  left_only   type2
2  product2  Store1      green        NaN       both   type3
3  product1  Store1        NaN      green  left_only   type4
4  product2  Store1        NaN        red  left_only   type5
5  product1  Store2      green        NaN       both   type6
6  product3  Store2       blue        NaN       both   type2

CodePudding user response：

If Attribute1 and Attribute2 columns are mutually exclusive, you can create a new column Attribute in df1 before merge:

out = (df1.assign(Attribute=df1['Attribute1'].fillna(df1['Attribute2']))
          .merge(df2, on=['Product', 'Store', 'Attribute'], how='left')
          .drop(columns='Attribute'))
print(out)

# Output
    Product   Store Attribute1 Attribute2 Package
0  product1  Store1        red        NaN   type1
1  product1  Store2        NaN        red   type2
2  product2  Store1      green        NaN   type3
3  product1  Store1        NaN      green   type4
4  product2  Store1        NaN        red   type5
5  product1  Store2      green        NaN   type6
6  product3  Store2       blue        NaN   type2

Update: if you have multiple AttributeX columns with a same prefix like "Attribute", you can replace:

df1['Attribute1'].fillna(df1['Attribute2'])

by:

df1.filter(like='Attribute').ffill(axis=1).iloc[:, -1]