Pandas Merge on Multiple DataFrames and Results in One Column-CodePudding

I want to perform a pandas merge with different rules. I have a sales table below with PRODUCT NAMES, PRODUCT LABEL and PACKAGING ID.

Mapping Tables

My goal is to obtain the following table:

The only condition is that we have priorities:

If there is a mapping on PACKAGING ID -> Use that allocation
If there is not mapping on PACKAGING ID, then use PRODUCT LABEL
At the last resource, if there is nothing on both, use PRODUCT NAME.

Thank you.

Using pd.merge creates multiple columns on the dataframe...and i want the results in the same columns and aligned to the table.

    
    merged_df = pd.merge(database, mapping.package_id, left_on="Package ID", right_on="Package ID", how="left")
    merged_df_2 = pd.merge(merged_df, mapping.product_label, left_on="Product Label", right_on="Product_Label", how="left") 
    merged_df_3 = pd.merge(merged_df_2, mapping.product_name, left_on="Product Name", right_on="Product_Name", how="left")
    
    return merged_df_3

CodePudding user response：

Here's a way to do what you're asking using join():

df = dfSales.join(dfPack.set_index('PACKAGING ID'), on='PACKAGING ID')
df = df.join(dfLabel.set_index('PRODUCT LABEL'), on='PRODUCT LABEL', rsuffix='_LABEL')
df = df.join(dfName.set_index('PRODUCT NAME'), on='PRODUCT NAME', rsuffix='_NAME').reset_index(drop=True)

pack = df['Allocation %'].notna()
df.loc[~pack, 'Allocation %'] = df['Allocation %_LABEL']
df.loc[~(pack | df['Allocation %_LABEL'].notna()), 'Allocation %'] = df['Allocation %_NAME']

df = df.drop(columns=['Allocation %_LABEL', 'Allocation %_NAME']).drop_duplicates().reset_index(drop=True)

Input:

dfSales:
  PRODUCT NAME PRODUCT LABEL  PACKAGING ID  AMOUNT
0       Name_1      Label_66            57     123
1       Name_2      Label_67            10     123
2       Name_3      Label_68            52     123
3       Name_4      Label_69            16     123
4       Name_5      Label_70            48     123
5       Name_6      Label_71            21     123
6       Name_7      Label_72            33     123
7       Name_8      Label_73            75     123
8       Name_9      Label_74            19     123
9      Name_10      Label_75            97     123

dfPack:
   PACKAGING ID  Allocation %
0            52           0.1
1            52           0.2
2            52           0.7
3            16           0.2
4            16           0.2
5            16           0.2
6            16           0.4

dfLabel:
   PRODUCT LABEL  Allocation %
0       Label_74           0.5
1       Label_74           0.5
2       Label_75           0.1
3       Label_75           0.1
4       Label_75           0.1
5       Label_75           0.1
6       Label_75           0.1
7       Label_75           0.1
8       Label_75           0.1
9       Label_75           0.1
10      Label_75           0.1
11      Label_75           0.1

dfName:
  PRODUCT NAME  Allocation %
0       Name_1           1.0
1       Name_2           1.0
2       Name_3           0.2
3       Name_3           0.8

Output:

   PRODUCT NAME PRODUCT LABEL  PACKAGING ID  AMOUNT  Allocation %
0        Name_1      Label_66            57     123           1.0
1        Name_2      Label_67            10     123           1.0
2        Name_3      Label_68            52     123           0.1
3        Name_3      Label_68            52     123           0.2
4        Name_3      Label_68            52     123           0.7
5        Name_4      Label_69            16     123           0.2
6        Name_4      Label_69            16     123           0.4
7        Name_5      Label_70            48     123           NaN
8        Name_6      Label_71            21     123           NaN
9        Name_7      Label_72            33     123           NaN
10       Name_8      Label_73            75     123           NaN
11       Name_9      Label_74            19     123           0.5
12      Name_10      Label_75            97     123           0.1

If you want to eliminate rows with null allocation, you can add the following as a final line in the code:

df = df[df['Allocation %'].notna()].reset_index(drop=True)

Output:

  PRODUCT NAME PRODUCT LABEL  PACKAGING ID  AMOUNT  Allocation %
0       Name_1      Label_66            57     123           1.0
1       Name_2      Label_67            10     123           1.0
2       Name_3      Label_68            52     123           0.1
3       Name_3      Label_68            52     123           0.2
4       Name_3      Label_68            52     123           0.7
5       Name_4      Label_69            16     123           0.2
6       Name_4      Label_69            16     123           0.4
7       Name_9      Label_74            19     123           0.5
8      Name_10      Label_75            97     123           0.1

CodePudding user response：

import pandas as pd
import numpy as np

# recreated data
df = pd.DataFrame({'PRODUCT NAME': ["Name_" str(x) for x in np.arange(1, 11, 1)],
                   'PRODUCT LABEL':["Label_" str(x) for x in np.arange(66, 76, 1)],
                   'PACKAGING ID': [57, 10, 52, 16, 48, 21, 33, 75, 19, 97],
                   'AMOUNT': [7997, 4974, 2862, 7541, 2904, 1030, 4457, 1316, 1092, 3279]})

m_pack = pd.DataFrame({'PACKAGING_ID': [52]*3   [16]*4,
                       'Allocation %': [.1, .2, .7, .2, .2, .2, .4]})
m_name = pd.DataFrame({'PRODUCT NAME': ["Name_1", "Name_2", "Name_3", "Name_4"],
                       'Allocation %': [1, 1, .2, .8]})
m_label = pd.DataFrame([["Label_74", .5]]*2   [["Label_75", .1]]*10, columns=["PRODUCT LABEL", "Allocation %"])

Merging, as per your example (with a slight difference):

# merge as your did, but setting index first so that these columns are not carried across
df2 = pd.merge(df, m_pack.set_index("PACKAGING_ID"), left_on="PACKAGING ID", right_index=True, how="left")
df2 = pd.merge(df2, m_label.set_index("PRODUCT LABEL"), left_on="PRODUCT LABEL", right_index=True, how="left")
df2 = pd.merge(df2, m_name.set_index("PRODUCT NAME"), left_on="PRODUCT NAME", right_index=True, how="left")

Creating the final dataframe with the data from df2:

# final df not including the "Allocation" columns.
final_df = df2.loc[:,~df2.columns.str.contains("Allocation")]
# fillna for each, progressing in order of merge/priority.
final_df["Allocation %"] = df2["Allocation %_x"].fillna(df2["Allocation %_y"]).fillna(df2["Allocation %"])
# add the final column of where the allocation came from.
final_df["Comes from"] = np.where(df2["Allocation %_x"].notna(), "(comes from PACKAGING_ID)",
                                  np.where(df2["Allocation %_y"].notna(), "(comes from PRODUCT LABEL)",
                                           "(comes from PRODUCT NAME)"))
final_df.reset_index(inplace=True)

final_df
#     index PRODUCT NAME PRODUCT LABEL  PACKAGING ID  AMOUNT  Allocation %                  Comes from
# 0       0       Name_1      Label_66            57    7997           1.0   (comes from PRODUCT NAME)
# 1       1       Name_2      Label_67            10    4974           1.0   (comes from PRODUCT NAME)
# 2       2       Name_3      Label_68            52    2862           0.1   (comes from PACKAGING_ID)
# 3       2       Name_3      Label_68            52    2862           0.2   (comes from PACKAGING_ID)
# 4       2       Name_3      Label_68            52    2862           0.7   (comes from PACKAGING_ID)
# 5       3       Name_4      Label_69            16    7541           0.2   (comes from PACKAGING_ID)
# 6       3       Name_4      Label_69            16    7541           0.2   (comes from PACKAGING_ID)
# 7       3       Name_4      Label_69            16    7541           0.2   (comes from PACKAGING_ID)
# 8       3       Name_4      Label_69            16    7541           0.4   (comes from PACKAGING_ID)
# 9       4       Name_5      Label_70            48    2904           NaN   (comes from PRODUCT NAME)
# 10      5       Name_6      Label_71            21    1030           NaN   (comes from PRODUCT NAME)
# 11      6       Name_7      Label_72            33    4457           NaN   (comes from PRODUCT NAME)
# 12      7       Name_8      Label_73            75    1316           NaN   (comes from PRODUCT NAME)
# 13      8       Name_9      Label_74            19    1092           0.5  (comes from PRODUCT LABEL)
# 14      8       Name_9      Label_74            19    1092           0.5  (comes from PRODUCT LABEL)
# 15      9      Name_10      Label_75            97    3279           0.1  (comes from PRODUCT LABEL)
# 16      9      Name_10      Label_75            97    3279           0.1  (comes from PRODUCT LABEL)
# 17      9      Name_10      Label_75            97    3279           0.1  (comes from PRODUCT LABEL)
# 18      9      Name_10      Label_75            97    3279           0.1  (comes from PRODUCT LABEL)
# 19      9      Name_10      Label_75            97    3279           0.1  (comes from PRODUCT LABEL)
# 20      9      Name_10      Label_75            97    3279           0.1  (comes from PRODUCT LABEL)
# 21      9      Name_10      Label_75            97    3279           0.1  (comes from PRODUCT LABEL)
# 22      9      Name_10      Label_75            97    3279           0.1  (comes from PRODUCT LABEL)
# 23      9      Name_10      Label_75            97    3279           0.1  (comes from PRODUCT LABEL)
# 24      9      Name_10      Label_75            97    3279           0.1  (comes from PRODUCT LABEL)

You will notice there were a couple of missing values in the Allocation % column, this is because there was no data for it in the PRODUCT NAME mapping. It looks like you have set them to 1, but I wasn't sure if I could do that. If that is the solution, add another .fillna(1) to the .fillna line.