Finding out which primary key id not work in a pandas merge-CodePudding

Source	Target

Two tables will be joined via composite keys ID1 and ID2 via pandas merge. We have a data testing tool that does the data analysis. It will perform the merge and filter whatever rows are not present on the left or right side of the merge into an external Missing In Source or Missing in Target table.

Missing in Target	Missing in Source

In this example in Source the second row with composite key ABC,345 is missing in Target. So that row will be filtered into Missing in Target. Similarly in Target the second row with composite key ABC,222 is missing in Source so it will be filtered into Missing in Source.

For the Missing ... tables the business wants to know "why exactly are they missing"--which composite key by row made the row missing? For example, for the row with ABC,345 in the Missing in Target above, ABC was present in both table rows but 345 wasn't. Therefore,ID2 with value 345 is the guilty key for this row.

CodePudding user response：

Try this using merge with how=outer and indicator:

import pandas as pd

df1 = pd.DataFrame({'ID1':'XYZ ABC EFG UK'.split(' '),
                    'ID2': [123,345, 789, 111],
                    'Data1':'car truck cat lion'.split(' '),
                    'Data2':'house sky mouse zion'.split(' ')})

df2 = pd.DataFrame({'ID1':'XYZ ABC ZAB UK'.split(' '),
                    'ID2':[123,222,789,111],
                    'Data1':'car truck cat lion'.split(' '),
                    'Data2':'house sky mouse zion'.split(' ')})

df_r = df1.merge(df2, how='outer', indicator='Ind')

df_source_missing = df_r.query('Ind == "left_only"')
df_target_missing = df_r.query('Ind == "right_only"')

Output:

print(df_source_missing)

   ID1  ID2  Data1  Data2        Ind
1  ABC  345  truck    sky  left_only
2  EFG  789    cat  mouse  left_only

print(df_target_missing)

   ID1  ID2  Data1  Data2         Ind
4  ABC  222  truck    sky  right_only
5  ZAB  789    cat  mouse  right_only

CodePudding user response：

I believe you want .compare:

import pandas as pd


data1 = {
    "ID1": ["XYZ", "ABC", "EFG", "IJK"],
    "ID2": [123, 345, 789, 111],
    "DATA1": ["car", "truck", "cat", "lion"],
    "DATA2": ["house", "sky", "mouse", "zion"]
}

data2 = {
    "ID1": ["XYZ", "ABC", "ZAB", "IJK"],
    "ID2": [123, 222, 789, 111],
    "DATA1": ["car", "truck", "cat", "lion"],
    "DATA2": ["house", "sky", "mouse", "zion"]
}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

df = (
    df1
    .compare(df2, align_axis=0)
    .rename(index={"self": "df1", "other": "df2"}, level=-1)
    .fillna("")
    .reset_index()
    .drop("level_0", axis=1)
    .rename(columns={"level_1": "source"})
)
print(df)

  source  ID1    ID2
0    df1       345.0
1    df2       222.0
2    df1  EFG       
3    df2  ZAB