I have two pandas DataFrames that I want to merge, but I want to specify a rule (function) for how shared cells (same index and column) are combined.
The two rules I am interested in are:
- If one cell is NaN and the other has a value, use the one with a value
- If two cells have different (non-NaN) values, raise an error
For example:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([["R", np.nan], ["R", "S"]], index=[0, 1], columns=["a", "b"])
df1
   a    b
0  R  NaN
1  R    S
Rule 1 - If one cell is NaN and the other has a value, use the one with a value
df2 = pd.DataFrame([["R", "S"], ["R", "S"]], index=[0, 2], columns=["a", "b"])
df2
   a  b
0  R  S
2  R  S
Now, when I merge them:
df1.merge(df2, how="outer")
   a    b
0  R  NaN
1  R    S
2  R    S
I would like it to fill the empty (NaN) cells with the non-NaN value from the other DataFrame, i.e.
   a  b
0  R  S
1  R  S
2  R  S
Rule 2 - If two cells have different (non-NaN) values, raise an error
In addition, I would like an "integrity validation" step that checks that any two non-NaN cells sharing the same index and column name have the same value.
For example:
df3 = pd.DataFrame([["R", "R"], ["R", "S"]], index=[1, 2], columns=["a", "b"])
df3
   a  b
1  R  R
2  R  S
If I were to merge this with df1, I would get the following:
df1.merge(df3, how="outer")
   a    b
0  R  NaN
1  R    S
2  R    R
But I want this to raise an error, since the cell [1, "b"] is S in df1 but R in df3.
CodePudding user response:
The solution I came up with uses DataFrame.combine.
First, create a function that handles the logic of comparing two cells:
def comp(c1, c2):
    # Rule 1: if either cell is NaN, keep the other value
    if pd.isna(c1):
        return c2
    elif pd.isna(c2):
        return c1
    # Rule 2: two different non-NaN values are an error
    elif c1 != c2:
        raise ValueError(f"{c1} != {c2}")
    else:
        return c1
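For example (a quick illustration of the helper on scalar inputs):
comp("R", np.nan)   # -> "R"  (Rule 1: keep the non-NaN value)
comp(np.nan, "S")   # -> "S"
comp("R", "R")      # -> "R"  (matching values pass through)
comp("R", "S")      # raises ValueError: R != S  (Rule 2)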
The function that combine calls takes two Series, so we just create a lambda that applies comp to each pair of elements:
Rule 1
f = lambda s1, s2: np.array([comp(c1, c2) for c1, c2 in zip(s1, s2)])
df1.combine(df2, f, overwrite=False)
   a  b
0  R  S
1  R  S
2  R  S
Rule 2
df1.combine(df3, f, overwrite=False)
...clipped error message...
ValueError: S != R
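As a variation (an equivalent way to build f, not required), Series.combine can do the element-wise pairing instead of the list comprehension:
f = lambda s1, s2: s1.combine(s2, comp)
df1.combine(df2, f, overwrite=False)
Since DataFrame.combine aligns the two frames before calling f, the column Series it passes share the same index, and Series.combine simply applies comp to each aligned pair.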
CodePudding user response:
Here's one way using concat and groupby.apply, where we fill NaN within each index group (ffill then bfill) and drop_duplicates. Then we check whether the resulting index is duplicated; if it is, some group kept two different rows, which means a value didn't match across the DataFrames, so we raise an error.
def func(df1, df2):
    # Stack both frames, fill NaN within each index group, then drop exact duplicates
    out = (pd.concat((df1, df2))
             .groupby(level=0)
             .apply(lambda x: x.ffill().bfill().drop_duplicates())
             .droplevel(-1))
    # A duplicated index means a group kept two different rows, i.e. a mismatch
    if out.index.duplicated().any():
        raise ValueError("There's a mismatch.")
    else:
        return out
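For reference, a usage sketch with the example frames above (assuming a recent pandas where groupby.apply with the default group_keys=True adds the group key as an extra index level, which is what droplevel(-1) relies on):
func(df1, df2)
#    a  b
# 0  R  S
# 1  R  S
# 2  R  S
func(df1, df3)
# ValueError: There's a mismatch.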