I have two pandas DataFrames that I want to merge, but I want to specify a rule (function) for how shared cells (same index and column) are combined.
The two rules I am interested in are:
- If one cell is NaN and the other has a value, use the one with a value
- If two cells have different (non-NaN) values, raise an error
For example:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([["R", np.nan], ["R", "S"]], index=[0, 1], columns=["a", "b"])
df1
   a    b
0  R  NaN
1  R    S
Rule 1 - If one cell is NaN and the other has a value, use the one with a value
df2 = pd.DataFrame([["R", "S"], ["R", "S"]], index=[0, 2], columns=["a", "b"])
df2
   a  b
0  R  S
2  R  S
Now, when I merge them:
df1.merge(df2, how="outer")
   a    b
0  R  NaN
1  R    S
2  R    S
I would like it to fill the empty (NaN) cells with the non-NaN value from the other DataFrame, i.e.
   a  b
0  R  S
1  R  S
2  R  S
Rule 2 - If two cells have different (non-NaN) values, raise an error
In addition, I would like an "integrity validation" step that checks that any two non-NaN cells sharing the same index and column name have the same value.
For example:
df3 = pd.DataFrame([["R", "R"], ["R", "S"]], index=[1, 2], columns=["a", "b"])
df3
   a  b
1  R  R
2  R  S
If I were to merge this with df1, I would get the following:
df1.merge(df3, how="outer")
   a    b
0  R  NaN
1  R    S
2  R    R
But I want this to raise an error, since the cell [1, "b"] is S in df1 but R in df3.
CodePudding user response:
The solution I came up with uses DataFrame.combine.
First, create a function that handles the logic of comparing two cells:
def comp(c1, c2):
    # Rule 1: if either cell is NaN, keep the other value
    if pd.isna(c1):
        return c2
    elif pd.isna(c2):
        return c1
    # Rule 2: two different non-NaN values are an error
    elif c1 != c2:
        raise ValueError(f"{c1} != {c2}")
    else:
        return c1
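For example (a quick illustration of the helper on scalar inputs):
comp("R", np.nan)   # -> "R"  (Rule 1: keep the non-NaN value)
comp(np.nan, "S")   # -> "S"
comp("R", "R")      # -> "R"  (matching values pass through)
comp("R", "S")      # raises ValueError: R != S  (Rule 2)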
The function that combine calls takes two Series, so we just create a lambda that applies comp to each pair of elements:
Rule 1
f = lambda s1, s2: np.array([comp(c1, c2) for c1, c2 in zip(s1, s2)])
df1.combine(df2, f, overwrite=False)
   a  b
0  R  S
1  R  S
2  R  S
Rule 2
df1.combine(df3, f, overwrite=False)
...clipped error message...
ValueError: S != R
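As a variation (an equivalent way to build f, not required), Series.combine can do the element-wise pairing instead of the list comprehension:
f = lambda s1, s2: s1.combine(s2, comp)
df1.combine(df2, f, overwrite=False)
Since DataFrame.combine aligns the two frames before calling f, the column Series it passes share the same index, and Series.combine simply applies comp to each aligned pair.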
CodePudding user response:
Here's one way using concat and groupby.apply, where we fill NaN within each index group (ffill then bfill) and drop_duplicates. Then we check whether the resulting index is duplicated; if it is, some group kept two different rows, which means a value didn't match across the DataFrames, so we raise an error.
def func(df1, df2):
    # Stack both frames, fill NaN within each index group, then drop exact duplicates
    out = (pd.concat((df1, df2))
             .groupby(level=0)
             .apply(lambda x: x.ffill().bfill().drop_duplicates())
             .droplevel(-1))
    # A duplicated index means a group kept two different rows, i.e. a mismatch
    if out.index.duplicated().any():
        raise ValueError("There's a mismatch.")
    else:
        return out
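For reference, a usage sketch with the example frames above (assuming a recent pandas where groupby.apply with the default group_keys=True adds the group key as an extra index level, which is what droplevel(-1) relies on):
func(df1, df2)
#    a  b
# 0  R  S
# 1  R  S
# 2  R  S
func(df1, df3)
# ValueError: There's a mismatch.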