How can I merge partial data into a given DataFrame without changing the values it does not cover? Here is a minimal example:
>>> import pandas as pd
>>> import numpy as np
>>> data = {
... "Feat_A": ["INVALID", "INVALID", "INVALID"],
... "Feat_B": ["INVALID", "INVALID", "INVALID"],
... "Key": [12, 25, 99],
... }
>>>
>>> origin = pd.DataFrame(data=data)
>>> data = {"Feat_A": [1, np.nan], "Feat_B": [np.nan, 2], "Key": [12, 99]}
>>> new = pd.DataFrame(data=data)
>>> origin = origin.merge(
... new[["Key", "Feat_A", "Feat_B"]],
... on="Key",
... how="left",
... )
>>>
>>> origin
Feat_A_x Feat_B_x Key Feat_A_y Feat_B_y
0 INVALID INVALID 12 1.0 NaN
1 INVALID INVALID 25 NaN NaN
2 INVALID INVALID 99 NaN 2.0
This is what I am looking for:
# Feat_A Feat_B Key
# 0 1.0 INVALID 12
# 1 INVALID INVALID 25
# 2 INVALID 2.0 99
CodePudding user response:
First set "INVALID" cells to NAN then you can set key
as index and use fillna method to fill the values from new to origin.
import numpy as np
# Turn the INVALID markers into NaN so fillna can overwrite them
origin = origin.replace("INVALID", np.nan)
# Align on Key, fill from new, then restore INVALID where nothing was supplied
origin_new = origin.set_index("Key").fillna(new.set_index("Key")).fillna("INVALID")
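With the sample frames from the question, origin_new.reset_index() should reproduce the desired result (Key just ends up as the first column):
   Key   Feat_A   Feat_B
0   12      1.0  INVALID
1   25  INVALID  INVALID
2   99  INVALID      2.0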
CodePudding user response:
Found a pretty solution here: Overwrite columns in DataFrames of different sizes pandas
Adapted to the column names in this question (the linked answer uses an ids column):
df_result = new.set_index('Key').combine_first(origin.set_index('Key'))
df_result = df_result.reset_index()
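Note the direction: combine_first keeps the caller's non-NaN values and falls back to the argument, so new has to be the caller for its partial values to win. With the frames from the question this should reproduce the desired table, with Key as the first column.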
Furthermore, df.update() seems to do something similar, but I did not figure out how to handle the different sizes.
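df.update() aligns on the index and ignores NaN values in its argument, so one sketch that handles the size mismatch, assuming the frames from the question, is to index both sides by Key first:
updated = origin.set_index('Key')
updated.update(new.set_index('Key'))  # in place; NaN cells in new do not overwrite
origin = updated.reset_index()
Rows of new whose Key is missing from origin are simply ignored by update, which is the behaviour wanted here.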