Home > Back-end >  How to merge partial data into a given df
How to merge partial data into a given df

Time:04-21

How can I merge partial data into a given df, without changing unknown values? Here is a minimal example:

>>> import pandas as pd
>>> import numpy as np
>>> data = {
...     "Feat_A": ["INVALID", "INVALID", "INVALID"],
...     "Feat_B": ["INVALID", "INVALID", "INVALID"],
...     "Key": [12, 25, 99],
... }
>>> 
>>> origin = pd.DataFrame(data=data)
>>> data = {"Feat_A": [1, np.nan], "Feat_B": [np.nan, 2], "Key": [12, 99]}
>>> new = pd.DataFrame(data=data)
>>> origin = origin.merge(
...     new[["Key", "Feat_A", "Feat_B"]],
...     on="Key",
...     how="left",
... )
>>> 
>>> origin
  Feat_A_x Feat_B_x  Key  Feat_A_y  Feat_B_y
0  INVALID  INVALID   12       1.0       NaN
1  INVALID  INVALID   25       NaN       NaN
2  INVALID  INVALID   99       NaN       2.0

This is what I am looking for:

#    Feat_A   Feat_B    Key  
# 0  1.0      INVALID   12
# 1  INVALID  INVALID   25   
# 2  INVALID  2.0       99   

CodePudding user response:

First set "INVALID" cells to NAN then you can set key as index and use fillna method to fill the values from new to origin.

import numpy as np
origin = origin.map({'INVALID':np.nan})
originNew=origin.set_index('Key').fillna(new.set_index("Key"))

CodePudding user response:

Found pretty solution here: Overwrite columns in DataFrames of different sizes pandas

df_result = df2.set_index('ids').combine_first(df1.set_index('ids'))
df_result.reset_index()

Furthermore, df.update() seems to do something like that, but I did not figure out how to take care of different sizes.

  • Related