Replace values in dataframe with values from another dataframe of different size

Time:02-03

I have two dataframes of different sizes. The first has only a few rows and the second has a few hundred. For the rows in the second dataframe whose key columns match a row in the first, I want to overwrite some of the other columns in the second with the values from the first.

The first dataframe is (and could potentially have more rows)

label1 label2 beta1 beta2
123 456 1 2

and the second is (the dashes are to indicate more rows)

label1 label2 beta1 beta2
892 531 0.01 0.002
541 414 0.04 0.03
123 456 0.02 0.05
--- --- --- ---

The expected result is shown below. The overlapping rows could be at any index in df2 (but each will appear only once).

label1 label2 beta1 beta2
892 531 0.01 0.002
541 414 0.04 0.03
123 456 1 2
--- --- --- ---

because label1 and label2 match in both dataframes, so beta1 and beta2 are taken from the first dataframe.

I have tried to filter and replace the values in the second dataframe with

df2[['beta1', 'beta2']] = df1[(df1['label1'].isin(df2['label1'])) & (df1['label2'].isin(df2['label2']))][['beta1', 'beta2']].values

I get the error that the indices are not of equal length and I understand that. How can I "force" it to ignore that and just replace the overlapping values? I don't know beforehand which values I have in df1 for label1 and label2 so I don't want to hardcode any values. I guess I could brute force it and take out the values from df1 but there must be a pythonic way to do this.

Edit: this is the code I used with @Timeless's solution.

    df1 = pd.DataFrame({'label1': [123], 'label2': [456],
                        'beta1': [1], 'beta2': [2]})
    df2 = pd.DataFrame({'label1': [541, 123], 'label2': [414, 456],
                        'beta1': [0.01, 0.04], 'beta2': [0.002, 0.03]})
    idx, cols = ["label1", "label2"], ["beta1", "beta2"]

    mask = (df2.set_index(idx).index
            .isin(df1.set_index(idx).index))
    df2.loc[mask, cols] = df1[cols]

Output:

    print(df2)
   
        label1 label2 beta1 beta2
    0    541   414    0.01  0.002
    1    123   456    NaN   NaN

CodePudding user response:

Since you started with isin, here is a similar approach (but on the index) using boolean indexing:

idx, cols = ["label1", "label2"], ["beta1", "beta2"]

mask = (df2.set_index(idx).index
        .isin(df1.set_index(idx).index))

df2.loc[mask, cols] = df1[cols]

Output:

print(df2)

   label1  label2  beta1  beta2
0     123     456   1.00   2.00
1     541     414   0.04   0.03
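
The assignment `df2.loc[mask, cols] = df1[cols]` aligns the two frames on their row index, so it fills the right values only when a matching row happens to carry the same index label in both frames. The edit in the question shows what happens otherwise: the matching row sits at index 1 in df2 but index 0 in df1, and the result is NaN. A minimal sketch of one way around this, assuming the (label1, label2) pairs are unique in df1, is to move the key columns into the index and let `DataFrame.update` do the matching. The names `keyed` and `result` are illustrative only:

```python
import pandas as pd

# Data from the question's edit (betas in df1 written as floats).
df1 = pd.DataFrame({"label1": [123], "label2": [456],
                    "beta1": [1.0], "beta2": [2.0]})
df2 = pd.DataFrame({"label1": [541, 123], "label2": [414, 456],
                    "beta1": [0.01, 0.04], "beta2": [0.002, 0.03]})

idx, cols = ["label1", "label2"], ["beta1", "beta2"]

# Put the key columns in the index so rows are matched on
# (label1, label2) instead of on their original row labels.
keyed = df2.set_index(idx)

# update() modifies `keyed` in place, overwriting values only where
# df1 has a matching key and a non-NaN value.
keyed.update(df1.set_index(idx)[cols])

# Restore the key columns as ordinary columns.
result = keyed.reset_index()
print(result)
```

Rows of df2 with no counterpart in df1 are left untouched, and the row order of df2 is preserved.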

CodePudding user response:

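Here is an approach using the np.where() function from numpy:

import numpy as np

# Note: DataFrame.isin() with a DataFrame argument compares cells only
# where both the row index and the column labels match.
mask = df2[["label1", "label2"]].isin(df1[["label1", "label2"]]).all(axis=1)
# .values[0] takes the value from the first row of df1.
df2["beta1"] = np.where(mask, df1["beta1"].values[0], df2["beta1"])
df2["beta2"] = np.where(mask, df1["beta2"].values[0], df2["beta2"])
print(df2)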
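
If df1 may contain several rows, the np.where() idea can be combined with a left merge so that each df2 row picks up the values for its own (label1, label2) pair. This is only a sketch under assumptions: the key pairs are taken to be unique in df1, the second row of df1 is made-up illustration data, and the `_new` suffix and the name `merged` are arbitrary choices.

```python
import numpy as np
import pandas as pd

# df2 uses the rows shown in the question; the second row of df1
# (892, 531, 3.0, 4.0) is hypothetical, added to show several matches.
df1 = pd.DataFrame({"label1": [123, 892], "label2": [456, 531],
                    "beta1": [1.0, 3.0], "beta2": [2.0, 4.0]})
df2 = pd.DataFrame({"label1": [892, 541, 123], "label2": [531, 414, 456],
                    "beta1": [0.01, 0.04, 0.02], "beta2": [0.002, 0.03, 0.05]})

idx, cols = ["label1", "label2"], ["beta1", "beta2"]

# A left merge keeps every row of df2 in its original order. The df1
# values arrive in "<col>_new" columns, with NaN where no key matched.
merged = df2.merge(df1[idx + cols], on=idx, how="left", suffixes=("", "_new"))

for col in cols:
    new = merged[col + "_new"]
    # Take the df1 value where one was found, otherwise keep df2's value.
    df2[col] = np.where(new.notna(), new, merged[col])

print(df2)
```

Because np.where() returns a plain array, the assignment back into df2 is positional. That is safe here only because unique keys in df1 guarantee the merge returns exactly one row per df2 row.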

You can also use the df.update() and .isin() methods:

(df2.update(df2[df2[["label1", "label2"]].isin(df1[["label1", "label2"]]
                    .to_dict(orient="list")).all(axis=1)]
                    .assign(beta1=df1["beta1"], beta2=df1["beta2"])))
print(df2)

   label1  label2  beta1  beta2
0     123     456   1.00   2.00
1     541     414   0.04   0.03