I have two dataframes of different sizes. The first has only a few rows and the second has a few hundred. For the rows in the second whose label columns match a row in the first, I want to overwrite some of the other columns in the second with the values from the first.
The first dataframe is (and could potentially have more rows)
label1 | label2 | beta1 | beta2 |
---|---|---|---|
123 | 456 | 1 | 2 |
and the second is (the dashes are to indicate more rows)
label1 | label2 | beta1 | beta2 |
---|---|---|---|
892 | 531 | 0.01 | 0.002 |
541 | 414 | 0.04 | 0.03 |
123 | 456 | 0.02 | 0.05 |
--- | --- | --- | --- |
The expected result is below. The overlapping rows can sit at any index in df2 (but each will appear only once).
label1 | label2 | beta1 | beta2 |
---|---|---|---|
892 | 531 | 0.01 | 0.002 |
541 | 414 | 0.04 | 0.03 |
123 | 456 | 1 | 2 |
--- | --- | --- | --- |
The last row changes because its label1 and label2 match a row in the first dataframe, so its beta1 and beta2 are taken from there.
I have tried to filter and replace the values in the second dataframe with
df2[['beta1', 'beta2']] = df1[(df1['label1'].isin(df2['label1'])) & (df1['label2'].isin(df2['label2']))]['beta1', 'beta2'].values
I get the error that the indices are not of equal length and I understand that. How can I "force" it to ignore that and just replace the overlapping values? I don't know beforehand which values I have in df1 for label1 and label2 so I don't want to hardcode any values. I guess I could brute force it and take out the values from df1 but there must be a pythonic way to do this.
Edit: this is the code I used with @Timeless's solution.
import pandas as pd

df1 = pd.DataFrame({'label1': [123], 'label2': [456],
                    'beta1': [1], 'beta2': [2]})
df2 = pd.DataFrame({'label1': [541, 123], 'label2': [414, 456],
                    'beta1': [0.01, 0.04], 'beta2': [0.002, 0.03]})
idx, cols = ["label1", "label2"], ["beta1", "beta2"]
mask = (df2.set_index(idx).index
.isin(df1.set_index(idx).index))
df2.loc[mask, cols] = df1[cols]
Output:
print(df2)
   label1  label2  beta1  beta2
0     541     414   0.01  0.002
1     123     456    NaN    NaN
CodePudding user response:
Since you started with isin, here is a similar approach (but on the index) with boolean indexing:
idx, cols = ["label1", "label2"], ["beta1", "beta2"]
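mask = (df2.set_index(idx).index
        .isin(df1.set_index(idx).index))
# Note: the right-hand side is aligned on the row index, so this only fills
# rows whose index label also exists in df1; other matched rows become NaN
df2.loc[mask, cols] = df1[cols]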
Output:
print(df2)
   label1  label2  beta1  beta2
0     123     456   1.00   2.00
1     541     414   0.04   0.03
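The NaN values reported in the question's edit come from index alignment: `df2.loc[mask, cols] = df1[cols]` matches rows by index label, and there the matching row sits at index 1 in df2 but index 0 in df1. A minimal sketch of one way around this, assuming each label pair is unique in df1, is to look the values up by label and assign them positionally. The names `lookup` and `keys` are illustrative, and the sample data is taken from the question's tables.

```python
import pandas as pd

df1 = pd.DataFrame({"label1": [123], "label2": [456],
                    "beta1": [1], "beta2": [2]})
df2 = pd.DataFrame({"label1": [892, 541, 123],
                    "label2": [531, 414, 456],
                    "beta1": [0.01, 0.04, 0.02],
                    "beta2": [0.002, 0.03, 0.05]})

idx, cols = ["label1", "label2"], ["beta1", "beta2"]

# Index df1 by the label pair so values can be looked up by label
lookup = df1.set_index(idx)[cols]

# One (label1, label2) key per row of df2, in row order
keys = pd.MultiIndex.from_frame(df2[idx])
mask = keys.isin(lookup.index)

# Fetch replacement values in df2's row order; to_numpy() drops the index,
# so the assignment is positional and no alignment takes place
df2.loc[mask, cols] = lookup.reindex(index=keys[mask]).to_numpy()

print(df2)
```

Because the replacement values are reordered to follow df2 before assignment, the rows of df1 do not need to appear in the same order as their matches in df2.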
CodePudding user response:
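Here is an approach using the np.where() function from numpy:
import numpy as np
# isin() with a DataFrame argument compares cells at matching index and column
# labels, so the mask is True only where the matching rows share a row index
mask = df2[["label1", "label2"]].isin(df1[["label1", "label2"]]).all(axis=1)
# .values[0] takes the values from the first row of df1 only
df2["beta1"] = np.where(mask, df1["beta1"].values[0], df2["beta1"])
df2["beta2"] = np.where(mask, df1["beta2"].values[0], df2["beta2"])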
print(df2)
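If df1 has more than one row, or the matching rows sit at different index positions, the np.where() idea can be combined with a left merge. This is a sketch under the assumption that each label pair appears at most once in df1; `merged` and `matched` are illustrative names, and the data comes from the question's tables.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"label1": [123], "label2": [456],
                    "beta1": [1], "beta2": [2]})
df2 = pd.DataFrame({"label1": [892, 541, 123],
                    "label2": [531, 414, 456],
                    "beta1": [0.01, 0.04, 0.02],
                    "beta2": [0.002, 0.03, 0.05]})

idx, cols = ["label1", "label2"], ["beta1", "beta2"]

# A left merge keeps every row of df2 in its original order;
# indicator=True adds a "_merge" column saying where each row was found
merged = df2[idx].merge(df1[idx + cols], on=idx, how="left", indicator=True)
matched = (merged["_merge"] == "both").to_numpy()

# Take the df1 value where the label pair matched, else keep df2's value
for col in cols:
    df2[col] = np.where(matched, merged[col].to_numpy(), df2[col].to_numpy())

print(df2)
```

Working with plain arrays on the right-hand side sidesteps index alignment, so the result does not depend on how either frame is indexed.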
You can also use the df.update() and .isin() methods:
(df2.update(df2[df2[["label1", "label2"]].isin(df1[["label1", "label2"]]
.to_dict(orient="records")).all(axis=1)]
.assign(beta1=df1["beta1"], beta2=df1["beta2"])))
print(df2)
   label1  label2  beta1  beta2
0     123     456   1.00   2.00
1     541     414   0.04   0.03
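A shorter variant of the update() idea, offered as a sketch rather than a definitive answer, is to index both frames by the label columns so that update() aligns on the label pair instead of the row position. It assumes the label pairs are unique in both frames, as the question states; `updated` is an illustrative name.

```python
import pandas as pd

df1 = pd.DataFrame({"label1": [123], "label2": [456],
                    "beta1": [1], "beta2": [2]})
df2 = pd.DataFrame({"label1": [892, 541, 123],
                    "label2": [531, 414, 456],
                    "beta1": [0.01, 0.04, 0.02],
                    "beta2": [0.002, 0.03, 0.05]})

idx = ["label1", "label2"]

# With (label1, label2) as the index, update() matches rows by label pair
updated = df2.set_index(idx)
updated.update(df1.set_index(idx))  # modifies `updated` in place

# Turn the labels back into ordinary columns
df2 = updated.reset_index()

print(df2)
```

Two caveats: update() skips NaN values in df1 rather than copying them, and reset_index() replaces df2's original row index with a fresh 0..n-1 range.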