I have 2 dataframes df1
and df2
in which df1
contains only 0. Let's call the sets of columns in df1
and df2
by col1
and col2
respectively. For each column name in col2
, if it belongs to col1
, I will replace the corresponding column in df1
by its counterpart in df2
.
In my example, df1
is
and df2
is
Then my desired result is
To do so, I use loop through each column name in col2
. However, Python returns a warning
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling
frame.insert
many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, usenewframe = frame.copy()
Could you elaborate on a more efficient method?
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.array([[0, 0, 0], [0, 0, 0]]),
columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[0, -2, 3], [4, 5, -6]]),
columns=['a', 'c', 'd'])
for c in df2.columns:
if c in df1.columns:
df1[c] = df2[c]
CodePudding user response:
You can use DataFrame.update
to update existing dataframe. Since, df2
has extra columns that are not in df1
DataFrame.update
adds them too. To update only correct columns/index we can use pd.Index.intersection
here.
cols = df2.columns.intersection(df1.columns)
idx = df2.index.intersection(df1.index)
df1.update(df2.loc[idx, cols])
print(df1)
a b c
0 0 0 -2
1 4 0 5
CodePudding user response:
I believe by replacing
df1[c] = df2[c]
with
df1[c][:] = df2[c]
you can ensure that you're overwriting the column in-place, rather than inserting a new column with the same name and removing the original one. This should prevent fragmentation. This only works if all your data has the same type though.