First of all I'll make an example of the result I want to obtain. I initially have two DataFrames with, in general different column indexes and row indexes and eventually with different rows and columns number (even if in the example below are both 3x3):
Dataframe1 | Dataframe2
A B C | B D F
A x x x | A y y y
D x x x | B y y y
E x x x | E y y y
And I want the following result:
Result
A B C D F
A x y x y y
B - y - y y
D x x x - -
E x y x y y
Note the solution has the following characteristics:
- the resulting data frame contains all the rows and columns of both
dataframe1
anddataframe2
- where values overlap,
dataframe2
updatesdataframe1
(e.g. in positions [BA] and [BE] of result data frame, where in the same poistions indataframe1
there wasx
, now there'sy
) - Where missing values occur (here replaced by a dash
-
) a default value is inserted (likeNaN
) - The indexes names are preserved in the result table (but the alphabetical sorting is not necessary)
My questions are:
- Is it possible to do that with pandas? If yes, how? I've tried many different things but none of them worked 100%, but since I'm very new to pandas I might not know the proper way to do it.
- If pandas is not the right or easiest way to do it, is there another way that you would recommend for doing that (maybe using matrices, dictionaries, ...)
Thank you.
CodePudding user response:
try this:
data1 = {'A': {'A': 'x', 'D': 'x', 'E': 'x'},
'B': {'A': 'x', 'D': 'x', 'E': 'x'},
'C': {'A': 'x', 'D': 'x', 'E': 'x'}}
df1 = pd.DataFrame(data1)
print(df1)
>>>
A B C
A x x x
D x x x
E x x x
data2 = {'B': {'A': 'y', 'B': 'y', 'E': 'y'},
'D': {'A': 'y', 'B': 'y', 'E': 'y'},
'F': {'A': 'y', 'B': 'y', 'E': 'y'}}
df2 = pd.DataFrame(data2)
print(df2)
>>>
B D F
A y y y
B y y y
E y y y
res = df1.combine_first(df2)
print(res)
>>>
A B C D F
A x y x y y
B NaN y NaN y y
D x x x NaN NaN
E x y x y y
CodePudding user response:
try another:
cols = df1.columns.append(df2.columns).unique().sort_values()
idx = df1.index.append(df2.index).unique().sort_values()
res = df1.reindex(index=idx, columns=cols)
res.update(df2)
print(res)
>>>
A B C D F
A x y x y y
B NaN y NaN y y
D x x x NaN NaN
E x y x y y