I have two data frames as follow.
df1 :
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
1 | 2 | 3 | x | y | z |
df2 :
A | B | C | G | H | I | |
---|---|---|---|---|---|---|
4 | 5 | 6 | p | q | r |
I wanted to make three data frames using these two like this.
ndf1 :
A | B | C | ID | |
---|---|---|---|---|
1 | 2 | 3 | id_0 | |
4 | 5 | 6 | id_1 |
ndf2 :
ID | D | E | F | |
---|---|---|---|---|
id_0 | x | y | z |
ndf3 :
ID | G | H | I | |
---|---|---|---|---|
id_1 | p | q | r |
- ndf1 contains unique keys of both df1 and df2.
- ndf2 contains data from df1
- ndf3 contains data from df2
My questions are, Is this workflow possible with Python and Pandas? and What would be the best approach for that?
CodePudding user response:
You could use pd.DataFrame.merge to merge df1 and df2. Then simply drop the unwanted columns from merged df to create ndf1
To create the ndf2 and ndf3 just drop the unwanted A, B, and C columns from df1 and df2 using pd.DataFrame.drop
FYI
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
CodePudding user response:
I will differ a bit from your formulation, in order to:
- follow the convention of Unix/Linux's
comm
command. - use the index of
df1
, resp.df2
, to indicate where the rows come from (instead of creatingid_0
etc.) If necessary, the index ofdf1
anddf2
can be set to uniquely indicate provenance.
With this, you can do:
ix1 = df1.columns.difference(df2.columns) # unique to df1
ix2 = df2.columns.difference(df1.columns) # unique to df2
ix3 = df1.columns.intersection(df2.columns) # common
u1 = df1[ix1]
u2 = df2[ix2]
common = pd.concat([df1[ix3], df2[ix3]])
On your example data:
>>> u1
D E F
0 x y z
>>> u2
G H I
0 p q r
>>> common
A B C
0 1 2 3
0 4 5 6
Addendum
Here is an example of how to modify the index of df1
and df2
to indicate unambiguously provenance:
index1 = [f'df1_{i}' for i in df1.index]
index2 = [f'df2_{i}' for i in df2.index]
u1 = df1.set_axis(index1)[ix1]
u2 = df2.set_axis(index2)[ix2]
common = pd.concat([df1.set_axis(index1)[ix3], df2.set_axis(index2)[ix3]])
And now:
>>> u1
D E F
df1_0 x y z
>>> u2
G H I
df2_0 p q r
>>> common
A B C
df1_0 1 2 3
df2_0 4 5 6