Pandas Data Processing-CodePudding

I have two data frames as follow.

df1 :

	A	B	C	D	E	F
	1	2	3	x	y	z

df2 :

	A	B	C	G	H	I
	4	5	6	p	q	r

I wanted to make three data frames using these two like this.

ndf1 :

	A	B	C	ID
	1	2	3	id_0
	4	5	6	id_1

ndf2 :

	ID	D	E	F
	id_0	x	y	z

ndf3 :

	ID	G	H	I
	id_1	p	q	r

ndf1 contains unique keys of both df1 and df2.
ndf2 contains data from df1
ndf3 contains data from df2

My questions are, Is this workflow possible with Python and Pandas? and What would be the best approach for that?

CodePudding user response：

You could use pd.DataFrame.merge to merge df1 and df2. Then simply drop the unwanted columns from merged df to create ndf1

To create the ndf2 and ndf3 just drop the unwanted A, B, and C columns from df1 and df2 using pd.DataFrame.drop

FYI

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html

CodePudding user response：

I will differ a bit from your formulation, in order to:

follow the convention of Unix/Linux's comm command.
use the index of df1, resp. df2, to indicate where the rows come from (instead of creating id_0 etc.) If necessary, the index of df1 and df2 can be set to uniquely indicate provenance.

With this, you can do:

ix1 = df1.columns.difference(df2.columns)  # unique to df1
ix2 = df2.columns.difference(df1.columns)  # unique to df2
ix3 = df1.columns.intersection(df2.columns)  # common

u1 = df1[ix1]
u2 = df2[ix2]
common = pd.concat([df1[ix3], df2[ix3]])

On your example data:

>>> u1
   D  E  F
0  x  y  z

>>> u2
   G  H  I
0  p  q  r

>>> common
   A  B  C
0  1  2  3
0  4  5  6

Addendum

Here is an example of how to modify the index of df1 and df2 to indicate unambiguously provenance:

index1 = [f'df1_{i}' for i in df1.index]
index2 = [f'df2_{i}' for i in df2.index]

u1 = df1.set_axis(index1)[ix1]
u2 = df2.set_axis(index2)[ix2]
common = pd.concat([df1.set_axis(index1)[ix3], df2.set_axis(index2)[ix3]])

And now:

>>> u1
       D  E  F
df1_0  x  y  z

>>> u2
       G  H  I
df2_0  p  q  r

>>> common
       A  B  C
df1_0  1  2  3
df2_0  4  5  6