Home > database >  how to render distinct columns/rows by comparing two dataframes in pandas?
how to render distinct columns/rows by comparing two dataframes in pandas?

Time:02-25

I have two dataframes but they have more common columns and few distinct columns that only appeared in one of dataframe. I want to print out those distinct columns and common columns so can have better idea what columns are changed in another dataframe. I got some interesting post on SO but don't know why I got an error. I have two dataframes which has following shape:

df19.shape
  (39831, 1952)

df20.shape
(39821, 1962)

here is dummy data:

df1 = pd.DataFrame([[1, 2], [1, 3], [4, 6],[11,13],[10,19],[21,23]], columns=['A', 'B'])
df2 = pd.DataFrame([[3, 4,0,7], [1, 3,9,2], [4, 6,3,8],[8,5,1,6]], columns=['A', 'B','C','D'])

current attempt

I came across SO and tried following:

res=pd.concat([df19, df20]).loc[df19.index.symmetric_difference(df20.index)]
res.shape
(10, 1984)

this gave me distinct rows but not distinct columns.I also tried this one but gave me error:

df19.compare(df20, keep_equal=True, keep_shape=True)

how should I render distinct rows and columns by comparing two dataframes in pandas? Does anyone knows of doing this easily in python? Any quick thoughts? Thanks

objective

I simply want to render distinct rows or columns to compare two dataframe by column name or what distinct rows that it has. for instance, compared to df1, what columns are newly added to df2; similarly what rows are added to df2 and so on. Any idea?

CodePudding user response:

I would recommend getting the columns by filtering by the name of the columns.

common = [i for i in list(df1) if i in list(df2)]
temp = df2[common]

distinct = [i for i in list(df2) if i not in list(df1)]
temp = df2[distinct]

CodePudding user response:

Thanks to @Shaido, this one worked for me:

import pandas as pd

df1=pd.read_csv(data1)
df2=pd.read_csv(data2)

df1_cols = df1.columns
df2_cols = df2.columns

common_cols = df1_cols.intersection(df2_cols)
df2_not_df1 = df2_cols.difference(df1_cols)
  • Related