How do I compare two data frames to show the difference?-CodePudding

I am trying to compare two excel file that are different size, one has 5701 row and the other has 5904 row. The columns are Price and Project Description. I am trying to compare by text to see what the project difference are.


import pandas as pd
import numpy as np

df = pd.read_csv('C:/Users/Text/Downloads/D1.csv')
df2 = pd.read_csv('C:/Users/Text/Downloads/D2.csv')


df['text_match'] = np.where(df['Project ID'] == df2['Project ID'], 'True', 'False')
print(df.loc[(df['text_match'] == 'False')])

I get the following error when I try to run the code:

raise ValueError("Can only compare identically-labeled Series objects")
ValueError: Can only compare identically-labeled Series objects

CodePudding user response：

To compare 2 dfs - the numbers of records must be the same:

You can use:

df1.equals(df2)

then get - false, there isn't option to compare.

There is option to use:

df1.reset_index(drop=True) == df2.reset_index(drop=True)

There is another option - cut the second df to (5701 rows)

# Number of rows to drop
n = 203
 
# Removing last n rows
df2 = df2.iloc[:-n]

CodePudding user response：

You can use the Pandas .compare() function, as follows:

df.compare(df2)

Sample output is as follows:

  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0

It highlight the differences with self for df and other for df2

You can just compare part of the columns by e.g.

df[['Project ID', 'Price']].compare(df2[['Project ID', 'Price']])

Or, if you just want to compare only the Project ID column:

df['Project ID'].compare(df2['Project ID'])

Another method is to try filtering for unmatched Project ID, using .isin(), as follows:

df.loc[~df['Project ID'].isin(df2['Project ID'])]

and:

df2.loc[~df2['Project ID'].isin(df['Project ID'])]