I am trying to compare two excel file that are different size, one has 5701 row and the other has 5904 row. The columns are Price and Project Description. I am trying to compare by text to see what the project difference are.
import pandas as pd
import numpy as np
df = pd.read_csv('C:/Users/Text/Downloads/D1.csv')
df2 = pd.read_csv('C:/Users/Text/Downloads/D2.csv')
df['text_match'] = np.where(df['Project ID'] == df2['Project ID'], 'True', 'False')
print(df.loc[(df['text_match'] == 'False')])
I get the following error when I try to run the code:
raise ValueError("Can only compare identically-labeled Series objects")
ValueError: Can only compare identically-labeled Series objects
CodePudding user response:
To compare 2 dfs - the numbers of records must be the same:
You can use:
df1.equals(df2)
then get - false, there isn't option to compare.
There is option to use:
df1.reset_index(drop=True) == df2.reset_index(drop=True)
There is another option - cut the second df to (5701 rows)
# Number of rows to drop
n = 203
# Removing last n rows
df2 = df2.iloc[:-n]
CodePudding user response:
You can use the Pandas .compare()
function, as follows:
df.compare(df2)
Sample output is as follows:
col1 col3
self other self other
0 a c NaN NaN
2 NaN NaN 3.0 4.0
It highlight the differences with self
for df
and other
for df2
You can just compare part of the columns by e.g.
df[['Project ID', 'Price']].compare(df2[['Project ID', 'Price']])
Or, if you just want to compare only the Project ID
column:
df['Project ID'].compare(df2['Project ID'])
Another method is to try filtering for unmatched Project ID
, using .isin()
, as follows:
df.loc[~df['Project ID'].isin(df2['Project ID'])]
and:
df2.loc[~df2['Project ID'].isin(df['Project ID'])]