The following code compares two dataframes and marks the differences (the data is made up here, standing in for data imported from Excel):
import pandas as pd
import numpy as np
a = {'A': ['1',2,'3',4,'5'], 'B' : ['abcd', 'efgh', 'ijkl', 'uhyee', 'uhuh'], 'C' : ['jamba','refresh','portobello','performancehigh','jackalack']}
a = pd.DataFrame(a)
b = {'A': ['1',2,'3',4,'5'], 'Z' : ['dah', 'fupa', 'ijkl', 'danju', 'uhuh'], 'C' : ['jamba','dimez','pocketfresh','reverbb','jackalack']}
b = pd.DataFrame(b)
comparevalues = a.values == b.values
rows,cols = np.where(comparevalues == False)
for item in zip(rows,cols):
    a.iloc[item[0],item[1]] = ' {} --> {} '.format(a.iloc[item[0],item[1]], b.iloc[item[0],item[1]])
However, as soon as I extend dataframe b by another row, the code breaks:
b = {'A': ['1',2,'3',4,'5', 6], 'Z' : ['dah', 'fupa', 'ijkl', 'danju', 'uhuh', 'freshhhhhhh'], 'C' : ['jamba','dimez','pocketfresh','reverbb','jackalack', 'boombackimmatouchit']}
b = pd.DataFrame(b)
How do I still compare these two data frames for differences?
CodePudding user response:
You could define a helper function that pads the shorter dataframe so that both have the same number of rows:
def equalize_length(short, long):
    # Append placeholder rows (the string "nan") to `short`
    # until it has as many rows as `long`.
    return pd.concat(
        [
            short,
            pd.DataFrame(
                {
                    col: ["nan"] * (long.shape[0] - short.shape[0])
                    for col in short.columns
                }
            ),
        ]
    ).reset_index(drop=True)
And then, in your code:
if a.shape[0] <= b.shape[0]:
    a = equalize_length(a, b)
else:
    b = equalize_length(b, a)
comparevalues = a.values == b.values
rows, cols = np.where(comparevalues == False)
for item in zip(rows, cols):
    a.iloc[item[0], item[1]] = " {} --> {} ".format(
        a.iloc[item[0], item[1]], b.iloc[item[0], item[1]]
    )
print(a) # with 'a' being shorter than 'b'
# Output
           A                    B                            C
0          1         abcd --> dah                        jamba
1          2        efgh --> fupa            refresh --> dimez
2          3                 ijkl   portobello --> pocketfresh
3          4      uhyee --> danju  performancehigh --> reverbb
4          5                 uhuh                    jackalack
5  nan --> 6  nan --> freshhhhhhh  nan --> boombackimmatouchit
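If you would rather not overwrite a in place, or pad with the string "nan", a variant of the same idea is sketched below. It is only a sketch under a few assumptions: the function name mark_differences and its arrow parameter are made up for this example, both frames are assumed to have the same number of columns, and cells are still compared by position (not by column name), as in the code above. The shorter frame is padded with real NaN values via reindex, and the result is returned as a new frame.

```python
import numpy as np
import pandas as pd


def mark_differences(left, right, arrow=" --> "):
    """Return a new frame with 'left --> right' wherever two cells differ.

    Hypothetical helper: cells are compared by position and the column
    names of `left` are kept. Both frames need the same number of columns.
    """
    n_rows = max(len(left), len(right))
    # Reset the index so rows line up by position, then pad with NaN rows.
    left_p = left.reset_index(drop=True).reindex(range(n_rows)).astype(object)
    right_p = right.reset_index(drop=True).reindex(range(n_rows)).astype(object)

    lv = left_p.to_numpy()
    rv = right_p.to_numpy()
    differs = lv != rv
    # NaN never equals NaN, so do not flag cells that are missing on both sides.
    differs &= ~(pd.isna(lv) & pd.isna(rv))

    result = left_p.copy()
    for r, c in zip(*np.where(differs)):
        result.iat[r, c] = f"{lv[r, c]}{arrow}{rv[r, c]}"
    return result


# Usage with the frames from the question (b has one extra row).
a = pd.DataFrame({
    'A': ['1', 2, '3', 4, '5'],
    'B': ['abcd', 'efgh', 'ijkl', 'uhyee', 'uhuh'],
    'C': ['jamba', 'refresh', 'portobello', 'performancehigh', 'jackalack'],
})
b = pd.DataFrame({
    'A': ['1', 2, '3', 4, '5', 6],
    'Z': ['dah', 'fupa', 'ijkl', 'danju', 'uhuh', 'freshhhhhhh'],
    'C': ['jamba', 'dimez', 'pocketfresh', 'reverbb', 'jackalack',
          'boombackimmatouchit'],
})
out = mark_differences(a, b)
print(out)
```

One thing to keep in mind: if a purely numeric column has to be padded, reindex turns its integers into floats, so a value such as 6 may show up as 6.0 in the marked cells.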