How to find differences between two dataframes of different lengths?

Time:06-06

The following code compares two dataframes (sample data standing in for an Excel import) and marks each cell that differs:

import pandas as pd
import numpy as np

a = {'A': ['1',2,'3',4,'5'], 'B' : ['abcd', 'efgh', 'ijkl', 'uhyee', 'uhuh'], 'C' : ['jamba','refresh','portobello','performancehigh','jackalack']}
a = pd.DataFrame(a)

b = {'A': ['1',2,'3',4,'5'], 'Z' : ['dah', 'fupa', 'ijkl', 'danju', 'uhuh'], 'C' : ['jamba','dimez','pocketfresh','reverbb','jackalack']}
b = pd.DataFrame(b)

comparevalues = a.values == b.values

rows,cols = np.where(comparevalues == False)

for item in zip(rows,cols):
    a.iloc[item[0],item[1]] = ' {} --> {} '.format(a.iloc[item[0],item[1]], b.iloc[item[0],item[1]])

However, as soon as I extend dataframe b by another row, the code breaks:

b = {'A': ['1',2,'3',4,'5', 6], 'Z' : ['dah', 'fupa', 'ijkl', 'danju', 'uhuh', 'freshhhhhhh'], 'C' : ['jamba','dimez','pocketfresh','reverbb','jackalack', 'boombackimmatouchit']}
b = pd.DataFrame(b)

How do I still compare these two data frames for differences?

CodePudding user response:

You could define a helper function that pads the shorter dataframe with placeholder rows (the string "nan") until both dataframes have the same length:

def equalize_length(short, long):
    return pd.concat(
        [
            short,
            pd.DataFrame(
                {
                    col: ["nan"] * (long.shape[0] - short.shape[0])
                    for col in short.columns
                }
            ),
        ]
    ).reset_index(drop=True)

And then, in your code:

if a.shape[0] <= b.shape[0]:
    a = equalize_length(a, b)
else:
    b = equalize_length(b, a)

# Both frames now have the same number of rows, so this compares cell by cell.
comparevalues = a.values == b.values

# Positions (row, column) of every cell that differs.
rows, cols = np.where(~comparevalues)

for row, col in zip(rows, cols):
    a.iloc[row, col] = " {} --> {} ".format(a.iloc[row, col], b.iloc[row, col])
print(a)  # with 'a' being shorter than 'b'
# Output
             A                      B                              C
0            1          abcd --> dah                           jamba
1            2         efgh --> fupa              refresh --> dimez
2            3                   ijkl    portobello --> pocketfresh
3            4       uhyee --> danju    performancehigh --> reverbb
4            5                   uhuh                      jackalack
5   nan --> 6    nan --> freshhhhhhh    nan --> boombackimmatouchit
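
As a possible variation on the approach above, the padding and the comparison can be folded into one function that leaves both inputs untouched. This is only a sketch: the name mark_differences and its fill parameter are made up for illustration. It assumes both dataframes have the same number of columns and, like the original code, compares them by position rather than by column name.

```python
import numpy as np
import pandas as pd


def mark_differences(left, right, fill="nan"):
    """Return a new frame based on `left`; cells that differ read 'old --> new'.

    Hypothetical helper: assumes equal column counts, compared by position.
    """
    n_rows = max(len(left), len(right))

    def pad(df):
        # Object array pre-filled with the placeholder, then overwritten
        # with the frame's real values in the top rows.
        values = np.full((n_rows, df.shape[1]), fill, dtype=object)
        values[: len(df)] = df.to_numpy(dtype=object)
        return values

    old, new = pad(left), pad(right)
    result = pd.DataFrame(old.copy(), columns=left.columns)

    # Visit every position where the padded arrays differ.
    for row, col in zip(*np.where(old != new)):
        result.iat[row, col] = "{} --> {}".format(old[row, col], new[row, col])
    return result


# Small made-up frames of different lengths.
a = pd.DataFrame({"A": ["1", 2], "B": ["abcd", "efgh"]})
b = pd.DataFrame({"A": ["1", 2, 6], "Z": ["dah", "efgh", "x"]})
print(mark_differences(a, b))
```

Since it does not matter which argument is shorter, the if/else around equalize_length is not needed here. Note that real NaN values in the data would always be reported as different, because NaN never compares equal to itself.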