Finding differences between dataframes while iterating a dataframe index

Time:04-14

I'm trying to compare two dataframes that hold the same kind of values but have different column labels.

import pandas as pd

data1 = {'foo': [1,2,3], 'bar': [4,5,6]}
data2 = {'foo': [1,2,50], 'boo': [4,5,6]}
data1 = pd.DataFrame(data1)
data2 = pd.DataFrame(data2)

I'm trying to write a script that compares the values of the two dataframes row by row, based on the index.

The desired output is:

same
same
different

same
same
same

Since I need to keep the different column names, here is my code:

#Input

for row in data1.index:
    if data1.iloc[row, 0] == data2.iloc[row, 0]:
        print('same')
    else:
        print('different')

While I can successfully iterate through the foo column, I'm still unable to iterate through bar and boo.

#Output 

same
same
different

If I drop the column position and compare data1.iloc[row] with data2.iloc[row] directly, it throws: ValueError: Can only compare identically-labeled Series objects
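A minimal sketch of why that error appears, using the same data1 and data2 as above: each row selected with iloc is a Series labelled by the column names, and pandas refuses to compare two Series whose labels differ. The variable names row1, row2, raised and positional are only for illustration.

```python
import pandas as pd

data1 = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, 6]})
data2 = pd.DataFrame({'foo': [1, 2, 50], 'boo': [4, 5, 6]})

row1 = data1.iloc[0]  # Series labelled ['foo', 'bar']
row2 = data2.iloc[0]  # Series labelled ['foo', 'boo']

# Comparing Series with different labels raises ValueError
try:
    row1 == row2
    raised = False
except ValueError:
    raised = True

# Comparing the underlying arrays is purely positional, so labels do not matter
positional = (row1.to_numpy() == row2.to_numpy()).tolist()
print(raised, positional)
```

Converting to arrays drops the labels entirely, so this only makes sense when both rows have the same length and column order.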

Do you have suggestions? Thanks

CodePudding user response:

Assuming the dimensions match, one way is to use numpy.where:

import numpy as np

out = np.where(data1.to_numpy() == data2.to_numpy(), 'same','different').T.tolist()

Output:

[['same', 'same', 'different'], ['same', 'same', 'same']]

You could also construct a DataFrame:

out = pd.DataFrame(np.where(data1.to_numpy() == data2.to_numpy(), 'same','different'), columns=['first_column','second_column'])

Output:

  first_column second_column
0         same          same
1         same          same
2    different          same
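If staying in pandas is preferred, one possible alternative (not part of the answer above) is to relabel a copy of data2 with data1's column names and compare with DataFrame.eq. This is a sketch that assumes both frames have the same shape and that columns correspond by position; the names aligned, mask and out are illustrative.

```python
import pandas as pd

data1 = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, 6]})
data2 = pd.DataFrame({'foo': [1, 2, 50], 'boo': [4, 5, 6]})

# Relabel a copy of data2 so its columns match data1 by position;
# data2 itself keeps its original column names
aligned = data2.set_axis(data1.columns, axis=1)

# Element-wise equality, now that the labels match
mask = data1.eq(aligned)

# Turn True/False into the requested text labels, column by column
out = mask.apply(lambda col: col.map({True: 'same', False: 'different'}))
print(out)
```

The result reuses data1's column names; the original data2 is left untouched.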

CodePudding user response:

Try using the underlying NumPy array, which ignores the column labels:

# Compare the two dataframes column by column, by position
for i in range(data1.shape[1]):
    print("compare column", i)
    print(data1.iloc[:,i].to_numpy() == data2.iloc[:,i].to_numpy())
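To get the exact output asked for in the question (one same/different line per row, with a blank line between columns), the positional comparison can be wrapped in a small helper. This is only a sketch: compare_by_position is a hypothetical name, and it assumes both dataframes have the same shape.

```python
import pandas as pd

data1 = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, 6]})
data2 = pd.DataFrame({'foo': [1, 2, 50], 'boo': [4, 5, 6]})

def compare_by_position(left, right):
    """Return one list of 'same'/'different' labels per column position."""
    if left.shape != right.shape:
        raise ValueError("dataframes must have the same shape")
    # Boolean array; column labels are ignored
    equal = left.to_numpy() == right.to_numpy()
    return [
        ['same' if match else 'different' for match in equal[:, col]]
        for col in range(equal.shape[1])
    ]

labels = compare_by_position(data1, data2)
for column_labels in labels:
    print('\n'.join(column_labels))
    print()  # blank line between columns
```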