I'm trying to compare items in two dataframes that hold the same kind of values but have different column labels.
import pandas as pd
data1 = {'foo': [1, 2, 3], 'bar': [4, 5, 6]}
data2 = {'foo': [1, 2, 50], 'boo': [4, 5, 6]}
data1 = pd.DataFrame(data1)
data2 = pd.DataFrame(data2)
I'm trying to write a script that compares the values of the two dataframes based on the index.
The desired output is:
same
same
different
same
same
same
Since I need to keep the different column names, here is my code:
# Input
for row in data1.index:
    if data1.iloc[row, 0] == data2.iloc[row, 0]:
        print('same')
    else:
        print('different')
While I can successfully iterate through the foo column, I'm still unable to iterate through bar and boo.
#Output
same
same
different
If I keep only data1.iloc[row] when slicing (that is, compare whole rows), it throws:
ValueError: Can only compare identically-labeled Series objects
Do you have suggestions? Thanks
CodePudding user response:
Assuming the dimensions match, one way is to use numpy.where:
out = np.where(data1.to_numpy() == data2.to_numpy(), 'same','different').T.tolist()
Output:
[['same', 'same', 'different'], ['same', 'same', 'same']]
You could also construct a DataFrame:
out = pd.DataFrame(np.where(data1.to_numpy() == data2.to_numpy(), 'same','different'), columns=['first_column','second_column'])
Output:
  first_column second_column
0         same          same
1         same          same
2    different          same
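To get the one-per-line output asked for in the question, the labelled array can be flattened column by column. This is a minimal sketch, assuming both frames have the same shape; the names `labels` and `flat` are only illustrative and not part of the answer above.

```python
import numpy as np
import pandas as pd

data1 = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, 6]})
data2 = pd.DataFrame({'foo': [1, 2, 50], 'boo': [4, 5, 6]})

# Compare by position; to_numpy() drops the column labels,
# so 'bar' and 'boo' are compared without a label mismatch.
labels = np.where(data1.to_numpy() == data2.to_numpy(), 'same', 'different')

# Flatten in column-major order so the first column's results
# come first, followed by the second column's results.
flat = labels.flatten(order='F').tolist()
print('\n'.join(flat))
```

Flattening with order='F' walks down each column in turn, which matches the order of the desired output in the question.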
CodePudding user response:
Try using the underlying numpy array, which ignores the columns:
for i in range(data1.shape[1]):
    print("compare column", i)
    # to_numpy() discards the labels, so the comparison is purely positional
    print(data1.iloc[:, i].to_numpy() == data2.iloc[:, i].to_numpy())
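The loop above prints boolean arrays. If the 'same'/'different' lines from the question are wanted instead, a possible variation is sketched below, assuming both frames have the same number of rows and columns; the `results` list is a hypothetical helper added only for illustration.

```python
import pandas as pd

data1 = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, 6]})
data2 = pd.DataFrame({'foo': [1, 2, 50], 'boo': [4, 5, 6]})

results = []
# Walk the columns by position, so differing labels do not matter.
for i in range(data1.shape[1]):
    left = data1.iloc[:, i].to_numpy()
    right = data2.iloc[:, i].to_numpy()
    # Compare the two columns value by value.
    for a, b in zip(left, right):
        results.append('same' if a == b else 'different')

print('\n'.join(results))
```

This keeps the original column names on both dataframes and avoids the ValueError, because plain NumPy arrays are compared rather than labelled Series.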