Compare two Pandas columns row by row

Time:11-30

I want to compare two columns row by row. If the values are the same, the result should be True; if not, False. Like this:

filtering         lemmatization     check
[hello, world]    [hello, world]    True
[grape, durian]   [apple, grape]    False

My code outputs False for every row, even though some rows match and others differ. Why?

You can get my data from GitHub.

import pandas as pd

dc = pd.read_excel('./data clean (spaCy).xlsx')
dc['check'] = dc['filtering'].equals(dc['lemmatization'])

CodePudding user response:

The two columns are stored differently: in one of them the strings have no quotes ('') around them. One possible solution is to convert both columns to lists and then compare them with Series.eq (which works like ==):

import ast
import pandas as pd

dc = pd.read_excel('data clean (spaCy).xlsx')

# no quotes around the strings here, so strip the surrounding [] and split on ', '
dc['filtering'] = dc['filtering'].str.strip('[]').str.split(', ')
# the strings are quoted here, so ast.literal_eval can parse each value into a list
dc['lemmatization'] = dc['lemmatization'].apply(ast.literal_eval)

# compare row by row
dc['check'] = dc['filtering'].eq(dc['lemmatization'])
print(dc.head())
   label                                          filtering  \
0      2                                         [ppkm, ya]   
1      2  [mohon, informasi, pgs, pasar, turi, ppkm, buk...   
2      2                                      [rumah, ppkm]   
3      1  [pangkal, penanganan, pandemi, indonesia, terk...   
4      1                              [ppkm, mikro, anjing]   

                                       lemmatization  check  
0                                         [ppkm, ya]   True  
1  [mohon, informasi, pgs, pasar, turi, ppkm, buk...   True  
2                                      [rumah, ppkm]   True  
3  [pangkal, tangan, pandemi, indonesia, kesan, s...  False  
4                              [ppkm, mikro, anjing]   True  

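The reason every row is False: Series.equals returns a single scalar that says whether the two Series are equal as a whole (here False), and assigning that scalar to a column copies it to every row.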
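To make the difference between the two methods concrete, here is a minimal sketch. It uses a small hand-made DataFrame as a hypothetical stand-in for the Excel file, with the values taken from the example in the question, and assumes (as described above) that only the lemmatization column has quoted strings. The column name check_equals is made up for illustration.

```python
import ast
import pandas as pd

# Hypothetical stand-in for the Excel data: both columns hold text,
# but only 'lemmatization' has quotes around each token.
dc = pd.DataFrame({
    "filtering": ["[hello, world]", "[grape, durian]"],
    "lemmatization": ["['hello', 'world']", "['apple', 'grape']"],
})

# Series.equals compares the two columns as a whole and returns one bool.
whole = dc["filtering"].equals(dc["lemmatization"])

# Assigning that single bool to a column repeats it on every row.
dc["check_equals"] = whole

# Parse both columns into real lists, then compare element-wise with Series.eq.
left = dc["filtering"].str.strip("[]").str.split(", ")
right = dc["lemmatization"].apply(ast.literal_eval)
dc["check"] = left.eq(right)

print(dc[["check_equals", "check"]])
```

In this sketch check_equals is False on both rows, while check is True for the first row and False for the second, which is the output the question asked for.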
