I want to compare two columns. If the values in a row are the same, the result should be True; if not, it should be False, like this:
filtering | lemmatization | check |
---|---|---|
[hello, world] | [hello, world] | True |
[grape, durian] | [apple, grape] | False |
My code outputs False for every row, even though not all of the rows are actually different. Why?
You can get my data on GitHub.
import pandas as pd
dc = pd.read_excel('./data clean (spaCy).xlsx')
dc['check'] = dc['filtering'].equals(dc['lemmatization'])
CodePudding user response:
The two columns are formatted differently: the filtering column is missing the '' quotes around its strings. A possible solution is to convert both columns to lists and then compare them with Series.eq (which works like ==):
import ast
import pandas as pd
dc = pd.read_excel('data clean (spaCy).xlsx')
# strip the surrounding [] and split on ', '
dc['filtering'] = dc['filtering'].str.strip('[]').str.split(', ')
# the strings are quoted here, so ast.literal_eval can parse them into lists
dc['lemmatization'] = dc['lemmatization'].apply(ast.literal_eval)
# compare row by row
dc['check'] = dc['filtering'].eq(dc['lemmatization'])
print(dc.head())
label filtering \
0 2 [ppkm, ya]
1 2 [mohon, informasi, pgs, pasar, turi, ppkm, buk...
2 2 [rumah, ppkm]
3 1 [pangkal, penanganan, pandemi, indonesia, terk...
4 1 [ppkm, mikro, anjing]
lemmatization check
0 [ppkm, ya] True
1 [mohon, informasi, pgs, pasar, turi, ppkm, buk... True
2 [rumah, ppkm] True
3 [pangkal, tangan, pandemi, indonesia, kesan, s... False
4 [ppkm, mikro, anjing] True
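The reason for False in every row is that Series.equals compares the two Series as a whole and returns a single scalar (here False), which is then assigned to every row of the check column.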
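To see the difference without the Excel file, here is a minimal sketch on a small made-up DataFrame. The two rows are taken from the example table in the question; the assumption is that both columns are read as strings and that only lemmatization has quotes around each word. The names whole, left, right and check_scalar are only for illustration.

```python
import ast
import pandas as pd

# Hypothetical stand-in for the Excel data: both columns hold strings,
# but only 'lemmatization' has quotes around each word (assumption).
dc = pd.DataFrame({
    "filtering": ["[hello, world]", "[grape, durian]"],
    "lemmatization": ["['hello', 'world']", "['apple', 'grape']"],
})

# Series.equals compares the two Series as a whole and returns one bool.
whole = dc["filtering"].equals(dc["lemmatization"])

# Assigning that scalar to a column repeats it in every row.
dc["check_scalar"] = whole

# Convert both columns to real lists, then compare row by row.
left = dc["filtering"].str.strip("[]").str.split(", ")
right = dc["lemmatization"].apply(ast.literal_eval)
dc["check"] = left.eq(right)

print(whole)
print(dc[["check_scalar", "check"]])
```

The check_scalar column reproduces the problem from the question, while check holds one result per row.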