Pandas isin() does not return anything even when the keywords exist in the dataframe

I'd like to search for a list of keywords in a text column and select all rows where the exact keywords exist. I know this question has many duplicates, but I can't understand why the solution is not working in my case.

keywords = ['fake', 'false', 'lie']

df1:

	text
19152	I think she is the Corona Virus....
19154	Boy you hate to see that. I mean seeing how it was contained and all.
19155	Tell her it’s just the fake flu, it will go away in a few days.
19235	Is this fake news?
...	...
20540	She’ll believe it’s just alternative facts.

Expected results: I'd like to select rows that contain the exact keywords in my list ('fake', 'false', 'lie'). For example, in the above df, it should return rows 19155 and 19235.

str.contains()

df1[df1['text'].str.contains("|".join(keywords))]

The problem with str.contains() is that the result is not limited to the exact keywords. For example, it returns sentences with believe (e.g., row 20540) because lie is a substring of "believe"!

pandas.Series.isin

To find the rows including the exact keywords, I used pd.Series.isin:

df1[df1.text.isin(keywords)]
#df1[df1['text'].isin(keywords)]

Even though I see there are matches in df1, it doesn't return anything. Can someone please help me with this? Thanks!

Update:

Answers provided by @Lazyer and @BeRT2me are both correct. I accepted @Lazyer's answer because he posted it sooner. However, I'd prefer @BeRT2me's answer because it's short and simple :)

CodePudding user response：

import re

df[df.text.apply(lambda x: any(i in keywords for i in re.findall(r'\w+', x)))]

Output:

                                                text
2  Tell her it’s just the fake flu, it will go aw...
3                                 Is this fake news?

CodePudding user response：

If text is as follows,

import re
import pandas as pd

df1 = pd.DataFrame()
df1['text'] = [
    "Dear Kellyanne, Please seek the help of Paula White I believe ...",
    "trump saying it was under controll was a lie, ...",
    "Her mouth should hanve been ... All the lies she has told ...",
    "she'll believe ...",
    "I do believe in ...",
    "This value is false ...",
    "This value is fake ...",
    "This song is fakelove ..."
]
keywords = ['misleading', 'fake', 'false', 'lie']

First,

The simple way is this:

df1[df1.text.apply(lambda x: True if pd.Series(x.split()).isin(keywords).sum() else False)]

                      text
5  This value is false ...
6   This value is fake ...

This won't match words like "believe", but it also misses "lie," because of the punctuation attached to the word.

Second,

So, strip the special characters from the text first, like this:

new_text = df1.text.apply(lambda x: re.sub("[^0-9a-zA-Z]+", " ", x))
df1[new_text.apply(lambda x: True if pd.Series(x.split()).isin(keywords).sum() else False)]

Now it catches the word "lie,".

                                                text
1  trump saying it was under controll was a lie, ...
5                            This value is false ...
6                             This value is fake ...

Third,

It still can't catch the word "lies". That can be solved with a library that reduces the different forms of a word to the same base form. You can find how to tokenize here: tokenize-words-in-a-list-of-sentences-python
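As a rough illustration of the idea only, here is a sketch that normalizes each word before the lookup. The helper crude_stem is hypothetical and written just for this example: it strips a single trailing "s", which is far less than a real stemming or lemmatization library does. The sample rows are taken from the text above.

```python
import re
import pandas as pd

keywords = ['misleading', 'fake', 'false', 'lie']
keyword_set = set(keywords)

df1 = pd.DataFrame({'text': [
    "trump saying it was under controll was a lie, ...",
    "Her mouth should hanve been ... All the lies she has told ...",
    "she'll believe ...",
    "This song is fakelove ...",
]})

def crude_stem(word):
    # Hypothetical helper: drops one trailing "s" from words longer than 3 letters.
    # A real stemmer or lemmatizer handles many more word forms.
    return word[:-1] if word.endswith('s') and len(word) > 3 else word

def has_keyword(text):
    # Split on anything that is not a letter or digit, then compare whole words.
    words = re.findall(r'[0-9a-zA-Z]+', text)
    return any(w in keyword_set or crude_stem(w) in keyword_set for w in words)

mask = df1['text'].apply(has_keyword)
print(df1[mask])
```

With this, the row containing "lies" is selected as well, while "believe" and "fakelove" are still skipped.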

CodePudding user response：

I think splitting the text into words and then matching is a better and more straightforward approach, e.g. if the df and keywords are

df = pd.DataFrame({'text': ['lama abc', 'cow def', 'foo bar', 'spam egg']})
keywords = ['foo', 'lama']

df

       text
0  lama abc
1   cow def
2   foo bar
3  spam egg

This should return the correct result:

df.loc[pd.Series(any(word in keywords for word in words) for words in df['text'].str.findall(r'\w+'))]

       text
0  lama abc
2   foo bar

Explanation

First, split df['text'] into words:

splits = df['text'].str.findall(r'\w+')

splits is

0    [lama, abc]
1     [cow, def]
2     [foo, bar]
3    [spam, egg]
Name: text, dtype: object

Then we need to check whether any word in a row appears in the keywords:

# this is the answer for a single row, where `words` is the split list of that row
any(word in keywords for word in words)

# for the entire dataframe, build a Series; `splits` from above holds the word list of every row
rows = pd.Series(any(word in keywords for word in words) for words in splits)
rows

0     True
1    False
2     True
3    False
dtype: bool

Now we can find the correct rows with

df.loc[rows]

       text
0  lama abc
2   foo bar

Be aware that this approach can consume much more memory, since it builds a word list for every row. With huge data sets, this might be a problem.
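If memory is a concern, one possible variation (a sketch, not measured on large data) is to scan each row lazily instead of storing a list per row. The names keyword_set, word_re and has_keyword are introduced here only for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({'text': ['lama abc', 'cow def', 'foo bar', 'spam egg']})
keywords = ['foo', 'lama']

keyword_set = set(keywords)   # set lookup instead of scanning a list
word_re = re.compile(r'\w+')

def has_keyword(text):
    # finditer yields matches one at a time and any() stops at the first hit,
    # so no full word list is built for the row.
    return any(m.group() in keyword_set for m in word_re.finditer(text))

rows = df['text'].map(has_keyword)
print(df.loc[rows])
```

This selects the same rows (0 and 2) as the version above. Only the boolean result is kept per row, not the split words.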

CodePudding user response：

I believe it's because pd.Series.isin() checks whether each whole value in the column is equal to one of the given strings, not whether the string in the column contains a specific word. I just tested this code snippet:

s = pd.Series(['lama abc', 'cow', 'lama', 'beetle', 'lama',
               'hippo'], name='animal')

s.isin(['cow', 'lama'])

As I expected, the first string returns False even though it contains the word 'lama'.
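To make the difference visible, here is the same Series checked both ways. The word-level check is just one possible way to do it, added for comparison.

```python
import pandas as pd

s = pd.Series(['lama abc', 'cow', 'lama', 'beetle', 'lama',
               'hippo'], name='animal')

# isin compares each whole value against the list; there is no substring or word matching.
whole_value = s.isin(['cow', 'lama'])
print(whole_value.tolist())
# [False, True, True, False, True, False]

# For comparison: split each value into words, then test the words.
word_level = s.str.split().apply(lambda words: any(w in ('cow', 'lama') for w in words))
print(word_level.tolist())
# [True, True, True, False, True, False]
```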

Maybe try using regex? See this: searching a word in the column pandas dataframe python
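A minimal sketch of the regex idea, assuming the keywords should match as whole words only: wrap the alternation in \b word boundaries so that 'lie' no longer matches inside "believe". The small df1 below is a stand-in built from sentences in the question.

```python
import re
import pandas as pd

keywords = ['fake', 'false', 'lie']

# Stand-in for df1, reusing rows from the question.
df1 = pd.DataFrame(
    {'text': [
        "I think she is the Corona Virus....",
        "Tell her it’s just the fake flu, it will go away in a few days.",
        "Is this fake news?",
        "She’ll believe it’s just alternative facts.",
    ]},
    index=[19152, 19155, 19235, 20540],
)

# \b matches only at a word boundary; re.escape protects keywords
# that happen to contain regex metacharacters.
pattern = r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b'

matches = df1[df1['text'].str.contains(pattern, regex=True)]
print(matches.index.tolist())
# [19155, 19235]
```

The match is case-sensitive as written; str.contains also accepts case=False if "Fake" should match too.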