I'd like to search for a list of keywords in a text column and select all rows where the exact keywords exist. I know this question has many duplicates, but I can't understand why the solution is not working in my case.
keywords = ['fake', 'false', 'lie']
df1:
text | |
---|---|
19152 | I think she is the Corona Virus.... |
19154 | Boy you hate to see that. I mean seeing how it was contained and all. |
19155 | Tell her it’s just the fake flu, it will go away in a few days. |
19235 | Is this fake news? |
... | ... |
20540 | She’ll believe it’s just alternative facts. |
Expected results: I'd like to select rows that have the exact keywords in my list ('fake', 'false', 'lie). For example, in the above df, it should return rows 19155 and 19235.
str.contains()
df1[df1['text'].str.contains("|".join(keywords))]
The problem with str.contains()
is that the result is not limited to the exact keywords. For example, it returns sentences with believe
(e.g., row 20540) because lie
is a substring of "believe"!
pandas.Series.isin
To find the rows including the exact keywords, I used pd.Series.isin:
df1[df1.text.isin(keywords)]
#df1[df1['text'].isin(keywords)]
Even though I see there are matches in df1, it doesn't return anything. Can someone please help me with this? Thanks!
Update:
Answers provided by @Lazyer and @BeRT2me are both correct. I accepted @lazyer's answer because he posted it sooner. However, I'd prefer @@BeRT2me's answer because its short and simple :)
CodePudding user response:
import re
df[df.text.apply(lambda x: any(i for i in re.findall('\w ', x) if i in keywords))]
Output:
text
2 Tell her it’s just the fake flu, it will go aw...
3 Is this fake news?
CodePudding user response:
If text is as follows,
df1 = pd.DataFrame()
df1['text'] = [
"Dear Kellyanne, Please seek the help of Paula White I believe ...",
"trump saying it was under controll was a lie, ...",
"Her mouth should hanve been ... All the lies she has told ...",
"she'll believe ...",
"I do believe in ...",
"This value is false ...",
"This value is fake ...",
"This song is fakelove ..."
]
keywords = ['misleading', 'fake', 'false', 'lie']
First,
Simple way is this.
df1[df1.text.apply(lambda x: True if pd.Series(x.split()).isin(keywords).sum() else False)]
text
5 This value is false ...
6 This value is fake ...
It'll not catch the words like "believe", but can't catch the words "lie," because of the special letter.
Second,
So if remove a special letter in the text data like
new_text = df1.text.apply(lambda x: re.sub("[^0-9a-zA-Z] ", " ", x))
df1[new_text.apply(lambda x: True if pd.Series(x.split()).isin(keywords).sum() else False)]
Now It can catch the word "lie,".
text
1 trump saying it was under controll was a lie, ...
5 This value is false ...
6 This value is fake ...
Third,
It can't still catch the word lies. It can be solved by using a library that tokenizes to the same verb from a different forms verb. You can find how to tokenize from here(tokenize-words-in-a-list-of-sentences-python
CodePudding user response:
I think splitting words then matching is a better and straightforward approach, e.g. if the df
and keywords
are
df = pd.DataFrame({'text': ['lama abc', 'cow def', 'foo bar', 'spam egg']})
keywords = ['foo', 'lama']
df
text
0 lama abc
1 cow def
2 foo bar
3 spam egg
This should return the correct result
df.loc[pd.Series(any(word in keywords for word in words) for words in df['text'].str.findall(r'\w '))]
text
0 lama abc
2 foo bar
Explaination
First, do words splitting in df['text']
splits = df['text'].str.findall(r'\w ')
splits
is
0 [lama, abc]
1 [cow, def]
2 [foo, bar]
3 [spam, egg]
Name: text, dtype: object
Then we need to find if there exists any word in a row should appear in the keywords
# this is answer for a single row, if words is the split list of that row
any(word in keywords for word in words)
# for the entire dataframe, use a Series, `splits` from above is word split lists for every line
rows = pd.Series(any(word in keywords for word in words) for words in splits)
rows
0 True
1 False
2 True
3 False
dtype: bool
Now we can find the correct rows with
df.loc[rows]
text
0 lama abc
2 foo bar
Be aware this approach could consume much more memory as it needs to generate the split list on each line. So if you have huge data sets, this might be a problem.
CodePudding user response:
I believe it's because pd.Series.isin()
checks if the string is in the column, and not if the string in the column contains a specific word. I just tested this code snippet:
s = pd.Series(['lama abc', 'cow', 'lama', 'beetle', 'lama',
'hippo'], name='animal')
s.isin(['cow', 'lama'])
And as I was thinking, the first string, even containing the word 'lama', returns False.
Maybe try using regex? See this: searching a word in the column pandas dataframe python