I have a dataframe with three columns like this:
index string Result
1 The quick brown fox jumps over the lazy dog
2 fast and furious was a good movie
I also have two lists of words:
list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]
I want to identify strings that have at least one word from list1 AND at least one word from list2, such that the result will be as follows:
index string Result
1 The quick brown fox jumps over the lazy dog TRUE
2 fast and furious was a good movie FALSE
Explanation: The first sentence has words from both lists, so the result is TRUE. The second sentence has a word from list1 only, so the result is FALSE.
Can this be done in Python? I tried search techniques from NLTK, but I don't know how to combine the results from the two lists. Thanks
CodePudding user response:
If your dataframe (with the first two columns) is called df, you can do the following:
df['Result'] = (df['string'].str.contains('|'.join(list1))
& df['string'].str.contains('|'.join(list2)))
The result:
string Result
0 The quick brown fox jumps over the lazy dog True
1 fast and furious was a good movie False
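One caveat: joining the words with '|' gives a regular expression that matches substrings, so "sun" would also match inside "sunny". If whole-word matching is what you want, a minimal sketch is to wrap the alternation in \b word boundaries. The helper names word_pattern and matches_both and the third test sentence are made up for illustration, and plain re is used here so the snippet runs without pandas:

```python
import re

list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]

def word_pattern(words):
    # Hypothetical helper: escape each word, then wrap the alternation
    # in \b ... \b so that only whole words match.
    return r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"

pattern1 = word_pattern(list1)
pattern2 = word_pattern(list2)

def matches_both(text):
    # True only if the text has a whole-word hit from each list
    return bool(re.search(pattern1, text)) and bool(re.search(pattern2, text))

sentences = [
    "The quick brown fox jumps over the lazy dog",
    "fast and furious was a good movie",
    "a sunny day over the hills",  # made-up row: "sunny" should not count as "sun"
]
results = [matches_both(s) for s in sentences]
print(results)
```

The same patterns should also work with the pandas version above, e.g. df['string'].str.contains(pattern1) & df['string'].str.contains(pattern2). The group is non-capturing, (?:...), so that str.contains does not warn about match groups.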
CodePudding user response:
Another option is to split the strings and use set.intersection with all in a list comprehension:
s_lists = [set(list1), set(list2)]
df['Result'] = [all(s_lst.intersection(s.split()) for s_lst in s_lists) for s in df['string'].tolist()]
Output:
index string Result
0 1 The quick brown fox jumps over the lazy dog True
1 2 fast and furious was a good movie False
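Note that s.split() is case-sensitive and keeps punctuation attached, so "dog." or "Quick" would not match. If that matters for your data, a rough sketch is to lowercase the text and pull out word tokens with a regex before intersecting. The helper name has_word_from_each and the sample sentences are made up for illustration, and the word lists are assumed to be lowercase already:

```python
import re

list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]
s_lists = [set(list1), set(list2)]

def has_word_from_each(text, word_sets):
    # Lowercase the text and keep only runs of word characters,
    # which drops punctuation such as the trailing "." in "dog."
    tokens = set(re.findall(r"\w+", text.lower()))
    # A non-empty intersection is truthy, so all() requires a hit in every set
    return all(ws & tokens for ws in word_sets)

sentences = [
    "The Quick brown fox jumps over the lazy dog.",
    "fast and furious was a good movie",
]
results = [has_word_from_each(s, s_lists) for s in sentences]
print(results)
```

On the dataframe this could be applied as df['Result'] = df['string'].apply(lambda s: has_word_from_each(s, s_lists)).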