Identify strings having words from two different lists

Time:04-09

I have a dataframe with three columns like this:

index   string                                         Result
1       The quick brown fox jumps over the lazy dog 
2       fast and furious was a good movie   

And I have two lists of words like this:

list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]

I want to identify strings that have at least one word from list1 AND at least one word from list2, such that the result will be as follows:

index   string                                      Result
1   The quick brown fox jumps over the lazy dog     TRUE
2   fast and furious was a good movie               FALSE

Explanation: The first sentence has words from both lists, so the result is TRUE. The second sentence has a word from list1 ("movie") but none from list2, so the result is FALSE.

Can we do that with Python? I tried search techniques from NLTK, but I don't know how to combine the results from the two lists. Thanks.

CodePudding user response:

If your dataframe (with the first two columns) is called df, you can do the following:

df['Result'] = (
    df['string'].str.contains('|'.join(list1))
    & df['string'].str.contains('|'.join(list2))
)

The result:

                                        string  Result
0  The quick brown fox jumps over the lazy dog    True
1            fast and furious was a good movie   False
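
One caveat: str.contains treats the joined string as a regular expression and matches substrings, so "sun" would also match "sunny" and "movie" would match "moviegoer". If whole-word matching is what you want, a minimal sketch is to wrap the alternation in \b word boundaries. The helper name word_pattern and the third sample row are made up for illustration and are not part of the question:

```python
import re

import pandas as pd

list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]

df = pd.DataFrame({
    "string": [
        "The quick brown fox jumps over the lazy dog",
        "fast and furious was a good movie",
        "a sunny day for a moviegoer",  # hypothetical row: substrings only
    ]
})


def word_pattern(words):
    """Build a regex matching any of `words` as a whole word (hypothetical helper)."""
    # re.escape guards against regex metacharacters inside the words;
    # the non-capturing group (?:...) keeps pandas from warning about groups.
    return r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"


df["Result"] = (
    df["string"].str.contains(word_pattern(list1))
    & df["string"].str.contains(word_pattern(list2))
)
print(df)
```

With plain substring matching the third row would come out True; with the word boundaries it comes out False, because neither "sunny" nor "moviegoer" is a listed word.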

CodePudding user response:

Another option is to split the strings and use set.intersection with all in a list comprehension:

s_lists = [set(list1), set(list2)]
df['Result'] = [all(s_lst.intersection(s.split()) for s_lst in s_lists) for s in df['string'].tolist()]

Output:

   index                                       string  Result
0      1  The quick brown fox jumps over the lazy dog    True
1      2            fast and furious was a good movie   False
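
Note that s.split() only splits on whitespace and is case-sensitive, so "dog." or "Quick" would not match the list entries. If that matters for your data, here is a rough sketch of the same set-intersection idea with lowercasing and a simple \w+ tokenizer. The function names tokens and has_word_from_each are made up for illustration, and the sample strings are altered slightly to show the effect:

```python
import re

list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]


def tokens(text):
    """Lowercase the text and return its set of word tokens (punctuation dropped)."""
    return set(re.findall(r"\w+", text.lower()))


def has_word_from_each(text, *word_lists):
    """True if `text` contains at least one word from every list in `word_lists`."""
    found = tokens(text)
    # A non-empty intersection is truthy, so all() requires a hit in each list.
    return all(found & {w.lower() for w in lst} for lst in word_lists)


strings = [
    "The Quick brown fox jumps over the lazy dog.",  # capital Q, trailing period
    "fast and furious was a good movie",
]
results = [has_word_from_each(s, list1, list2) for s in strings]
print(results)
```

On the dataframe this would be used the same way as the list comprehension above: df['Result'] = [has_word_from_each(s, list1, list2) for s in df['string']].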