How do I identify rows containing a minimum number of distinct words from a candidate list?-CodePudding

I have a list of words as well as a dataset. I would like to identify rows within the dataset that have at least two of the words in the list.

I am able to identify rows containing at least two of the list words, but my code problematically also identifies rows where a single list word is repeated.

Here's my code:

import pandas as pd
    
data={'Name':['Redred','redblue','redgreen','blue']}

df=pd.DataFrame(data)

df['Good colours'] = (df['Name'].str.contains("(red.*|blue.*|green.*){2,}",case=False, regex=True))

print(df)

In the resulting dataset, the row 'redred' returns true when it should not.

In an effort to exclude repeated words I've tried (red.*){1,}(blue.*){1,}(green.*){1} but this means that 'redblue' and 'redgreen' is no longer recognized (and it is essential for the code to recognize these rows.)

How do I write the regex so it identifies two instances of the list words while excluding repeated instances of the list word?

CodePudding user response：

You can use

(df['Name'].str.contains(r"^(?!(red|blue|green)\1 $)(?:.*(?:red|blue|green)){2}",case=False, regex=True))

If your strings must only consist of words from the list remove .*, and replace {2} with {2,}$:

(df['Name'].str.contains(r"^(?!(red|blue|green)\1 $)(?:red|blue|green){2,}$",case=False, regex=True))

Details:

^ - start of string
(?!(red|blue|green)\1 $) - a negative lookahead that fails the match if there is
- (red|blue|green) - Group 1: any of the substrings defined in the group
- \1 - one or more repetitions of Group 1 value and then
- $ - end of string
(?:.*(?:red|blue|green)){2} - two occurrences of zero or more chars other than line break chars, as many as possible and then an single occurrence of the substrings defined in the group.
(?:red|blue|green){2,}$ - two or more occurrences of the word from group and then end of string.

See the regex demo #1 and regex demo #2.

CodePudding user response：

Sometimes regex isn't the tool for the job.

I would simply use separate calls to determine which rows contain red, green and blue, and then see which ones are in at least two of those groups.

def name_has_color(df, color):
    return df['Name'].str.contains(color, case=False).astype(int)

color_counts = sum(
    name_has_color(df, color)
    for color in ['red', 'green', 'blue']
)

df['good_colors'] = df[color_counts >= 2]

This approach is naturally extensible and will not get more complex if you need to check for a larger number of contained values or if there are more looked-for values.