I have a list of words as well as a dataset. I would like to identify rows within the dataset that have at least two of the words in the list.
I am able to identify rows containing at least two of the list words, but my code problematically also identifies rows where a single list word is repeated.
Here's my code:
import pandas as pd
data={'Name':['Redred','redblue','redgreen','blue']}
df=pd.DataFrame(data)
df['Good colours'] = (df['Name'].str.contains("(red.*|blue.*|green.*){2,}",case=False, regex=True))
print(df)
In the resulting dataset, the row 'redred' returns true when it should not.
In an effort to exclude repeated words I've tried (red.*){1,}(blue.*){1,}(green.*){1}
but this means that 'redblue' and 'redgreen' is no longer recognized (and it is essential for the code to recognize these rows.)
How do I write the regex so it identifies two instances of the list words while excluding repeated instances of the list word?
CodePudding user response:
You can use
(df['Name'].str.contains(r"^(?!(red|blue|green)\1 $)(?:.*(?:red|blue|green)){2}",case=False, regex=True))
If your strings must only consist of words from the list remove .*
, and replace {2}
with {2,}$
:
(df['Name'].str.contains(r"^(?!(red|blue|green)\1 $)(?:red|blue|green){2,}$",case=False, regex=True))
Details:
^
- start of string(?!(red|blue|green)\1 $)
- a negative lookahead that fails the match if there is(red|blue|green)
- Group 1: any of the substrings defined in the group\1
- one or more repetitions of Group 1 value and then$
- end of string
(?:.*(?:red|blue|green)){2}
- two occurrences of zero or more chars other than line break chars, as many as possible and then an single occurrence of the substrings defined in the group.(?:red|blue|green){2,}$
- two or more occurrences of the word from group and then end of string.
See the regex demo #1 and regex demo #2.
CodePudding user response:
Sometimes regex isn't the tool for the job.
I would simply use separate calls to determine which rows contain red, green and blue, and then see which ones are in at least two of those groups.
def name_has_color(df, color):
return df['Name'].str.contains(color, case=False).astype(int)
color_counts = sum(
name_has_color(df, color)
for color in ['red', 'green', 'blue']
)
df['good_colors'] = df[color_counts >= 2]
This approach is naturally extensible and will not get more complex if you need to check for a larger number of contained values or if there are more looked-for values.