Count a word but ignore it when the preceding word is capitalized

Time:09-16

I am trying to determine whether the word "McDonald" appears in a cell. However, I want to ignore cases where the word before "McDonald" is capitalized, as in 'Kevin McDonald'. Any suggestions on how to do this with a regex in a pandas DataFrame?

import pandas as pd

data = {'text': ["Kevin McDonald has bought a burger.",
                 "The best burger in McDonald is cheeze buger."]}

df = pd.DataFrame(data)
long_list = ['McDonald', 'Five Guys']

# match any of the words; the group keeps \b applied to every alternative
pattern = r'\b(?:{})\b'.format('|'.join(long_list))

df['count'] = df.text.str.count(pattern)
                                           text
0           Kevin McDonald has bought a burger.
1  The best burger in McDonald is cheeze buger.

Expected output:

                                           text  count
0           Kevin McDonald has bought a burger.      0
1  The best burger in McDonald is cheeze buger.      1

CodePudding user response:

You can try this pattern:

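# a lowercase-initial word, a space, then any of the names (grouped so the prefix applies to each)
pattern = r'\b[a-z]\w*\b (?:{})'.format('|'.join(long_list))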

df['count'] = df.text.str.count(pattern)

CodePudding user response:

IIUC, the goal is to skip matches that are preceded by a capitalized word. Requiring a non-capitalized word before the match would exclude many legitimate cases.
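
To illustrate the point, here is a minimal sketch. The `strict` pattern and the third sample sentence are illustrative assumptions, not code from the answers here. The pattern requires a word starting with a lowercase letter directly before the name, so a sentence that starts with the name is not counted:

```python
import pandas as pd

long_list = ['McDonald', 'Five Guys']

# Illustrative pattern: a lowercase-initial word, one space, then a listed name.
strict = r'\b[a-z]\w* (?:{})'.format('|'.join(long_list))

s = pd.Series([
    "Kevin McDonald has bought a burger.",           # capitalized word before: not counted
    "The best burger in McDonald is cheeze buger.",  # lowercase word before: counted
    "McDonald's restaurants.",                       # no word before: missed
])

counts = s.str.count(strict).tolist()
print(counts)  # [0, 1, 0]
```

The last row is the kind of case that gets lost when a lowercase word is mandatory.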

Here is a regex that handles a few more cases (start of the string, a non-word character before the match):

regex = '|'.join(fr'(?:\b[^A-Z]\S*\s+|[^\w\s] ?|^){i}' for i in long_list)
df['count'] = df['text'].str.count(regex)

Example:

                                           text  count
0           Kevin McDonald has bought a burger.      0
1  The best burger in McDonald is cheeze buger.      1
2                       McDonald's restaurants.      1
3                 Blah. McDonald's restaurants.      1
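
The table above can be reproduced with a short self-contained sketch. It assumes the whitespace before the name is matched with `\s+`; the comments describing each alternative are my reading of the pattern rather than the answerer's own explanation.

```python
import pandas as pd

long_list = ['McDonald', 'Five Guys']

# Contexts allowed directly before a name:
#   \b[^A-Z]\S*\s+  a token that does not start with an uppercase ASCII letter, then whitespace
#   [^\w\s] ?       a punctuation character, optionally followed by one space
#   ^               the start of the string
regex = '|'.join(fr'(?:\b[^A-Z]\S*\s+|[^\w\s] ?|^){name}' for name in long_list)

df = pd.DataFrame({'text': [
    "Kevin McDonald has bought a burger.",
    "The best burger in McDonald is cheeze buger.",
    "McDonald's restaurants.",
    "Blah. McDonald's restaurants.",
]})

df['count'] = df['text'].str.count(regex)
print(df['count'].tolist())  # [0, 1, 1, 1]
```

Note that `[^A-Z]` only covers ASCII capitals, so a preceding word such as "Émile" would not be treated as capitalized by this sketch.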

