I am trying to determine whether the word "McDonald" appears in a cell. However, I want to ignore cases where the word before "McDonald" starts with a capital letter, as in 'Kevin McDonald'. Any suggestions on how to do this with a regex in a pandas DataFrame?
import pandas as pd

data = {'text': ["Kevin McDonald has bought a burger.",
                 "The best burger in McDonald is cheeze buger."]}
df = pd.DataFrame(data)
long_list = ['McDonald', 'Five Guys']
# match any of the words; the group makes \b apply to every alternative
pattern = r'\b(?:{})\b'.format('|'.join(long_list))
df['count'] = df.text.str.count(pattern)
                                           text
0           Kevin McDonald has bought a burger.
1  The best burger in McDonald is cheeze buger.
Expected output:
                                           text  count
0           Kevin McDonald has bought a burger.      0
1  The best burger in McDonald is cheeze buger.      1
CodePudding user response:
You can try this pattern:
# require a lowercase-initial word right before the name; the group applies this to every name
pattern = r'\b[a-z]\w* (?:{})\b'.format('|'.join(long_list))
df['count'] = df.text.str.count(pattern)
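A minimal, self-contained sketch of this idea on the question's data, in case it helps to try it out. It assumes a grouped variant of the pattern (a lowercase-initial word, one space, then any of the names) and uses `re.escape` on each name, which is an addition not in the original pattern:

```python
import re
import pandas as pd

df = pd.DataFrame({'text': ["Kevin McDonald has bought a burger.",
                            "The best burger in McDonald is cheeze buger."]})
long_list = ['McDonald', 'Five Guys']

# Escape each name and wrap the alternation in a non-capturing group so the
# lowercase-word prefix applies to every name, not only the first one.
names = '|'.join(re.escape(name) for name in long_list)
pattern = r'\b[a-z]\w* (?:{})\b'.format(names)

# Count the matches per row.
df['count'] = df['text'].str.count(pattern)
print(df)
```

With these two rows the counts should come out as 0 and 1. Note that this only matches when a lowercase word directly precedes the name, so a name at the very start of the text is not counted.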
CodePudding user response:
IIUC, the goal is to skip a match when the preceding word is capitalized. Requiring a non-capitalized word before the name would exclude many legitimate cases, such as a name at the start of the text or right after punctuation.
Here is a regex that handles a few more possibilities (start of string, non-word character before):
regex = '|'.join(fr'(?:\b[^A-Z]\S*\s+|[^\w\s] ?|^){i}' for i in long_list)
df['count'] = df['text'].str.count(regex)
Example:
                                           text  count
0           Kevin McDonald has bought a burger.      0
1  The best burger in McDonald is cheeze buger.      1
2                       McDonald's restaurants.      1
3                 Blah. McDonald's restaurants.      1
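To reproduce the table, here is a runnable sketch. It assumes the whitespace part of the regex is `\s+` (one or more whitespace characters) and adds `re.escape` around each name. The `lowercase_only` column is a hypothetical extra, included only to show what a "lowercase word before" check would miss:

```python
import re
import pandas as pd

df = pd.DataFrame({'text': ["Kevin McDonald has bought a burger.",
                            "The best burger in McDonald is cheeze buger.",
                            "McDonald's restaurants.",
                            "Blah. McDonald's restaurants."]})
long_list = ['McDonald', 'Five Guys']

# Three allowed contexts before a name:
#   1. a token that does not start with A-Z, followed by whitespace (assumed \s+)
#   2. a punctuation character, optionally followed by one space
#   3. the start of the string
regex = '|'.join(
    fr'(?:\b[^A-Z]\S*\s+|[^\w\s] ?|^){re.escape(name)}' for name in long_list
)
df['count'] = df['text'].str.count(regex)

# For comparison (hypothetical column): require a lowercase word before the name.
names = '|'.join(re.escape(name) for name in long_list)
df['lowercase_only'] = df['text'].str.count(fr'\b[a-z]\w* (?:{names})\b')
print(df)
```

On these four rows `count` should be 0, 1, 1, 1, while `lowercase_only` should be 0, 1, 0, 0, since rows 2 and 3 have no lowercase word directly before the name.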