I have this dictionary containing one string:
d = {'col1': ['Digital Forms - how to spousal information on DF 2,0']}
I turned it into a DataFrame:
df = pd.DataFrame(d)
From this dataframe, I want to match this list of words:
wordlist = ['Digital Forms', 'how', 'spousal', 'DF 2.0']
I used the findall function with a regex to return the matches:
words = df['col1'].str.findall(r"\b(" + '|'.join(wordlist) + r")\b", flags=re.IGNORECASE)
This was the result:
[Digital Forms, how, spousal, DF 2,0]
I want to get rid of DF 2,0, as it is not supposed to be part of the result. I know that in regex the dot (.) is a special character that matches any character. In this case, the dot in DF 2.0 matches the comma in DF 2,0. I tried modifying my script to include something like '\\.' so the dot would be treated literally, but nothing worked for me.
Can someone help me modify the following so that the dot is treated as a literal character?
df['col1'].str.findall(r"\b(" + '|'.join(wordlist) + r")\b", flags=re.IGNORECASE)
Answer:
You can build a regex alternation from your word list, using re.escape to escape the metacharacters in each entry:
wordlist = ['Digital Forms', 'how', 'spousal', 'DF 2.0']
regex = r'\b(' + '|'.join(re.escape(x) for x in wordlist) + r')\b'
words = df['col1'].str.findall(regex, flags=re.IGNORECASE)
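To see why escaping matters, here is a minimal sketch that compares the two patterns on the sample string from the question. It uses plain re.findall rather than pandas so it runs with the standard library only. The variable names (text, unescaped, escaped) are illustrative and not part of the original code.

```python
import re

text = 'Digital Forms - how to spousal information on DF 2,0'
wordlist = ['Digital Forms', 'how', 'spousal', 'DF 2.0']

# Unescaped: the dot in 'DF 2.0' matches any character, including the comma.
unescaped = r'\b(' + '|'.join(wordlist) + r')\b'

# Escaped: re.escape turns '.' into '\.', so only a literal dot matches.
escaped = r'\b(' + '|'.join(re.escape(w) for w in wordlist) + r')\b'

unescaped_matches = re.findall(unescaped, text, flags=re.IGNORECASE)
escaped_matches = re.findall(escaped, text, flags=re.IGNORECASE)

print(unescaped_matches)  # ['Digital Forms', 'how', 'spousal', 'DF 2,0']
print(escaped_matches)    # ['Digital Forms', 'how', 'spousal']
```

The escaped pattern should behave the same way when passed to df['col1'].str.findall, since that method applies the regex to each string in the column.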