function returns only None values when replacing pandas column values by regex match

Time:11-07

Goal: replace the values in column que_text with the text matched by an re.search pattern, or with None where there is no match.

Problem: I receive only None values in the que_text_new column, although the regex pattern has been thoroughly tested.

import re

def override(s):
    # Capture everything after "Staatsregierung" (group 5 of the pattern)
    x = re.search(r'(an|frage(\s+ich)?)\s+d(i|ı)e\s+Staatsreg(i|ı)erung(.*)(Dresden(\.|,|\s+)?)?', str(s), flags=re.DOTALL | re.MULTILINE)
    if x:
        return x.group(5)
    return None

df2['que_text_new'] = df2['que_text'].apply(override)

What am I doing wrong? Removing return None doesn't help. I assume there must be some structural error within my function.

Answer:

You can use a pattern with a single capturing group, then simply use Series.str.extract and chain .fillna(np.nan) to fill the non-matched values with NaN:

pattern = r'(?s)(?:an|frage(?:\s+ich)?)\s+d[iı]e\s+Staatsreg[iı]erung(.*)'
df2['que_text_new'] = df2['que_text'].astype(str).str.extract(pattern).fillna(np.nan)

I am not sure you need .astype(str), but your code calls str(s), so it may be safer to keep this part.
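To see the approach on concrete data, here is a minimal sketch. The rows in df2 are made up for illustration (the real data is not shown in the question), the pattern is written with `\s+` for whitespace runs, and `expand=False` is my addition so that the single capturing group comes back as a Series rather than a one-column DataFrame. As far as I know, str.extract already yields NaN for rows that do not match.

```python
import pandas as pd

# Hypothetical sample data; the real df2 is not shown in the question.
df2 = pd.DataFrame({
    "que_text": [
        "Ich frage die Staatsregierung: Wie viele Schulen gibt es?",
        "Kleine Anfrage an die Staatsregierung zum Thema Verkehr",
        "Kein passender Text",
        None,  # a missing value, to show what .astype(str) protects against
    ]
})

# One capturing group: everything after "Staatsregierung"
pattern = r'(?s)(?:an|frage(?:\s+ich)?)\s+d[iı]e\s+Staatsreg[iı]erung(.*)'

# expand=False (an assumption of this sketch) returns a Series for a single group
df2["que_text_new"] = (
    df2["que_text"].astype(str).str.extract(pattern, expand=False)
)

print(df2["que_text_new"].tolist())
```

The first two rows should keep the text after "Staatsregierung", while the non-matching row and the missing value end up as NaN.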

Here,

  • Capturing groups with single char alternatives are converted to character classes, e.g. (i|ı) -> [iı]
  • Other capturing groups are converted to non-capturing ones, i.e. ( -> (?:.
  • To make np.nan work do not forget to import numpy as np.
  • (?s) is an in-pattern re.DOTALL option.
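The conversions listed above can be checked with the standard re module alone. This is only a sketch: the sample text is invented, both patterns are written with `\s+` for whitespace runs, and the names original and simplified are mine. It compares the number of capturing groups in each pattern and confirms that group 5 of the question's pattern and group 1 of the rewritten one capture the same text for this input.

```python
import re

# Pattern from the question: seven capturing groups, the wanted text is group 5
original = re.compile(
    r'(an|frage(\s+ich)?)\s+d(i|ı)e\s+Staatsreg(i|ı)erung(.*)(Dresden(\.|,|\s+)?)?',
    flags=re.DOTALL,
)

# Rewritten pattern: character classes and non-capturing groups,
# so the wanted text is the only capturing group; (?s) stands in for re.DOTALL
simplified = re.compile(
    r'(?s)(?:an|frage(?:\s+ich)?)\s+d[iı]e\s+Staatsreg[iı]erung(.*)'
)

# Hypothetical sample text containing a line break
text = "Ich frage die Staatsregierung:\nWie ist der Stand?"

m_old = original.search(text)
m_new = simplified.search(text)

print(original.groups, simplified.groups)   # number of capturing groups in each
print(m_old.group(5) == m_new.group(1))     # do both capture the same text?
```

With a single capturing group, Series.str.extract produces exactly one result column, which is why the rewrite is convenient here.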