Fastest way to go through regex patterns and return labels in dataframe

I'm trying to figure out how to speed up the regex process below.

I have a few strings of regex patterns, and I use these unioned patterns to label a column in a Pandas DataFrame.

For example, one unioned regex string looks like the one below; if any of its patterns match a value in the Pandas column, that row is labeled "Auto":

#example of unioned regex pattern string
auto_pattern = r'''(
                  (ACURA)|
                  (ALFA[ ]ROMEO)|            # [ ] keeps the space literal under re.X
                  (\bAUDI\b)|
                  (BMW\s?FINANCIAL)|
                  (FERRARI)|
                  (FIAT)|
                  (FORD[ ]MO?TO?R)|          # [ ] keeps the space literal under re.X
                  ((?=.*GMC?).*?(?=.*BUICK))|
                  (HONDA)|
                  (HYUNDAI)|
                  (INFINITI)|
                  (JAGUAR.*?(LAND))|
                  (JEEP(CHRYSL?E?R?)?)|
                  (\sKIA\s)|
                  (LEXUS)|
                  (MASERATI)|
                  (MAZDA)|
                  (MBFS\.COM)|
                  (MERCEDES\S?(BENZ)?)|
                  (NISSAN)|
                  (PORSCHE)|
                  (ROLLS\s?ROYCE)|
                  (TESLA)|
                  (TOYOTA)|
                  (VCFS).*(LEASE)|
                  (VOLVO))(?!.*PAYROLL)
               '''

The below labeler function will be used on the DataFrame:

import re

#function that applies patterns to dataframe
def labeler(text_string):
      if re.search(typeone_pattern, text_string, flags=re.I) and 'Random Word Here' not in text_string:
            return 'Label 1'
      if re.search(auto_pattern, text_string, flags=re.I | re.X):
            return 'Label Auto'
      if re.search(typetwo_pattern, text_string, flags=re.I):
            return 'Label 2'
      
      # a few more if re.search checks here......

      return 'Other'


df['label'] = df['string_column'].apply(labeler)

Any ideas on how to make this more efficient/faster? Currently for 1 million rows it takes about 1 minute.

I tried using a Trie data structure, but that seems to work only with words, not regex patterns per se. From my research, it seems the fastest way to loop through regex patterns is to union them into a single string, as I've done above.

Maybe my labeler function can be more efficient? Or should I use something other than .apply in Pandas? Not sure...

Any ideas are greatly appreciated.

CodePudding user response：

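At the very least, you can precompile your regular expressions instead of passing the pattern strings to the re.* top-level functions on every call. Those functions do cache compiled patterns internally, so the pattern is not fully recompiled for each row, but every call still pays for a cache lookup before matching.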
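A rough way to see how much precompiling saves on your machine is a small timeit comparison. The sketch below uses a made-up three-brand pattern and sample string (both are placeholders, not the real patterns from the question). Absolute timings will vary, and since re keeps a cache of compiled patterns, the gain may turn out to be modest.

```python
import re
import timeit

# Hypothetical stand-ins for the real pattern and data.
PATTERN = r"(ACURA)|(HONDA)|(VOLVO)"
TEXT = "PAYMENT TO HONDA FINANCIAL SERVICES"

compiled = re.compile(PATTERN, flags=re.I)


def with_module_function():
    # re.search has to find the compiled pattern in re's cache on every call.
    return re.search(PATTERN, TEXT, flags=re.I)


def with_compiled_pattern():
    # The compiled object goes straight to matching.
    return compiled.search(TEXT)


if __name__ == "__main__":
    n = 100_000
    print("re.search      :", timeit.timeit(with_module_function, number=n))
    print("compiled.search:", timeit.timeit(with_compiled_pattern, number=n))
```

Both functions return the same match; only the per-call overhead differs.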

First, instead of storing the pattern string in a variable, compile it on the spot and store the compiled regular expression:

#example of unioned regex pattern string
auto_pattern = re.compile(r'''(
                  (ACURA)|
                  ...
                  (VOLVO))(?!.*PAYROLL)
               ''',
               flags=re.I | re.X)

(Do the same for the other variables, typeone_pattern and typetwo_pattern, passing each one the same flags it previously received in re.search.)
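Two things are worth checking while you move the patterns into re.compile, since both change what a pattern matches without raising an error: under re.X, unescaped whitespace inside the pattern is ignored, and without the r prefix "\b" is a backspace character rather than a word boundary. A minimal sketch, with shortened patterns used purely for illustration:

```python
import re

# Under re.X, the plain space is dropped, so this is really "ALFAROMEO".
loose = re.compile(r"(ALFA ROMEO)", flags=re.I | re.X)

# A character class keeps the space literal even in verbose mode.
strict = re.compile(r"(ALFA[ ]ROMEO)", flags=re.I | re.X)

# In a normal string literal, "\b" is the backspace character (\x08).
backspace = re.compile("\bAUDI\b", flags=re.I)

# In a raw string, \b reaches the regex engine as a word boundary.
boundary = re.compile(r"\bAUDI\b", flags=re.I)

print(bool(loose.search("ALFA ROMEO LEASE")))    # False
print(bool(strict.search("ALFA ROMEO LEASE")))   # True
print(bool(backspace.search("AUDI FINANCIAL")))  # False
print(bool(boundary.search("AUDI FINANCIAL")))   # True
```

Writing multi-line patterns as raw strings (r'''...''') and spelling literal spaces as [ ] or \s avoids both problems.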

Then, instead of calling re.search(<var>, 'text'), you can call <var>.search('text'):

#function that applies patterns to dataframe
def labeler(text_string):
      if typeone_pattern.search(text_string) and 'Random Word Here' not in text_string:
            return 'Label 1'
      if auto_pattern.search(text_string):
            return 'Label Auto'
      if typetwo_pattern.search(text_string):
            return 'Label 2'
      
      # a few more <pattern>.search checks here......

      return 'Other'


df['label'] = df['string_column'].apply(labeler)
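If you want something you can run as-is, here is a self-contained sketch of the same idea. The three patterns are short, invented stand-ins (the real typeone_pattern and typetwo_pattern are not shown in the question), and the RULES list is just one possible way to keep the remaining checks in priority order without a long chain of if statements.

```python
import re

# Hypothetical, shortened patterns standing in for the real ones.
typeone_pattern = re.compile(r"PAYROLL|SALARY", flags=re.I)
auto_pattern = re.compile(
    r"""(
        (ACURA)|
        (\bAUDI\b)|
        (VOLVO)
    )(?!.*PAYROLL)""",
    flags=re.I | re.X,
)
typetwo_pattern = re.compile(r"GROCERY|SUPERMARKET", flags=re.I)

# (compiled pattern, label) pairs, checked in priority order.
RULES = [
    (auto_pattern, "Label Auto"),
    (typetwo_pattern, "Label 2"),
]


def labeler(text_string):
    # The first check has an extra substring condition, so it stays explicit.
    if typeone_pattern.search(text_string) and "Random Word Here" not in text_string:
        return "Label 1"
    for pattern, label in RULES:
        if pattern.search(text_string):
            return label
    return "Other"


samples = [
    "ACME PAYROLL",
    "VOLVO CAR LEASE",
    "audi of downtown",
    "CITY SUPERMARKET",
    "COFFEE SHOP",
    "PAYROLL Random Word Here",
]
labels = [labeler(s) for s in samples]
print(labels)
```

With a DataFrame, the same function plugs into df['string_column'].apply(labeler) exactly as before.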