Home > Software engineering >  Filter pandas dataframe by complex dynamic conditions
Filter pandas dataframe by complex dynamic conditions

Time:05-05

I need to filter down a pandas dataframe based on conditions for multiple columns. I got these conditions from a dict config file like this:

config = {
 "PLANT_ID": ["KD"],
 "CO_CD": ["V", "R"]
 }

What this means is that I need to filter down the dataset like: if (PLANT_ID starts with KD) or (CO_CD startswith V or R) then I should keep that record. There can be more than 2 columns specified, and more than 2 strings in the list.

I know I can use startswith and convert the list to tuples like this:

df.PLANT_ID.str.startswith(tuple(config['PLANT_ID']))

But I somehow need to write this condition to dynamically pick the column names from the config dict.

CodePudding user response:

IIUC, you can craft a regex for each item in your initial dictionary, then apply it using str.startswith to each column and aggregate with any:

import re
regex = {k: "|".join(map(re.escape, l)) for k,l in config.items()}

m = df[list(config)].apply(lambda c: c.str.startswith(regex[c.name])).any(1)

df2 = df[m]
  • Related