I need to filter down a pandas dataframe based on conditions for multiple columns. I got these conditions from a dict config file like this:
config = {
"PLANT_ID": ["KD"],
"CO_CD": ["V", "R"]
}
What this means is that I need to filter down the dataset like: if (PLANT_ID starts with KD) or (CO_CD startswith V or R) then I should keep that record. There can be more than 2 columns specified, and more than 2 strings in the list.
I know I can use startswith and convert the list to tuples like this:
df.PLANT_ID.str.startswith(tuple(config['PLANT_ID']))
But I somehow need to write this condition to dynamically pick the column names from the config dict.
CodePudding user response:
IIUC, you can craft a regex for each item in your initial dictionary, then apply
it using str.startswith
to each column and aggregate with any
:
import re
regex = {k: "|".join(map(re.escape, l)) for k,l in config.items()}
m = df[list(config)].apply(lambda c: c.str.startswith(regex[c.name])).any(1)
df2 = df[m]