I have a codebase with a bash script that runs a job that includes filtering the rows of a pandas data frame with multiple criteria, i.e.:
df.loc[(df['A'] > 5) &
df['B'].isin([1,5,9]) &
(df['C'] == "blue")]
Suppose I want this to be generalizable so that anyone using this codebase can specify their own list of filter criteria when running the bash script. I can imagine this would be relatively straightforward if we were just selecting specific values using a json string that is converted to a python dictionary:
filter_criteria={"B": [1,5,9], "C": ["blue"]}
for col, values in filter_criteria.items():
df = df.loc[df[col].isin(values)]
The one thing I can't think of how to accommodate is >
and <
or potentially down the road, the ability to do more complex filtering (e.g. df.loc[(df['A']<5) | (df['A']>10)
). Does anyone have any ideas on how the operator can be specified as part of the json dictionary?
CodePudding user response:
Create a query string and use query
to filter your dataframe.
Suppose this dataframe:
>>> df
A B C
0 1 2 blue
1 3 3 red
2 5 5 yellow
3 7 6 blue
4 9 9 blue
qs = "A > 5 & B.isin([1,5,9]) & C == 'blue'"
out = df.query(qs)
print(out)
# Output:
A B C
4 9 9 blue