I am trying to write a function that will select a subset of a pd.DataFrame
. I want to use multiple selection values for different columns.
For example:
column_a matches 'a' or 'b' or ...
column_b matches 'A' or 'B' or 'X' ...
CodePudding user response:
If you provide your values as variable args or a list you can use pd.Series.isin
# Using `*args`
def select_args(df, column, *args):
return df[df[column].isin(args)]
# Using a list
def select_list(df, column, list_):
return df[df[column].isin(list_)]
Use like so:
>>> df_sub = select_args(df, "indicator", "A", "B", "C")
>>> df_sub = select_list(df, "indicator", ["A", "B", "C"])
For using multiple columns, with many different variables you should structure these data as pairs. For example, you could use a dict
where each key is a column
and the value is a list
:
search_params = {
"column_a": ["a", "b", "c"],
"column_b": ["A", "B", "C"],
}
Then you can create your boolean arrays, combine them using np.all
and subset your DataFrame:
def select_multiple(df, selectors):
bools = [df[col].isin(vals).to_numpy() for col, vals in selectors.items()]
return df[np.all(bools, axis=0)]
>>> df_sub = select_multiple(df, search_params)