Let's say I have a dataframe with the columns dataset, inflation, cat, region, and some other values.
I want to filter my df using a list (or tuple) that holds, in order, the values of those columns that I want to select.
My filter list will look like
filterlist=['abc', 'real', 'food', 'central']
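For reference, a minimal self-contained setup would be something like this (the sample data is made up to match the filter list):

import pandas as pd

# Hypothetical sample data, invented to match the filter list above.
DF = pd.DataFrame({
    'dataset':   ['abc', 'abc', 'xyz'],
    'inflation': ['real', 'nominal', 'real'],
    'cat':       ['food', 'food', 'energy'],
    'region':    ['central', 'central', 'north'],
    'value':     [1.0, 2.0, 3.0],
})
filterlist = ['abc', 'real', 'food', 'central']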
What I'm doing now is
DF.loc[(DF['dataset']==filterlist[0]) & (DF['inflation']==filterlist[1]) & (DF['cat']==filterlist[2]) & (DF['region']==filterlist[3])]
I'd like to do something more like this:
id_cols=['dataset', 'inflation', 'cat', 'region']
DF.loc[[DF[x]==y for x,y in zip(id_cols, filterlist)]]
I hacked this together, which technically works but isn't very readable, looks clunky, and has terrible %%timeit performance (1.18 ms vs 654 µs):
DF.iloc[list(set.intersection(*[set(DF[DF[id_cols[x]]==filterlist[x]].index) for x in range(len(id_cols))])),:]
I also tried:
DF.query(' and '.join([f"{x} == '{y}'" for x,y in zip(id_cols, filterlist)]))
This has the benefit of being fairly readable, but the performance is even worse. There's got to be a better way!!
CodePudding user response:
The answer is np.logical_and.reduce. Thanks to this answer, I found that I can do:
DF[np.logical_and.reduce([DF[x]==y for x,y in zip(id_cols, filterlist)])]
The %%timeit on this one is just 470 µs, which is even better than the parenthesis-laden example.
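A minimal self-contained sketch of this approach, assuming made-up sample data (only the mask line is the technique from above):

import numpy as np
import pandas as pd

# Made-up sample data for illustration.
DF = pd.DataFrame({
    'dataset':   ['abc', 'abc'],
    'inflation': ['real', 'nominal'],
    'cat':       ['food', 'food'],
    'region':    ['central', 'central'],
})
id_cols = ['dataset', 'inflation', 'cat', 'region']
filterlist = ['abc', 'real', 'food', 'central']

# One boolean Series per (column, value) pair, AND-ed together element-wise.
mask = np.logical_and.reduce([DF[x] == y for x, y in zip(id_cols, filterlist)])
out = DF[mask]  # keeps only the first row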
CodePudding user response:
Use 2D comparison:
cols = ['dataset', 'inflation', 'cat', 'region']
out = df[df[cols].eq(filterlist).all(1)]
output:
dataset inflation cat region
0 abc real food central
example input df:
dataset inflation cat region
0 abc real food central
1 abc X food central
2 abc real X central
intermediates:
df[cols].eq(filterlist)
dataset inflation cat region
0 True True True True
1 True False True True
2 True True False True
df[cols].eq(filterlist).all(1)
0 True
1 False
2 False
dtype: bool
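For completeness, a runnable sketch that reproduces the example above (the DataFrame construction is reconstructed from the printed input; the filter line is the one-liner from this answer):

import pandas as pd

# Reconstruction of the example input df shown above.
df = pd.DataFrame({
    'dataset':   ['abc', 'abc', 'abc'],
    'inflation': ['real', 'X', 'real'],
    'cat':       ['food', 'food', 'X'],
    'region':    ['central', 'central', 'central'],
})
cols = ['dataset', 'inflation', 'cat', 'region']
filterlist = ['abc', 'real', 'food', 'central']

# .eq broadcasts filterlist across the columns (one value per column);
# .all(axis=1) keeps rows where every per-column comparison is True.
out = df[df[cols].eq(filterlist).all(axis=1)]
print(out)  # -> row 0 only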