Home > Blockchain >  filter pandas df with list comprehension instead of multiple ampersands
filter pandas df with list comprehension instead of multiple ampersands

Time:07-20

Let's say I have a dataframe with the columns dataset, inflation, cat, region, and some other values.

I want to filter my df according to the values of a list (or tuple) where the list has the values of those columns that I want to select.

My filter list will look like

filterlist=['abc', 'real', 'food', 'central']

What I'm doing now is

DF.loc[(DF['dataset']==filterlist[0]) & (DF['inflation']==filterlist[1]) & (DF['cat']==filterlist[2]) & (DF['region']==filterlist[3])]

I'd like to do something more like this:

id_cols=['dataset', 'inflation', 'cat', 'region']
DF.loc[[DF[x]==y for x,y in zip(id_cols, filterlist)]]

I hacked this together which technically works but isn't very readable, just looks clunky, and has terrible %%timeit performance (1.18ms vs 654 µs)

DF.iloc[list(set.intersection(*[set(DF[DF[id_cols[x]]==filterlist[x]].index) for x in range(len(id_cols))])),:]

I also did

DF.query(' and '.join([f"{x} == '{y}'" for x,y in zip(id_cols, filterlist)]))

This has the benefit of being fairly readable but is even worse performance. There's got to be a better way!!

CodePudding user response:

The answer is np.logical_and.reduce

Thanks to this answer

I found that I can do:

DF[np.logical_and.reduce([DF[x]==y for x,y in zip(id_cols, filterlist)])]

The %%timeit on this one is just 470 µs which is even better than the parenthesis laden example.

CodePudding user response:

Use 2D comparison:

cols = ['dataset', 'inflation', 'cat', 'region']

out = df[df[cols].eq(filterlist).all(1)]

output:

  dataset inflation   cat   region
0     abc      real  food  central

example input df:

  dataset inflation   cat   region
0     abc      real  food  central
1     abc         X  food  central
2     abc      real     X  central

intermediates:

df[cols].eq(filterlist)
   dataset  inflation    cat  region
0     True       True   True    True
1     True      False   True    True
2     True       True  False    True

df[cols].eq(filterlist).all(1)
0     True
1    False
2    False
dtype: bool
  • Related