Dynamic comparison of values of n multiple Pandas columns-CodePudding

Let's say a user can input the columns and values to compare for a DF, so we can have:

column_list = ['col1', 'col2', 'col3']
value_list = [val1, val2, val3]

So to select the rows that satisfy where col1 >= val1 AND col2 >= val2 AND col3 >= val3 we would write:

selection = (df['col1'] >= val1) & (df['col2'] >= val2) & (df['col3'] >= val3))

or it can be in the form:

selection  = df.loc[(df['col1'] >= val1) & (df['col2'] >= val2) & (df['col3'] >= val3)]

The number of columns is not known in advance, so we can have n columns. We can try this approach:

if n=1:
   selection = (df['col1'] >= val1))
elif n=2:
   selection = (df['col1'] >= val1) & (df['col2'] >= val2))
elif n=3:
   selection = (df['col1'] >= val1) & (df['col2'] >= val2) & (df['col3'] >= val3))

But this is neither scalable nor efficient. I tried by generating strings "df['col<>'] > val<>)" with a foor loop given the input lists but it didn't work for Pandas.

What would be the best pythonic approach for this? To avoid having all the options with if and else statements.

Thank you in advance!

CodePudding user response：

To perform a comparison with the same operator for all columns, create a Series with the values and columns ids and use it to perform an aligned comparison with the dataframe:

df[df.gt(pd.Series(value_list, index=column_list)).all(1)]

Example input:

>>> value_list
[3, 7, 11]
>>> df
   col1  col2  col3
0     0     1     2
1     3     4     5
2     6     7     8
3     9    10    11
4    12    13    14

output:

   col1  col2  col3
4    12    13    14

intermediates:

>>> pd.Series(value_list, index=column_list)
col1     3
col2     7
col3    11

>>> df.gt(pd.Series(value_list, index=column_list))
    col1   col2   col3
0  False  False  False
1  False  False  False
2   True  False  False
3   True   True  False
4   True   True   True

>>> df.gt(pd.Series(value_list, index=column_list)).all(1)
0    False
1    False
2    False
3    False
4     True

CodePudding user response：

With a df like this,

And values, columns, and arbitrary operators,

In [2]: import operator

In [3]: values = [1, 2, 7]

In [4]: columns = ['a', 'b', 'c']

In [5]: operators = [operator.gt, operator.ge, operator.le]  # >,  >=,  <=

Make a copy of df and iterate over the zipped items:

In [6]: selection = df.copy()

In [7]: for col, op, val in zip(columns, operators, values):
   ...:     selection = selection[op(selection[col], val)]
   ...:

In [8]: selection
Out[8]:
   a  b  c
2  6  2  6
3  2  6  3
5  5  9  0

Of course, if you don't know ahead of time how many columns there are, then it seems as though you also likely don't know the operators ahead of time either, which sort of defeats the purpose. This would become much easier if you only had to use a single operator but it looks like that's not the case, judging by your examples.

If your original post does indeed have typos (should all comparisons be >, or all of them >=?) and you actually intend to perform a single comparison operation, see this answer from @mozway.

CodePudding user response：

Here is another possibility to handle an arbitrary long list of comparisons (I am posting separately as the approach is quite different):

column_list = ['col1', 'col2', 'col3']
value_list = [3, 7, 11]
operator_list = [pd.Series.gt, pd.Series.ge, pd.Series.gt]

op_dic = dict(zip(column_list, operator_list))
val_dic = dict(zip(column_list, value_list))

df[df.apply(lambda c: op_dic[c.name](c, val_dic[c.name])).all(1)]

How it works:

Using apply, we perform a custom operation on all columns that will return a boolean per row, then take the rows that are True for all.

input:

   col1  col2  col3
0     0     1     2
1     3     4     5
2     6     7     8
3     9    10    11
4    12    13    14

output:

   col1  col2  col3
4    12    13    14

NB. to be more concise one can also do:

from operator import gt, ge
operator_list = [gt, ge, gt]