Home > Enterprise >  How to Test multiple columns of pandas for a condition at once and update them
How to Test multiple columns of pandas for a condition at once and update them

Time:11-21

I have a data frame like this

test = pd.DataFrame({'col1':[10,20,30,40], 'col2':[5,10,15,20], 'col3':[6,12,18,24]})
test

Data Frame looks like

   col1 col2 col3
0   10  5    6
1   20  10   12
2   30  15   18
3   40  20   24 

I wanna replace the values which are greater than 10 in col2 or col3 with zero. I wanna use loc function for this purpose.
My desired output is:

   col1 col2 col3
0   10  5    6
1   20  10   0
2   30  0    0
3   40  0    0

I have tried following solution

cols_to_update = ['col2', 'col3']
test.loc[test[cols_to_update]>10]=0
test

It shows following error:

KeyError: "None of [Index([('c', 'o', 'l', '1'), ('c', 'o', 'l', '2')], dtype='object')] are in the [index]"

When I use single column to test the condition it doesn't show 'KeyError' but now it replaces values in other two columns also

test.loc[test['col2']>10]=0
test

Output is

   col1 col2 col3
0   10  5    6
1   0   0    0
2   0   0    0
3   0   0    0

Please guide

  1. Can we use loc for this purpose
  2. Why loc behaving like this
  3. What is the efficient solution I have gone through similar questions but not found exact explanation. I am ML student and new to this plate form, pardon me if question is not properly phrased or formatted. Thank you

CodePudding user response:

I would use numpy.where to conditionally replace values of multiple columns:

import numpy as np

cols_to_update = ['col2', 'col3']
test[cols_to_update] = np.where(test[cols_to_update] > 10, 0, test[cols_to_update])

The expression test[cols_to_update] > 10 gives you a boolean mask:

    col2   col3
0  False  False
1  False   True
2   True   True
3   True   True

Then, np.where picks the value 0 whenever this mask is True or it picks the corresponding original data test[cols_to_update] whenever the mask is False.


Your solution test.loc[test[cols_to_update]>10]=0 doesn't work because loc in this case would require a boolean 1D series, while test[cols_to_update]>10 is still a DataFrame with two columns. This is also the reason why you cannot use loc for this problem (at least not without looping over the columns): The indices where the values of columns 2 and 3 meet the condition > 10 are different.

When would loc be appropriate in this case? For example if you wanted to set both columns 2 and 3 to zero when any of the two is greater than 10:

test.loc[(test[cols_to_update] > 10).any(axis=1), cols_to_update] = 0
test
# out:
   col1  col2  col3
0    10     5     6
1    20     0     0
2    30     0     0
3    40     0     0

In this case you index with a 1D Series ((test[cols_to_update] > 10).any(axis=1)), which is an appropriate use case for loc.

CodePudding user response:

You can use where:

import pandas as pd

test = pd.DataFrame({'col1':[10,20,30,40], 'col2':[5,10,15,20], 'col3':[6,12,18,24]})
test[['col2', 'col3']] = test[['col2', 'col3']].where(test[['col2', 'col3']] <= 10, 0)

output:

col1 col2 col3
0 10 5 6
1 20 10 0
2 30 0 0
3 40 0 0
  • Related