I have a data frame like this
test = pd.DataFrame({'col1':[10,20,30,40], 'col2':[5,10,15,20], 'col3':[6,12,18,24]})
test
Data Frame looks like
col1 col2 col3
0 10 5 6
1 20 10 12
2 30 15 18
3 40 20 24
I wanna replace the values which are greater than 10 in col2 or col3 with zero. I wanna use loc function for this purpose.
My desired output is:
col1 col2 col3
0 10 5 6
1 20 10 0
2 30 0 0
3 40 0 0
I have tried following solution
cols_to_update = ['col2', 'col3']
test.loc[test[cols_to_update]>10]=0
test
It shows following error:
KeyError: "None of [Index([('c', 'o', 'l', '1'), ('c', 'o', 'l', '2')], dtype='object')] are in the [index]"
When I use single column to test the condition it doesn't show 'KeyError' but now it replaces values in other two columns also
test.loc[test['col2']>10]=0
test
Output is
col1 col2 col3
0 10 5 6
1 0 0 0
2 0 0 0
3 0 0 0
Please guide
- Can we use loc for this purpose
- Why loc behaving like this
- What is the efficient solution I have gone through similar questions but not found exact explanation. I am ML student and new to this plate form, pardon me if question is not properly phrased or formatted. Thank you
CodePudding user response:
I would use numpy.where
to conditionally replace values of multiple columns:
import numpy as np
cols_to_update = ['col2', 'col3']
test[cols_to_update] = np.where(test[cols_to_update] > 10, 0, test[cols_to_update])
The expression test[cols_to_update] > 10
gives you a boolean mask:
col2 col3
0 False False
1 False True
2 True True
3 True True
Then, np.where
picks the value 0
whenever this mask is True
or it picks the corresponding original data test[cols_to_update]
whenever the mask is False
.
Your solution test.loc[test[cols_to_update]>10]=0
doesn't work because loc in this case would require a boolean 1D series, while test[cols_to_update]>10
is still a DataFrame with two columns. This is also the reason why you cannot use loc for this problem (at least not without looping over the columns): The indices where the values of columns 2 and 3 meet the condition > 10
are different.
When would loc
be appropriate in this case? For example if you wanted to set both columns 2 and 3 to zero when any of the two is greater than 10:
test.loc[(test[cols_to_update] > 10).any(axis=1), cols_to_update] = 0
test
# out:
col1 col2 col3
0 10 5 6
1 20 0 0
2 30 0 0
3 40 0 0
In this case you index with a 1D Series ((test[cols_to_update] > 10).any(axis=1)
), which is an appropriate use case for loc
.
CodePudding user response:
You can use where:
import pandas as pd
test = pd.DataFrame({'col1':[10,20,30,40], 'col2':[5,10,15,20], 'col3':[6,12,18,24]})
test[['col2', 'col3']] = test[['col2', 'col3']].where(test[['col2', 'col3']] <= 10, 0)
output:
col1 | col2 | col3 | |
---|---|---|---|
0 | 10 | 5 | 6 |
1 | 20 | 10 | 0 |
2 | 30 | 0 | 0 |
3 | 40 | 0 | 0 |