Home > other >  Replace values in multiple columns using 'numpy.where' function
Replace values in multiple columns using 'numpy.where' function

Time:10-03

I want to change the NA values in multiple columns when certain condition is met. Example data set given below.

    Pool Area   Pool Quality    Pool Type   Pool Condition  Pool Finish
0   800             Good           A              Good        Gunite
1   400             Good           C              Good        Vinyl
2   485             Good           B              Good        Fibreglass
3   360             Poor           C              Poor        Vinyl
4   0               NaN            NaN            NaN         NaN
5   600             Best           A              Best        Gunite
6   500             Best           B              Best        Fibreglass
7   0               NaN            NaN            NaN         NaN
8   750             Best           A              Best        Gunite
9   900             Best           A              Best        Gunite
10  0               NaN            NaN            NaN         NaN
11  900             Best           A              Best        Gunite
12  400             Poor           C              Poor        Fibreglass
13  0               NaN            NaN            NaN         NaN

In the above data, I want to replace the NaN values with 'No Pool' where the 'Pool Area' column has value '0'.

I know I can do it with np.where function and I tried the below code.

df[['Pool Quality', 'Pool Type', 'Pool Condition', 'Pool Finish']] = np.where(df['Pool Area']==0, 'No Pool', df[['Pool Quality', 'Pool Type', 'Pool Condition', 'Pool Finish']])

It isn't working.

I tried it separately, it works (Ref to code below).

df['Pool Quality'] = np.where(df['Pool Area']==0, 'No Pool', df['Pool Quality'])

But when I tried to do for multiple columns in one go, it is not working.

Below is the error I got.

ValueError: operands could not be broadcast together with shapes (2919,) () (2919,5)

NOTE: The above error message is taken from my actual data set where the dimension is 2919 rows and 81 columns.

I don't know what is wrong in my code. Please help me.

CodePudding user response:

Use boolean indexing:

m = df['Pool Area'].eq(0)

df.loc[m] = df.loc[m].fillna('No Pool')
# or
# df[m] = df[m].fillna('No Pool')

# or to limit to given columns
# cols = ['Pool Quality', 'Pool Type', 'Pool Condition', 'Pool Finish']
# df.loc[m, cols] = df.loc[m, cols].fillna('No Pool')

updated df:

    Pool Area Pool Quality Pool Type Pool Condition Pool Finish
0         800         Good         A           Good      Gunite
1         400         Good         C           Good       Vinyl
2         485         Good         B           Good  Fibreglass
3         360         Poor         C           Poor       Vinyl
4           0      No Pool   No Pool        No Pool     No Pool
5         600         Best         A           Best      Gunite
6         500         Best         B           Best  Fibreglass
7           0      No Pool   No Pool        No Pool     No Pool
8         750         Best         A           Best      Gunite
9         900         Best         A           Best      Gunite
10          0      No Pool   No Pool        No Pool     No Pool
11        900         Best         A           Best      Gunite
12        400         Poor         C           Poor  Fibreglass
13          0      No Pool   No Pool        No Pool     No Pool
  • Related