Home > Blockchain >  How can I fill two columns of a dataframe with "np.where"?
How can I fill two columns of a dataframe with "np.where"?

Time:11-14

I am trying to set 2 columns by a condition on a 3rd column. I can set 1 column conditions on another column, and I can set 2 columns on a single condition value, but when I try to set 2 columns by a condition on a column, it fails.

Here is the code example:

import pandas as pd
import numpy as np
AAA = {"column A": [1, 1, 1, 2, 2, 2, 3, 3, 3]}
df = pd.DataFrame(AAA)

If I call:

df["column B"], df["column C"] = np.where(True ,['4', '8'],['NaN', 'NaN'])

I get:

df
   column A column B column C
0         1        4        8
1         1        4        8
2         1        4        8
3         2        4        8
4         2        4        8
5         2        4        8
6         3        4        8
7         3        4        8
8         3        4        8

so I know I can fill 2 columns based on a condition.

If I call:

df["column B"] = np.where( df["column A"] == 2 ,['4'],['NaN'])

I get:

   column A column B column C
0         1      NaN        8
1         1      NaN        8
2         1      NaN        8
3         2        4        8
4         2        4        8
5         2        4        8
6         3      NaN        8
7         3      NaN        8
8         3      NaN        8

so I know I can fill based on a condition on a column. (I assume this is treated as a boolean array)

However, If I call:

df["column B"], df["column C"] = np.where( df["column A"] == 2 ,['4', '8'],['NaN', 'NaN'])

I expect to get

   column A column B column C
0         1      NaN        NaN        
1         1      NaN        NaN        
2         1      NaN        NaN        
3         2        4        8
4         2        4        8
5         2        4        8
6         3      NaN        NaN        
7         3      NaN        NaN        
8         3      NaN        NaN        

but I actually get:

Traceback (most recent call last):
 ... pydev\_pydevd_bundle\pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<string>", line 2, in <module>
  File "<__array_function__ internals>", line 6, in where
ValueError: operands could not be broadcast together with shapes (9,) (2,) (2,) 

Is there a way to do what I am trying to do? I don't want to use 2 separate calls, because the dataframes I need this for are very large.

CodePudding user response:

use loc indexer and give value

df.loc[df['column A'] == 2, ['column B', 'column C']] = [4, 8]

output(df):

    column A    column B    column C
0   1           NaN         NaN
1   1           NaN         NaN
2   1           NaN         NaN
3   2           4.0         8.0
4   2           4.0         8.0
5   2           4.0         8.0
6   3           NaN         NaN
7   3           NaN         NaN
8   3           NaN         NaN

CodePudding user response:

Maybe you can loop outside np.where:

df["column B"], df["column C"] = [np.where( df["column A"] == 2 ,true_val,'NaN') for true_val in ['4','8']]

print(df)
# column A column B column C
# 0         1      NaN      NaN
# 1         1      NaN      NaN
# 2         1      NaN      NaN
# 3         2        4        8
# 4         2        4        8
# 5         2        4        8
# 6         3      NaN      NaN
# 7         3      NaN      NaN
# 8         3      NaN      NaN

CodePudding user response:

Here is one way. While it is not the most elegant code - it should help you understand what is required.

import pandas as pd
import numpy as np

AAA={"column A": [1, 1, 1, 2, 2, 2, 3, 3, 3]}
df = pd.DataFrame(AAA)

col_length = len(df['column A'])
fours = np.repeat(4, col_length, axis =0)
eights = np.repeat(8, col_length, axis =0)
empties = np.repeat(np.nan, col_length, axis =0)

df["column B"], df["column C"] = np.where( df["column A"] == 2 ,[fours, eights], [empties, empties])
print(df)

OUTPUT:

   column A  column B  column C
0         1       NaN       NaN
1         1       NaN       NaN
2         1       NaN       NaN
3         2       4.0       8.0
4         2       4.0       8.0
5         2       4.0       8.0
6         3       NaN       NaN
7         3       NaN       NaN
8         3       NaN       NaN

CodePudding user response:

You are almost there! It is just a matter of "broadcasting".

You could use any of the problems suggested by the others. Or use your same concept but reshape the inputs a little.

Like this:

# Reshape the condition, then transpose the output.
df["column B"], df["column C"] = np.where( np.array(df["column A"] == 2).reshape(-1,1) ,['4', '8'],['NaN', 'NaN']).T

Or like this:

# Or just reshape the lists
df["column B"], df["column C"] = np.where( df["column A"] == 2 ,np.array(['4', '8']).reshape(-1,1),np.array(['NaN', 'NaN']).reshape(-1,1))

Output:

    column A    column B    column C
0   1   NaN NaN
1   1   NaN NaN
2   1   NaN NaN
3   2   4   8
4   2   4   8
5   2   4   8
6   3   NaN NaN
7   3   NaN NaN
8   3   NaN NaN

You can check numpy documentation on broadcasting to get the idea: https://numpy.org/doc/stable/user/basics.broadcasting.html

  • Related