I am trying to set 2 columns by a condition on a 3rd column. I can set 1 column conditions on another column, and I can set 2 columns on a single condition value, but when I try to set 2 columns by a condition on a column, it fails.
Here is the code example:
import pandas as pd
import numpy as np
AAA = {"column A": [1, 1, 1, 2, 2, 2, 3, 3, 3]}
df = pd.DataFrame(AAA)
If I call:
df["column B"], df["column C"] = np.where(True ,['4', '8'],['NaN', 'NaN'])
I get:
df
column A column B column C
0 1 4 8
1 1 4 8
2 1 4 8
3 2 4 8
4 2 4 8
5 2 4 8
6 3 4 8
7 3 4 8
8 3 4 8
so I know I can fill 2 columns based on a condition.
If I call:
df["column B"] = np.where( df["column A"] == 2 ,['4'],['NaN'])
I get:
column A column B column C
0 1 NaN 8
1 1 NaN 8
2 1 NaN 8
3 2 4 8
4 2 4 8
5 2 4 8
6 3 NaN 8
7 3 NaN 8
8 3 NaN 8
so I know I can fill based on a condition on a column. (I assume this is treated as a boolean array)
However, If I call:
df["column B"], df["column C"] = np.where( df["column A"] == 2 ,['4', '8'],['NaN', 'NaN'])
I expect to get
column A column B column C
0 1 NaN NaN
1 1 NaN NaN
2 1 NaN NaN
3 2 4 8
4 2 4 8
5 2 4 8
6 3 NaN NaN
7 3 NaN NaN
8 3 NaN NaN
but I actually get:
Traceback (most recent call last):
... pydev\_pydevd_bundle\pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<string>", line 2, in <module>
File "<__array_function__ internals>", line 6, in where
ValueError: operands could not be broadcast together with shapes (9,) (2,) (2,)
Is there a way to do what I am trying to do? I don't want to use 2 separate calls, because the dataframes I need this for are very large.
CodePudding user response:
use loc
indexer and give value
df.loc[df['column A'] == 2, ['column B', 'column C']] = [4, 8]
output(df
):
column A column B column C
0 1 NaN NaN
1 1 NaN NaN
2 1 NaN NaN
3 2 4.0 8.0
4 2 4.0 8.0
5 2 4.0 8.0
6 3 NaN NaN
7 3 NaN NaN
8 3 NaN NaN
CodePudding user response:
Maybe you can loop outside np.where
:
df["column B"], df["column C"] = [np.where( df["column A"] == 2 ,true_val,'NaN') for true_val in ['4','8']]
print(df)
# column A column B column C
# 0 1 NaN NaN
# 1 1 NaN NaN
# 2 1 NaN NaN
# 3 2 4 8
# 4 2 4 8
# 5 2 4 8
# 6 3 NaN NaN
# 7 3 NaN NaN
# 8 3 NaN NaN
CodePudding user response:
Here is one way. While it is not the most elegant code - it should help you understand what is required.
import pandas as pd
import numpy as np
AAA={"column A": [1, 1, 1, 2, 2, 2, 3, 3, 3]}
df = pd.DataFrame(AAA)
col_length = len(df['column A'])
fours = np.repeat(4, col_length, axis =0)
eights = np.repeat(8, col_length, axis =0)
empties = np.repeat(np.nan, col_length, axis =0)
df["column B"], df["column C"] = np.where( df["column A"] == 2 ,[fours, eights], [empties, empties])
print(df)
OUTPUT:
column A column B column C
0 1 NaN NaN
1 1 NaN NaN
2 1 NaN NaN
3 2 4.0 8.0
4 2 4.0 8.0
5 2 4.0 8.0
6 3 NaN NaN
7 3 NaN NaN
8 3 NaN NaN
CodePudding user response:
You are almost there! It is just a matter of "broadcasting".
You could use any of the problems suggested by the others. Or use your same concept but reshape the inputs a little.
Like this:
# Reshape the condition, then transpose the output.
df["column B"], df["column C"] = np.where( np.array(df["column A"] == 2).reshape(-1,1) ,['4', '8'],['NaN', 'NaN']).T
Or like this:
# Or just reshape the lists
df["column B"], df["column C"] = np.where( df["column A"] == 2 ,np.array(['4', '8']).reshape(-1,1),np.array(['NaN', 'NaN']).reshape(-1,1))
Output:
column A column B column C
0 1 NaN NaN
1 1 NaN NaN
2 1 NaN NaN
3 2 4 8
4 2 4 8
5 2 4 8
6 3 NaN NaN
7 3 NaN NaN
8 3 NaN NaN
You can check numpy documentation on broadcasting to get the idea: https://numpy.org/doc/stable/user/basics.broadcasting.html