I am selecting multiple rows based on a condition, and updating values in multiple columns. This works unless one of the values is a list.
First, a dataframe:
>>> dummy_data = {'A': ['abc', 'def', 'def'],
'B': ['red', 'purple', 'blue'],
'C': [25, 94, 57],
'D': [False, False, False],
'E': [[9,8,12], [36,72,4], [18,3,5]]}
>>> df = pd.DataFrame(dummy_data)
A B C D E
0 abc red 25 False [9, 8, 12]
1 def purple 94 False [36, 72, 4]
2 def blue 57 False [18, 3, 5]
Things that work:
This works to select multiple rows and update multiple columns:
>>> df.loc[df['A'] == 'def', ['B', 'C', 'D']] = ['orange', 42, True]
A B C D E
0 abc red 25 False [9, 8, 12]
1 def orange 42 True [36, 72, 4]
2 def orange 42 True [18, 3, 5]
This works to update column E with a new list:
>>> new_list = [1,2,3]
>>> df.loc[df['A'] == 'def', ['E']] = pd.Series([new_list] * len(df))
A B C D E
0 abc red 25 False [9, 8, 12]
1 def purple 94 False [1, 2, 3]
2 def blue 57 False [1, 2, 3]
But how to do both?
I can't figure out an elegant way to combine these approaches.
Attempt 1 This works, but I get the ndarray from ragged nested sequences warning:
>>> new_list = [1,2,3]
>>> updates = ['orange', 42, new_list]
>>> num_rows = df.A.eq('def').sum()
>>> df.loc[df['A'] == 'def', ['B', 'C', 'E']] = [updates] * num_rows
VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences ...
A B C D E
0 abc red 25 False [9, 8, 12]
1 def orange 42 False [1, 2, 3]
2 def orange 42 False [1, 2, 3]
Attempt 2 This works, but seems overly complicated:
>>> new_list = [1,2,3]
>>> updates = ['orange', 42, new_list]
>>> num_rows = df.A.eq('def').sum()
>>> df2 = pd.DataFrame([updates] * num_rows)
>>> df.loc[df['A'] == 'def', ['B', 'C', 'E']] = df2[[0, 1, 2]].values
A B C D E
0 abc red 25 False [9, 8, 12]
1 def orange 42 False [1, 2, 3]
2 def orange 42 False [1, 2, 3]
CodePudding user response:
You can use pandas.DataFrame
to assign/align the new values with the selected columns with the help of a boolean mask.
mask = df['A'] == 'def'
cols = ['B', 'C', 'D', 'E']
new_list = [1,2,3]
updates = ['orange', 42, True, [new_list]]
df.loc[mask, cols] = pd.DataFrame(dict(zip(cols, updates)), index=df.index)
>>> print(df)
A B C D E
0 abc red 25 False [9, 8, 12]
1 def orange 42 True [1, 2, 3]
2 def orange 42 True [1, 2, 3]
[Finished in 589ms]
CodePudding user response:
Create a numpy array with object dtype:
df.loc[df['A'] == 'def', ['B', 'C', 'E']] = np.array([updates] * num_rows, dtype='object')
Output:
A B C D E
0 abc red 25 False [9, 8, 12]
1 def orange 42 False [1, 2, 3]
2 def orange 42 False [1, 2, 3]
However, as commented, [updates] * num_rows
is a dangerous operation. For example, later you want to modify one of the array value:
df.iloc[-1,-1].append(4)
Then your data becomes (notice the change in row 1 as well):
A B C D E
0 abc red 25 False [9, 8, 12]
1 def orange 42 False [1, 2, 3, 4]
2 def orange 42 False [1, 2, 3, 4]