I want to make several changes to some rows of a pandas DataFrame. The rows to change are selected based on the contents of other columns. The dataset is large, and every solution I have found so far is very slow.
The following toy code illustrates the problem:
import pandas as pd

def change1(s):
    if s['a'] == 1:
        s[['b', 'c']] = s[['c', 'b']].values
    return s

def change2(s):
    s[['b', 'c']] = s[['c', 'b']].values
    return s

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
print('original:')
print(df)

df = df.apply(change1, axis=1)
print('change1:')
print(df)

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df.loc[df['a'] == 1, :] = df.loc[df['a'] == 1, :].apply(change2, axis=1)
print('change2:')
print(df)
My questions are:
- Why does the second strategy (change2) not work, while the first one does?
- What would be a more correct, and faster, way to do this?
CodePudding user response:
Found a better solution:
df = df.where(df['a'] != 1, change2, axis=1)
That one was fast enough. Case closed.
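One caveat for anyone copying this: when `other` is a callable, `DataFrame.where` calls it once on the whole DataFrame, not row by row, and the pandas docs say the callable must not change its input. `change2` assigns into its argument, so the result may depend on that side effect. Below is a minimal sketch of a variant that leaves the input untouched. Using `rename` to express the swap is my own assumption, not something from the post above.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# Build the "swapped" frame without touching df: renaming b->c and c->b
# returns a new DataFrame whose labels are exchanged.
def swapped_labels(frame):
    return frame.rename(columns={'b': 'c', 'c': 'b'})

# Rows where a != 1 keep their values; the other rows take values from the
# swapped frame, which pandas aligns with df by column label.
result = df.where(df['a'] != 1, swapped_labels)
print(result)
```

Because nothing is mutated, `df` keeps its original values and only `result` contains the swap.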
CodePudding user response:
Why not:
df.loc[df['a']==1, ['b','c']] = df.loc[df['a']==1,['c','b']].values
change2 doesn't work because df.loc[df['a']==1,:] is a slice of df selected by the condition df['a']==1, returned as a view, so when you pull ['b','c'] from this slice, you get a copy, and the assignment has no effect on the original df.
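A small runnable sketch of the one-liner above, using the toy data from the question. It also shows why the trailing `.values` (or `.to_numpy()`) matters: when the right-hand side is a DataFrame, `.loc` assignment aligns on column labels first, so the swap silently does nothing. The helper `make_df` and the variable names are just for illustration.

```python
import pandas as pd

def make_df():
    # Toy data from the question.
    return pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# Without .to_numpy(): the right-hand side is a DataFrame, so pandas aligns
# it by column label before assigning, and b/c end up unchanged.
aligned = make_df()
mask = aligned['a'] == 1
aligned.loc[mask, ['b', 'c']] = aligned.loc[mask, ['c', 'b']]

# With .to_numpy(): the right-hand side is a plain array, so values are
# assigned by position and the two columns are swapped in the masked rows.
swapped = make_df()
mask = swapped['a'] == 1
swapped.loc[mask, ['b', 'c']] = swapped.loc[mask, ['c', 'b']].to_numpy()

print(aligned)
print(swapped)
```

Computing the mask once and reusing it also avoids evaluating the condition twice, which helps a little on a large frame.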