Speeding up pandas multi-row assignment with loc()

I am trying to assign a value to a column for all rows selected by a condition. Solutions for achieving this are discussed in several questions like this one. The standard solution has the following syntax:

df.loc[row_mask, cols] = assigned_val
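For concreteness, here is a minimal sketch of that pattern on a toy frame. The column names "key" and "val" and the condition are made up for illustration; they are not from my real data.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({"key": [1, 5, 3, 5, 2], "val": [10, 20, 30, 40, 50]})

row_mask = df["key"] == 5   # boolean Series aligned with df's index
cols = "val"                # a single column label; a list of labels also works
assigned_val = -1

# Write assigned_val into `cols` for every row where row_mask is True.
df.loc[row_mask, cols] = assigned_val
print(df)
```

Rows where the mask is False keep their original values.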

Unfortunately, this standard solution takes forever. In fact, in my case, I could not get even a single assignment to complete.

Update: More info about my dataframe: it has ~2 million rows, and I am trying to update the value of one column for the rows selected by a condition. On average, the selection condition is satisfied by ~10 rows.

Is it possible to speed up this assignment operation? Also, are there any general guidelines for doing multiple assignments with pandas?

CodePudding user response：

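I believe the difference between .loc and .at is what you're looking for. .at is meant to be faster based on this answer. Note that .at reads or writes a single scalar addressed by one row label and one column label, so it cannot take a boolean mask directly.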
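As a rough sketch of how .at could be used here, assuming only a handful of rows match: turn the mask into row labels first, then set each cell individually. The toy frame and column names below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({"key": [1, 5, 3, 5, 2], "val": [10, 20, 30, 40, 50]})
row_mask = df["key"] == 5

# Convert the boolean mask into the index labels of the matching rows.
matching_labels = df.index[row_mask.to_numpy()]

# .at addresses exactly one cell: one row label plus one column label.
for label in matching_labels:
    df.at[label, "val"] = -1

print(df)
```

This loops in Python, so it only makes sense when the number of matching rows is small, and it assumes the index labels are unique. Computing the mask itself still scans the whole column.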

CodePudding user response：

You could give np.where a try.

Here is a simple example of np.where:

import pandas as pd
import numpy as np

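# 100 rows x 4 columns of random integers in [0, 100)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
# Where B < 50 write 100000; otherwise keep the existing value of B
df['B'] = np.where(df['B'] < 50, 100000, df['B'])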

The question "np.where() do nothing if condition fails" has another example.

In your case, it might be

df[col] = np.where(df[col] == row_condition, assigned_val, df[col])
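If you already have a boolean mask like the row_mask in your question, it can be passed to np.where as the condition directly. A small sketch with made-up column names and values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({"key": [1, 5, 3, 5, 2], "val": [10, 20, 30, 40, 50]})

row_mask = df["key"] == 5   # any boolean Series or array of the same length
col = "val"
assigned_val = -1

# Take assigned_val where the mask is True, the existing value elsewhere.
df[col] = np.where(row_mask, assigned_val, df[col])
print(df)
```

One thing to keep in mind is that np.where builds a complete new array for the column no matter how many rows match, so the whole column is rewritten even when only a few values change.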

I was thinking it might be a little quicker because it goes straight to NumPy rather than through the pandas indexing layer that sits on top of the underlying NumPy arrays. This article talks about pandas vs NumPy on large datasets: https://towardsdatascience.com/speed-testing-pandas-vs-numpy-ffbf80070ee7#:~:text=Numpy was faster than Pandas,exception of simple arithmetic operations.
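I have not measured this on your data, so rather than guess, here is a sketch you could run to time both approaches at roughly the scale you describe (2 million rows, 10 matches). The frame, the column names and the helper functions assign_with_loc and assign_with_where are all made up for the test.

```python
import timeit

import numpy as np
import pandas as pd

n_rows = 2_000_000

# Hypothetical data: each key value occurs exactly 10 times,
# so an equality test on "key" selects 10 rows.
base = pd.DataFrame({
    "key": np.arange(n_rows) % (n_rows // 10),
    "val": np.zeros(n_rows),
})

def assign_with_loc(df, mask):
    # Label-based masked assignment through pandas.
    df.loc[mask, "val"] = 1.0

def assign_with_where(df, mask):
    # Rebuild the whole column with NumPy and assign it back.
    df["val"] = np.where(mask, 1.0, df["val"])

updated_counts = {}
for fn in (assign_with_loc, assign_with_where):
    df = base.copy()                  # fresh copy so the runs are independent
    mask = df["key"] == 123
    seconds = timeit.timeit(lambda: fn(df, mask), number=5)
    updated_counts[fn.__name__] = int((df["val"] == 1.0).sum())
    print(f"{fn.__name__}: {seconds / 5:.4f} s per call")
```

The numbers will depend on your machine and pandas version. If neither variant is slow on a synthetic frame like this, the bottleneck in your case may be somewhere else, for example in how the condition is computed.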