Best way to execute multiple lines of pandas in parallel? (Speed up)


Basically, I am performing a simple operation to update 100 columns of my dataframe, which has 550 rows and 2700 columns.

I am updating 100 columns like this:

df["col1"] = df["static"]-df["col1"])/df["col1"]*100
df["col2"] = (df["static"]-df["col2"])/df["col2"]*100
df["col3"] = (df["static"]-df["col3"])/df["col3"]*100
....
....
df["col100"] = (df["static"]-df["col100"])/df["col100"]*100

This operation takes 170 ms on my original dataframe, and I want to speed it up. It is part of a real-time workflow, so time is important.

CodePudding user response:

You can select all 100 columns at once and vectorize the operation: subtract them from df['static'] with DataFrame.rsub, divide by the same columns with DataFrame.div, and multiply by 100 with DataFrame.mul, restricting the selection to the column list cols:

cols = [f'col{c}' for c in range(1, 101)]
df[cols] = df[cols].rsub(df['static'], axis=0).div(df[cols], axis=0).mul(100)
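
As a quick sanity check (a minimal sketch, not part of the original answer, assuming a dataframe with a 'static' column plus col1..col100), you can verify that the vectorized version produces the same values as the per-column loop:

import numpy as np
import pandas as pd

# Small test frame: one 'static' column plus col1..col100 (values start at 1 to avoid division by zero)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 1001, size=(550, 101)),
                  columns=['static'] + [f'col{i}' for i in range(1, 101)])

# Per-column loop (the approach from the question)
expected = df.copy()
for i in range(1, 101):
    expected[f"col{i}"] = (expected["static"] - expected[f"col{i}"]) / expected[f"col{i}"] * 100

# Vectorized version from the answer
cols = [f'col{c}' for c in range(1, 101)]
result = df.copy()
result[cols] = result[cols].rsub(result['static'], axis=0).div(result[cols], axis=0).mul(100)

pd.testing.assert_frame_equal(result, expected)  # passes: both approaches give identical values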

Performance:

import numpy as np
import pandas as pd

np.random.seed(2022)

df = pd.DataFrame(np.random.randint(1001, size=(550, 2700))).add_prefix('col')
df = df.rename(columns={'col0': 'static'})


In [58]: %%timeit
    ...: for i in range(1, 101):
    ...:     df[f"col{i}"] = (df["static"]-df[f"col{i}"])/df[f"col{i}"]*100
    ...:     
59.9 ms ± 630 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [59]: %%timeit
    ...: cols = [f'col{c}' for c in range(1, 101)]
    ...: df[cols] = df[cols].rsub(df['static'], axis=0).div(df[cols], axis=0).mul(100)
    ...: 
11.9 ms ± 55.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
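
If this is still not fast enough for the real-time use case, a further option (a rough sketch under the same setup, not part of the original answer) is to do the arithmetic directly on the underlying NumPy arrays and write the whole block back in a single assignment, which skips some of pandas' alignment overhead:

# Sketch: operate on raw NumPy arrays, then assign the block back in one go.
# Assumes the same `cols` list and dataframe layout as above.
vals = df[cols].to_numpy(dtype='float64')         # (550, 100) block of column values
static = df['static'].to_numpy(dtype='float64')   # (550,) static column
df[cols] = (static[:, None] - vals) / vals * 100  # row-wise broadcast, matches the loop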