Polars apply performance for custom functions

Time:12-10

I've seen significant speed-ups with Polars over Pandas, except in one case. I'm new to Polars, so I may simply be using it wrong. Here is a toy example: I need to apply a custom function to a single column. In my case it is parse from the probablepeople library (https://github.com/datamade/probablepeople), but the problem is generic.

Plain pandas apply has about the same runtime as Polars, but pandas with parallel_apply from pandarallel (https://github.com/nalepae/pandarallel) gets a speed-up proportional to the number of cores.

It looks to me like Polars uses only a single core for custom functions, or am I missing something?

If I am using Polars correctly, would it be possible to create a tool like pandarallel for Polars?

!pip install probablepeople
!pip install pandarallel

import pandas as pd
import probablepeople as pp
import polars as pl
from pandarallel import pandarallel

AMOUNT = 1_000_000
# Pandas:
df = pd.DataFrame({'a': ["Mr. Joe Smith"]})
df = df.loc[df.index.repeat(AMOUNT)].reset_index(drop=True)

df['b'] = df['a'].apply(pp.parse)

# Pandarallel:
pandarallel.initialize(progress_bar=True)
df['b_multi'] = df['a'].parallel_apply(pp.parse)

# Polars:
dfp = pl.DataFrame({'a': ["Mr. Joe Smith"]})
dfp = dfp.select(pl.all().repeat_by(AMOUNT).explode())

# Note: in Polars 0.19 and later, this expression method is named map_elements.
dfp = dfp.with_columns(pl.col('a').apply(pp.parse).alias('b'))


CodePudding user response:

It looks like pandarallel uses multiprocessing.

joblib's Parallel() and delayed() simplify multiprocessing and provide progress output via verbose=N.

n_jobs=-1 uses all cores.

import polars as pl
import probablepeople as pp
from joblib import Parallel, delayed

AMOUNT = 1_000_000

dfp = pl.DataFrame({'a': ["Mr. Joe Smith"]})
dfp = dfp.select(pl.all().repeat_by(AMOUNT).explode())

# Run pp.parse over column "a" on all cores, then attach the results as column "b".
dfp = dfp.with_columns(
    pl.Series(
        Parallel(n_jobs=-1, verbose=9)(
            delayed(pp.parse)(name) for name in dfp.get_column("a")
        )
    )
    .alias("b")
)

Without joblib, your example runs in about 5 minutes on my machine. With joblib, it takes about 1m10s.

A function could look like:

def parallel_apply(func, args):
    # Run func on every element of args across all cores; return the results as a Series.
    executor = Parallel(n_jobs=-1, verbose=9)
    return pl.Series(
        executor(delayed(func)(arg) for arg in args)
    )

dfp = dfp.with_columns(
    parallel_apply(pp.parse, dfp.get_column("a"))
    .alias("b")
)
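
If you would rather not add joblib as a dependency, roughly the same idea can be sketched with the standard library's concurrent.futures. This is only a minimal sketch, not how pandarallel or joblib work internally. The names process_apply and parse_name are hypothetical: parse_name is a cheap stand-in for pp.parse so the snippet runs without probablepeople installed. The function passed in has to be defined at module level so it can be pickled, and the call should sit under an `if __name__ == "__main__":` guard on platforms that spawn worker processes.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def parse_name(name):
    """Hypothetical stand-in for an expensive per-row function such as pp.parse."""
    return name.replace(".", "").split()


def process_apply(func, values, chunksize=1_000, max_workers=None):
    """Apply func to every value using a pool of worker processes.

    Results are returned as a list in input order. chunksize groups the
    values into batches so each trip to a worker carries many items
    instead of a single one.
    """
    workers = max_workers or os.cpu_count()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, values, chunksize=chunksize))


if __name__ == "__main__":
    names = ["Mr. Joe Smith"] * 10_000
    parsed = process_apply(parse_name, names)
    print(len(parsed), parsed[0])
```

The returned list could then be wrapped in pl.Series(...) and aliased, the same way the joblib result is above. The chunksize value here is an arbitrary assumption; a good setting depends on how expensive each call is.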