I ran the following in a Jupyter Notebook and was disappointed that the equivalent Pandas code was faster. Hoping someone can show a smarter approach in Polars.
POLARS VERSION

import re

def cleanse_text(sentence):
    RIGHT_QUOTE = r"(\u2019)"
    sentence = re.sub(RIGHT_QUOTE, "'", sentence)
    sentence = re.sub(r" ", " ", sentence)
    return sentence.strip()

df = df.with_columns(pl.col("text").apply(lambda x: cleanse_text(x)).keep_name())
PANDAS VERSION

import re

def cleanse_text(sentence):
    RIGHT_QUOTE = r"(\u2019)"
    sentence = re.sub(RIGHT_QUOTE, "'", sentence)
    sentence = re.sub(r" ", " ", sentence)
    return sentence.strip()

df["text"] = df["text"].apply(lambda x: cleanse_text(x))
The above Pandas version was 10% faster than the Polars version when I ran this on a DataFrame with 750,000 rows of text.
CodePudding user response:
Instead of combining Series.apply with re.sub, you can chain 2 instances of Series.str.replace in this case, and finally add Series.str.strip. This will be faster generally (see the end of this answer as to why), but particularly so for polars.
Pandas version

import pandas as pd

t = "'Hello World\u2019 "
df = pd.DataFrame({'text': [t]*750000})

df['text'] = (df['text']
              .str.replace('\u2019', "'", regex=True)
              .str.replace(' ', ' ', regex=True)
              .str.strip())

df.head()
text
0 'Hello World'
1 'Hello World'
2 'Hello World'
3 'Hello World'
4 'Hello World'
Polars version

import polars as pl

t = "'Hello World\u2019 "
df_pl = pl.DataFrame({'text': [t]*750000})

df_pl = (df_pl
         .with_column(pl.col('text')
                      .str.replace('\u2019', "'")
                      .str.replace(' ', ' ')
                      .str.strip()))

df_pl.head()
┌───────────────┐
│ text │
│ --- │
│ str │
╞═══════════════╡
│ 'Hello World' │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 'Hello World' │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 'Hello World' │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 'Hello World' │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 'Hello World' │
└───────────────┘
Performance comparison

Results of a timeit test for each method (resulting dfs checked for equality):
method timeit (s) perc
0 pandas_new 1.092429 1.000000
1 pandas_old 1.553892 1.422419
2 polars_new 0.151107 0.138322
3 polars_old 1.851840 1.695158
As you can see, both new methods for pandas and polars are faster than the original methods, and the polars method is the clear winner, taking only 13.8% of the time of the new pandas method.
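For reference, a timing harness along these lines could be set up with timeit.repeat. This is a minimal sketch showing only the two pandas variants (the function names and row count are illustrative, not the exact benchmark used above):

```python
import re
import timeit

import pandas as pd

t = "'Hello World\u2019 "
df = pd.DataFrame({"text": [t] * 100_000})

def cleanse_text(sentence):
    sentence = re.sub(r"(\u2019)", "'", sentence)
    return sentence.strip()

def pandas_old():
    # one Python function call per row
    return df["text"].apply(cleanse_text)

def pandas_new():
    # vectorized string methods over the whole Series
    return (df["text"]
            .str.replace("\u2019", "'", regex=True)
            .str.strip())

for fn in (pandas_old, pandas_new):
    best = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{fn.__name__}: {best:.3f}s")

# sanity check: both methods produce identical output
assert pandas_old().equals(pandas_new())
```

Taking the minimum over several repeats reduces noise from other processes; checking the outputs for equality ensures the methods being compared actually do the same work.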
So, why is Series.str.replace (or: str.strip) so much faster than Series.apply? The reason is that the former performs an operation on an entire Series (i.e. a "column") all at once ("vectorization"), while the latter calls a Python function for each element in the Series separately. E.g. lambda x: cleanse_text(x) means: apply a UDF (user-defined function) to the 1st element in the column, then the 2nd element, and so on. On larger sets, this makes a huge difference. Cf. also the documentation for pl.DataFrame.apply.