Pandas - why looping with str.extract is faster than one str.extractall

Time:12-08

I have a dataframe (~100k rows) with strings that I need to extract multiple items from and create new columns for.

Sample Data

import pandas as pd
import re

s = pd.Series(['param1=1&param2=&param3=ena&param4=n2oi3-284&',
            'param1=2&param2=2iot&param3=&param4=&',
            'param1=3&param2=afv&param3=39&param4=4obgg942n&',
            'param1=4&param2=&param3=1291&param4=0g2n48a&'])

I can use the regex re.compile(r"=(.*?)&") with str.extractall, then unstack the resulting DataFrame and select and append the columns I want.

match  0     1     2          3
0      1   NaN   ena  n2oi3-284
1      2  2iot   NaN        NaN
2      3   afv    39  4obgg942n
3      4   NaN  1291    0g2n48a
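
For reference, the extractall-then-unstack step described above might look roughly like the sketch below. The names `matches` and `wide` and the `param1`..`param4` column labels are my own choices for illustration, and the pattern is passed as a plain string rather than a compiled regex.

```python
import pandas as pd

s = pd.Series(['param1=1&param2=&param3=ena&param4=n2oi3-284&',
               'param1=2&param2=2iot&param3=&param4=&',
               'param1=3&param2=afv&param3=39&param4=4obgg942n&',
               'param1=4&param2=&param3=1291&param4=0g2n48a&'])

# One row per match, indexed by (original row label, match number)
matches = s.str.extractall(r"=(.*?)&")

# Move the "match" index level into columns: one column per parameter position
wide = matches[0].unstack("match")

# Hypothetical column names, assuming the parameters always appear in this order
wide.columns = [f"param{i + 1}" for i in wide.columns]
print(wide)
```

This relies on every row containing the same number of matches in the same order, which holds for the sample data but is an assumption about the real column.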

But when I tested it, it is slower than creating unique regexes in a dictionary for each parameter, e.g., r"1=(.*?)&", then looping through that dictionary and using column assignment for each regex.

params = {'param1': re.compile(r"1=(.*?)&"),
        'param2': re.compile(r"2=(.*?)&"),
        'param3': re.compile(r"3=(.*?)&"),
        'param4': re.compile(r"4=(.*?)&")}

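# Collect the extracted columns in a new DataFrame aligned with s
df = pd.DataFrame(index=s.index)

for k, rx in params.items():
    # expand=False returns a Series holding the first match per row
    df[k] = s.str.extract(rx, expand=False)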

When I used %%timeit, it appears that looping through the dictionary of regexes and creating a new column for each is quicker than using str.extractall (disregarding any sort of column assignment / resulting dataframe manipulation).

rx = re.compile(r"=(.*?)&")

%%timeit -n 100
s.str.extractall(rx)

2.56 ms ± 407 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit -n 100
for rx in params.values():
    s.str.extract(rx, expand=False)

791 µs ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Why is this? Am I incorrectly timing the functions / comparing different things? Shouldn't one pass over the column be quicker than iterating over the column 4 times?

Documentation for str.extract and str.extractall doesn't say anything about this. Looking at the source code for extractall versus extract, I can't identify why one is quicker than the other.
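
To check whether the comparison is fair at a size closer to the real data, the same two approaches can be timed outside IPython with the standard timeit module. This is only a sketch: the repeat factor of 500, the names `big`, `run_extractall` and `run_extract_loop`, and the choice of 5 runs are arbitrary assumptions, and the numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

sample = ['param1=1&param2=&param3=ena&param4=n2oi3-284&',
          'param1=2&param2=2iot&param3=&param4=&',
          'param1=3&param2=afv&param3=39&param4=4obgg942n&',
          'param1=4&param2=&param3=1291&param4=0g2n48a&']

# Assumption: repeating the four sample rows stands in for a larger column
big = pd.Series(sample * 500)

all_pattern = r"=(.*?)&"
per_param = {f"param{i}": rf"{i}=(.*?)&" for i in range(1, 5)}


def run_extractall():
    # Single call that collects every match in every row
    return big.str.extractall(all_pattern)


def run_extract_loop():
    # Four calls, each stopping at the first match per row
    return {k: big.str.extract(p, expand=False) for k, p in per_param.items()}


runs = 5
t_all = timeit.timeit(run_extractall, number=runs)
t_loop = timeit.timeit(run_extract_loop, number=runs)
print(f"extractall:   {t_all / runs:.4f} s per call")
print(f"extract loop: {t_loop / runs:.4f} s per call")
```

Timing both on a larger Series helps separate fixed per-call overhead from per-row cost, which four rows cannot show.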

Thanks!

CodePudding user response:

According to the source code, extractall builds its result by appending to Python lists, which can be slow when it happens frequently.

extractall has to stay general: it captures every possible match, so it also iterates through the whole string, whereas extract returns on the first match. If you only care about the first match, you can also just use a single extract call, maybe like this:

s.str.extract(re.compile(r"param1=(.*?)&*param2=(.*?)&*param3=(.*?)&*param4=(.*?)&"))
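
A variant of the same idea with named groups would give the columns their names directly. This is a sketch that assumes the four parameters always appear in this order. The names `named` and `cleaned` are mine, and the final step of turning empty captures into missing values is an optional assumption to mirror the NaN entries in the question's table.

```python
import pandas as pd

s = pd.Series(['param1=1&param2=&param3=ena&param4=n2oi3-284&',
               'param1=2&param2=2iot&param3=&param4=&',
               'param1=3&param2=afv&param3=39&param4=4obgg942n&',
               'param1=4&param2=&param3=1291&param4=0g2n48a&'])

# Named groups become the column names of the returned DataFrame
pattern = (r"param1=(?P<param1>.*?)&"
           r"param2=(?P<param2>.*?)&"
           r"param3=(?P<param3>.*?)&"
           r"param4=(?P<param4>.*?)&")

named = s.str.extract(pattern)

# Optional: treat empty captures as missing values
cleaned = named.mask(named == "")
print(cleaned)
```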

So maybe extract is better suited to your use case after all?
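
The difference in behaviour can be seen on a small made-up example (the two-row Series and the names `first` and `every` are just for illustration): extract gives one value per row, while extractall gives one row per match with an extra "match" index level.

```python
import pandas as pd

toy = pd.Series(["a=1&b=2&", "a=3&b=&"])
pattern = r"=(.*?)&"

# First match per row only: a Series with the same index as toy
first = toy.str.extract(pattern, expand=False)

# Every match in every row: a DataFrame indexed by (row label, match number)
every = toy.str.extractall(pattern)

print(first)
print(every)
```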
