Pandas - why looping with str.extract is faster than one str.extractall

I have a dataframe (~100k rows) with strings that I need to extract multiple items from and create new columns for.

Sample Data

import pandas as pd
import re

s = pd.Series(['param1=1&param2=&param3=ena&param4=n2oi3-284&',
            'param1=2&param2=2iot&param3=&param4=&',
            'param1=3&param2=afv&param3=39&param4=4obgg942n&',
            'param1=4&param2=&param3=1291&param4=0g2n48a&'])

I can use the regex re.compile(r"=(.*?)&") with str.extractall, then unstack the resulting dataframe, select, and append the columns I want.

match	0	1	2	3
0	1	NaN	ena	n2oi3-284
1	2	2iot	NaN	NaN
2	3	afv	39	4obgg942n
3	4	NaN	1291	0g2n48a
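For reference, here is a minimal sketch of the extractall-then-unstack step that produces a table like the one above. The `param1`..`param4` column names are an assumption: they only line up if every row lists its parameters in the same order.

```python
import re
import pandas as pd

s = pd.Series(['param1=1&param2=&param3=ena&param4=n2oi3-284&',
               'param1=2&param2=2iot&param3=&param4=&',
               'param1=3&param2=afv&param3=39&param4=4obgg942n&',
               'param1=4&param2=&param3=1291&param4=0g2n48a&'])

rx = re.compile(r"=(.*?)&")

# extractall returns one row per match, indexed by (original row, match number).
matches = s.str.extractall(rx)

# Column 0 holds the single capture group; unstacking the "match" level
# turns the match numbers 0..3 into columns.
wide = matches[0].unstack("match")

# Assumed labels: match number i corresponds to param{i + 1}.
wide.columns = [f"param{i + 1}" for i in wide.columns]
print(wide)
```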

But when I tested it, it was slower than creating a unique regex for each parameter in a dictionary, e.g., r"1=(.*?)&", then looping through that dictionary and using column assignment for each regex.

params = {'param1': re.compile(r"1=(.*?)&"),
        'param2': re.compile(r"2=(.*?)&"),
        'param3': re.compile(r"3=(.*?)&"),
        'param4': re.compile(r"4=(.*?)&")}

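# df was not defined in the original snippet; assume an empty frame aligned with s
df = pd.DataFrame(index=s.index)

for k, rx in params.items():
    df[k] = s.str.extract(rx, expand=False)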

When I used %%timeit, it appears that looping through the dictionary of regexes and creating a new column for each is quicker than using str.extractall (disregarding any sort of column assignment / resulting dataframe manipulation).

rx = re.compile(r"=(.*?)&")

%%timeit -n 100
s.str.extractall(rx)

2.56 ms ± 407 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit -n 100
for rx in params.values():
    s.str.extract(rx, expand=False)

791 µs ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
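With only four rows, these numbers may mostly reflect per-call overhead rather than per-row work. Below is a sketch of the same comparison outside a notebook, using the standard `timeit` module on a larger series. The repeat factor of 1000 and the helper names `run_extractall` and `run_extract_loop` are arbitrary choices for illustration, and the timings will vary by machine and pandas version.

```python
import re
import timeit
import pandas as pd

base = ['param1=1&param2=&param3=ena&param4=n2oi3-284&',
        'param1=2&param2=2iot&param3=&param4=&',
        'param1=3&param2=afv&param3=39&param4=4obgg942n&',
        'param1=4&param2=&param3=1291&param4=0g2n48a&']

# Repeat the sample so per-row cost outweighs fixed overhead (size is arbitrary).
big = pd.Series(base * 1000)

rx_all = re.compile(r"=(.*?)&")
params = {'param1': re.compile(r"1=(.*?)&"),
          'param2': re.compile(r"2=(.*?)&"),
          'param3': re.compile(r"3=(.*?)&"),
          'param4': re.compile(r"4=(.*?)&")}

def run_extractall():
    # One call that collects every match in every row.
    return big.str.extractall(rx_all)

def run_extract_loop():
    # Four calls, each keeping only the first match per row.
    return {k: big.str.extract(rx, expand=False) for k, rx in params.items()}

runs = 5
t_all = timeit.timeit(run_extractall, number=runs) / runs
t_loop = timeit.timeit(run_extract_loop, number=runs) / runs
print(f"extractall: {t_all * 1e3:.2f} ms per call")
print(f"extract x4: {t_loop * 1e3:.2f} ms per call")
```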

Why is this? Am I incorrectly timing the functions / comparing different things? Shouldn't one pass over the column be quicker than iterating over the column 4 times?

Documentation for str.extract and str.extractall doesn't say anything about this. Looking at the source code for extractall versus extract, I can't identify why one is quicker than the other.

Thanks!

CodePudding user response:

According to the source code, extractall works with Python lists and appends every match to them, which can be slow when it happens frequently.
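To make the difference concrete, here is a rough plain-Python imitation of the per-row work each method has to do. This is not the pandas source: `extractall_like` and `extract_like` are hypothetical helpers written only for illustration. One collects every match per row into growing lists and then builds a MultiIndex, the other keeps the first match only.

```python
import re
import pandas as pd

s = pd.Series(['param1=1&param2=&param3=ena&param4=n2oi3-284&',
               'param1=2&param2=2iot&param3=&param4=&',
               'param1=3&param2=afv&param3=39&param4=4obgg942n&',
               'param1=4&param2=&param3=1291&param4=0g2n48a&'])

def extractall_like(series, rx):
    """Collect every match of every row, then build a (row, match) index."""
    values, keys = [], []
    for key, text in series.items():
        for i, value in enumerate(rx.findall(text)):
            # Treat empty captures as missing, as in the table in the question.
            values.append(value if value != "" else None)
            keys.append((key, i))
    index = pd.MultiIndex.from_tuples(keys, names=[series.index.name, "match"])
    return pd.Series(values, index=index)

def extract_like(series, rx):
    """Keep only the first match of each row; the result has the same index."""
    out = []
    for text in series:
        m = rx.search(text)  # stops at the first match
        out.append(m.group(1) if m else None)
    return pd.Series(out, index=series.index)

every_match = extractall_like(s, re.compile(r"=(.*?)&"))
first_param3 = extract_like(s, re.compile(r"3=(.*?)&"))
print(len(every_match), len(first_param3))
```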

extractall has to stay dynamic because it captures every possible match, so it also iterates through the whole string, whereas extract returns at the first match. If you only care about the first match of each parameter, you can also use a single extract call, maybe like this:

s.str.extract(re.compile(r"param1=(.*?)&*param2=(.*?)&*param3=(.*?)&*param4=(.*?)&"))

So maybe extract is better suited for your use case after all?
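A possible variant of the single-pattern idea uses named groups, so that extract labels the columns itself. This is a sketch that assumes all four parameters are always present and always in this order. Masking empty strings to NaN is an extra step added here so the result matches the table in the question.

```python
import pandas as pd

s = pd.Series(['param1=1&param2=&param3=ena&param4=n2oi3-284&',
               'param1=2&param2=2iot&param3=&param4=&',
               'param1=3&param2=afv&param3=39&param4=4obgg942n&',
               'param1=4&param2=&param3=1291&param4=0g2n48a&'])

# One named group per parameter; str.extract uses the group names as column names.
pattern = (r"param1=(?P<param1>.*?)&"
           r"param2=(?P<param2>.*?)&"
           r"param3=(?P<param3>.*?)&"
           r"param4=(?P<param4>.*?)&")

wide = s.str.extract(pattern)

# Empty captures may come back as empty strings; mask them to NaN.
wide = wide.mask(wide == "")
print(wide)
```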