I have a dataframe (~100k rows) with strings that I need to extract multiple items from and create new columns for.
Sample Data
import pandas as pd
import re
s = pd.Series(['param1=1&param2=&param3=ena&param4=n2oi3-284&',
               'param1=2&param2=2iot&param3=&param4=&',
               'param1=3&param2=afv&param3=39&param4=4obgg942n&',
               'param1=4&param2=&param3=1291&param4=0g2n48a&'])
I can use the regex re.compile(r"=(.*?)&") with str.extractall, then unstack the resulting dataframe, select the columns I want, and append them.
match | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 1 | NaN | ena | n2oi3-284 |
1 | 2 | 2iot | NaN | NaN |
2 | 3 | afv | 39 | 4obgg942n |
3 | 4 | NaN | 1291 | 0g2n48a |
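For reference, the extractall-then-unstack step described above could look roughly like the sketch below. The names `matches` and `wide` and the `param1`..`param4` column labels are illustrative assumptions, not part of the original code.

```python
import pandas as pd

s = pd.Series(['param1=1&param2=&param3=ena&param4=n2oi3-284&',
               'param1=2&param2=2iot&param3=&param4=&',
               'param1=3&param2=afv&param3=39&param4=4obgg942n&',
               'param1=4&param2=&param3=1291&param4=0g2n48a&'])

# extractall returns one row per match, indexed by (original row, match number),
# with a single column 0 for the single capture group.
matches = s.str.extractall(r"=(.*?)&")

# Pivot the match number into columns: one row per input string.
wide = matches[0].unstack("match")

# Hypothetical column labels; this assumes the parameters always appear in order.
wide.columns = [f"param{i + 1}" for i in wide.columns]
print(wide)
```

The positional labelling only holds if every string lists the parameters in the same order, which is true for the sample data but is an assumption for the full dataset.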
But when I tested it, it was slower than creating a dictionary with one regex per parameter, e.g., r"1=(.*?)&", then looping through that dictionary and assigning a column for each regex.
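df = pd.DataFrame(index=s.index)  # target frame (not defined in the snippet as first posted)
params = {'param1': re.compile(r"1=(.*?)&"),
          'param2': re.compile(r"2=(.*?)&"),
          'param3': re.compile(r"3=(.*?)&"),
          'param4': re.compile(r"4=(.*?)&")}
# One extract call per parameter; each returns a Series holding the first match.
for k, rx in params.items():
    df[k] = s.str.extract(rx, expand=False)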
Timing both with %%timeit suggests that looping through the dictionary of regexes and calling str.extract for each is quicker than a single str.extractall call (disregarding any column assignment or manipulation of the resulting dataframe).
rx = re.compile(r"=(.*?)&")
%%timeit -n 100
s.str.extractall(rx)
2.56 ms ± 407 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
for rx in params.values():
    s.str.extract(rx, expand=False)
791 µs ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
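Since the timings above come from a four-row Series, fixed per-call overhead may dominate them. One way to check is to repeat the comparison on a larger Series with the standard `timeit` module. This is only a sketch: `s_big`, `run_extractall`, `run_extract_loop` and the 2,000-row size are arbitrary choices made for illustration, and no particular result is implied.

```python
import timeit
import pandas as pd

base = ['param1=1&param2=&param3=ena&param4=n2oi3-284&',
        'param1=2&param2=2iot&param3=&param4=&',
        'param1=3&param2=afv&param3=39&param4=4obgg942n&',
        'param1=4&param2=&param3=1291&param4=0g2n48a&']
s_big = pd.Series(base * 500)  # 2,000 rows; increase toward the real size as needed

single = r"=(.*?)&"
per_param = {f"param{i}": rf"{i}=(.*?)&" for i in range(1, 5)}

def run_extractall():
    # One call that collects every match in every string.
    return s_big.str.extractall(single)

def run_extract_loop():
    # Four calls, each keeping only the first match of its own pattern.
    return {k: s_big.str.extract(rx, expand=False) for k, rx in per_param.items()}

# Best of 3 repeats, 5 calls each, reported per call in seconds.
t_all = min(timeit.repeat(run_extractall, number=5, repeat=3)) / 5
t_loop = min(timeit.repeat(run_extract_loop, number=5, repeat=3)) / 5
print(f"extractall:   {t_all * 1e3:.2f} ms per call")
print(f"extract loop: {t_loop * 1e3:.2f} ms per call")
```

Like the original measurement, this times only the extraction itself, not the unstack or the column assignment.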
Why is this? Am I incorrectly timing the functions / comparing different things? Shouldn't one pass over the column be quicker than iterating over the column 4 times?
Documentation for str.extract and str.extractall doesn't say anything about this. Looking at the source code for extractall versus extract, I can't identify why one is quicker than the other.
Thanks!
CodePudding user response:
According to the source code, extractall builds Python lists and appends to them for every match, which can be slow when it happens often. extractall also has to stay dynamic: it captures every possible match and therefore scans each whole string, whereas extract returns on the first match. If you only care about the first match, you can just use extract with a single pattern, maybe like this:
s.str.extract(re.compile(r"param1=(.*?)&*param2=(.*?)&*param3=(.*?)&*param4=(.*?)&"))
So maybe extract is better suited for your use case after all?
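A variant of the single-pattern idea uses named groups, so that extract labels the columns directly. This is a minimal sketch, assuming the four parameters always appear in this order and are separated by a literal `&`. The group names are illustrative. As far as I can tell, extract leaves empty captures as empty strings, so the sketch converts them to missing values explicitly.

```python
import pandas as pd

s = pd.Series(['param1=1&param2=&param3=ena&param4=n2oi3-284&',
               'param1=2&param2=2iot&param3=&param4=&',
               'param1=3&param2=afv&param3=39&param4=4obgg942n&',
               'param1=4&param2=&param3=1291&param4=0g2n48a&'])

# One pattern with a named group per parameter; the names become column labels.
pattern = (r"param1=(?P<param1>.*?)&"
           r"param2=(?P<param2>.*?)&"
           r"param3=(?P<param3>.*?)&"
           r"param4=(?P<param4>.*?)&")

df = s.str.extract(pattern)

# Treat empty captures as missing values.
df = df.replace("", pd.NA)
print(df)
```

A string that does not match the full pattern yields a row of missing values, so this approach is stricter than extracting each parameter separately.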