I want to count how often a regex-expression (prior and ensuing characters are needed to identify the pattern) occurs in multiple dataframe columns. I found a solution which seems a litte slow. Is there a more sophisticated way?
column_A | column_B | column_C |
---|---|---|
Test • test abc | winter • sun | snow rain blank |
blabla • summer abc | break • Data | test letter • stop. |
So far I created a solution which is slow:
print(df["column_A"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() df["column_B"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() df["column_C"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum())
CodePudding user response:
You can use list comprehension
and re.search
. You can reduce 938 µs
to 26.7 µs
. (make sure don't create list
and use generator
)
res = sum(sum(True for item in df[col] if re.search("(?<=[A-Za-z]) • (?=[A-Za-z])", item))
for col in ['column_A', 'column_B','column_C'])
print(res)
# 5
Benchmark:
%%timeit
sum(sum(True for item in df[col] if re.search("(?<=[A-Za-z]) • (?=[A-Za-z])", item)) for col in ['column_A', 'column_B','column_C'])
# 26 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df["column_A"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() df["column_B"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() df["column_C"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()
# 938 µs ± 149 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# --------------------------------------------------------------------#
CodePudding user response:
The str.count
should be able to apply to the whole dataframe without hard coding this way. Try
sum(df.apply(lambda x: x.str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()))
I have tried with 1000 * 1000 dataframes. Here is a benchmark for your reference.
%timeit sum(df.apply(lambda x: x.str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()))
1.97 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)