I have:
A location list of about 40k bigram/trigram words.
['San Francisco CA', 'Oakland CA', 'San Diego CA',...]
A Pandas DataFrame with millions of rows.
string_column | string_column_location_removed
---|---
Burger King Oakland CA | Burger King
Walmart Walnut Creek CA | Walmart
I'm currently looping through the list of locations; if a location exists in string_column, I create a new column string_column_location_removed with the location removed.
Here is my attempt. While it works, it's very slow. Any ideas on how to speed it up?
I've tried taking ideas from this and this, but I'm not sure how to apply them to a Pandas DataFrame.
from random import choice
from string import ascii_lowercase, digits
import re
import pandas as pd

# making a random list here
chars = ascii_lowercase + digits
locations_lookup_list = [''.join(choice(chars) for _ in range(10)) for _ in range(40000)]
locations_lookup_list.append('Walnut Creek CA')
locations_lookup_list.append('Oakland CA')
strings_for_df = ["Burger King Oakland CA", "Walmart Walnut Creek CA",
                  "Random Other Thing Here", "Another random other thing here",
                  "Really Appreciate the help on this", "Thank you so Much!"] * 250000
df = pd.DataFrame(strings_for_df, columns=["string_column"])

def location_remove(txnString):
    for locationString in locations_lookup_list:
        if re.search(f'\\b{locationString}\\b', txnString):
            return re.sub(f'\\b{locationString}\\b', '', txnString)
    return txnString  # no location matched; keep the string unchanged

df['string_column_location_removed'] = df['string_column'].apply(location_remove)
CodePudding user response:
Use trrex; it builds a pattern equivalent to the one described in this resource (in fact, it was inspired by that answer):
from random import choice
from string import ascii_lowercase, digits
import pandas as pd
import trrex as tx
# making random list here
chars = ascii_lowercase + digits
locations_lookup_list = [''.join(choice(chars) for _ in range(10)) for _ in range(40000)]
locations_lookup_list.append('Walnut Creek CA')
locations_lookup_list.append('Oakland CA')
strings_for_df = ["Burger King Oakland CA", "Walmart Walnut Creek CA",
"Random Other Thing Here", "Another random other thing here", "Really Appreciate the help on this",
"Thank you so Much!"] * 250000
df = pd.DataFrame(strings_for_df, columns=["string_column"])
pattern = tx.make(locations_lookup_list, suffix="", prefix="")
df["string_column_location_removed"] = df["string_column"].str.replace(pattern, "", regex=True)
print(df)
Output
string_column string_column_location_removed
0 Burger King Oakland CA Burger King
1 Walmart Walnut Creek CA Walmart
2 Random Other Thing Here Random Other Thing Here
3 Another random other thing here Another random other thing here
4 Really Appreciate the help on this Really Appreciate the help on this
... ... ...
1499995 Walmart Walnut Creek CA Walmart
1499996 Random Other Thing Here Random Other Thing Here
1499997 Another random other thing here Another random other thing here
1499998 Really Appreciate the help on this Really Appreciate the help on this
1499999 Thank you so Much! Thank you so Much!
[1500000 rows x 2 columns]
Timing (one run of str.replace):
%timeit df["string_column"].str.replace(pattern, "", regex=True)
8.84 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The timing does not include the time needed for building the pattern.
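If you prefer not to add a dependency, a rough fallback is to join the locations into a single alternation pattern with plain re, sorting longer phrases first so overlapping entries match correctly. This is a hedged sketch (the variable names here are illustrative, not from the original post); trrex's trie-structured pattern is typically much faster than a flat alternation over 40k terms.

```python
import re

# Illustrative subset of the 40k-entry location list
locations = ['Walnut Creek CA', 'Oakland CA', 'San Diego CA']

# Escape each phrase and sort longest-first so e.g. a longer phrase
# is not shadowed by a shorter overlapping one in the alternation.
pattern = r'\b(?:' + '|'.join(
    map(re.escape, sorted(locations, key=len, reverse=True))
) + r')\b'

# Remove the matched location and trim the leftover whitespace
cleaned = re.sub(pattern, '', 'Burger King Oakland CA').strip()
```

The same pattern string can be passed to df["string_column"].str.replace(pattern, "", regex=True), just like the trrex pattern above.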
DISCLAIMER: I'm the author of trrex.