I have a file.txt whose content is something like:
5th => fifth
av, ave, avn => avenue
91st => ninety first
ny => new york
91st => ninety first
nrth => north
91st => ninety first
nrth => northwest
It has approximately 1,500 lines. There are repetitions: multiple transformations for the same word. I don't care much which one we choose, as long as we choose consistently.
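For reference, here is a minimal sketch (not my actual code) of how such a file can be parsed into a dict; the load_substitutions helper name and the comment handling are just illustrative:

import re

# Illustrative helper: parse file.txt into a word -> replacement dict.
def load_substitutions(path="file.txt"):
    mapping = {}
    with open(path) as f:
        for line in f:
            if "=>" not in line:  # skip blank lines and comments
                continue
            left, right = line.split("=>", 1)
            for word in left.split(","):
                # setdefault keeps the first mapping seen, so repeated words stay consistent
                mapping.setdefault(word.strip(), right.strip())
    return mapping

# For the file above this gives:
# {'5th': 'fifth', 'av': 'avenue', 'ave': 'avenue', 'avn': 'avenue',
#  '91st': 'ninety first', 'ny': 'new york', 'nrth': 'north'}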
My data frame will have a column with strings. The objective is, for each string in that column, to transform the string using the information in the file above. For example:
"5th with 91st av nrth, seattle, ny,united states"
transforms into
"fifth with ninety first avenue north, seattle, new york,united states"
Here's a way to create a minimal working example (MWE) for the data frame:
import pandas as pd

size = 60000
df = pd.DataFrame({
    "address": [f"row: {i} 5th with 91st av nrth, seattle, ny,united states" for i in range(size)],
    "index": [i for i in range(size)],
})
I've tried two solutions.
The first one:
import re

def string_substitution(df: pd.Series):
    with open("file.txt") as f:
        file_substitutions = f.read().splitlines()
    word_regex = re.compile(r"\w+")
    string_list = []
    for row in df.index:
        string = df.loc[row]
        words = [match.group() for match in word_regex.finditer(string)]
        substitution_set = set()
        # look up each word in the txt file
        for word in words:
            df_regex = re.compile(r"\b" + word + r"\b")
            substitution_regex = re.compile(r"(?==>(.*))")
            for line in file_substitutions:
                if df_regex.search(line) is not None:
                    # print(f"line: {line} ------------------ \n")
                    substitution_string = substitution_regex.findall(line)
                    if substitution_string != []:
                        substitution_string = substitution_string[0]
                    else:
                        # line from file_substitutions is a comment, so we break
                        break
                    # print(f"word: {word}, sub: {substitution_string} \n")
                    substitution_string = substitution_string.lstrip()
                    substitution_set.add((word, substitution_string))
                    # with this break we stop on the first match
                    break
        # print(substitution_set)
        # print(string)
        for word, substitution in substitution_set:
            df_regex = re.compile(r"\b" + word + r"\b")
            string = df_regex.sub(substitution, string)
        string_list.append(string)
    return string_list
This function would be called like: df["address"] = string_substitution(df["address"]).
It takes more than a minute for a 60,000-row dataframe.
In my second solution, I tried to divide the dataframe into smaller subsets and pass them to string_substitution, using:
from concurrent.futures import ThreadPoolExecutor

# this snippet lives inside a class, hence the self.* attributes
with open("file.txt") as f:
    file_substitutions = f.read().splitlines()  # we only open the file once

buckets = [
    df.iloc[
        pos:pos + self.bucket_size,
        df.columns.get_loc(self.target_column)
    ] for pos in range(0, df.shape[0], self.bucket_size)
]
df_list = []
with ThreadPoolExecutor() as pool:
    for results in pool.map(self._synonym_substitution_3, buckets):
        df_list.append(results)
df[self.target_column] = pd.concat(df_list, ignore_index=True)
This is even worse...
My objective is a solution that runs on the example dataframe within a few seconds (less than 10 seconds if possible), not in over a minute as it currently does...
CodePudding user response:
Here is a regex solution; it runs in ~800 ms for 60k rows and 7 replacement values:
import re

words = pd.read_csv('file.txt', sep=r'\s*=>\s*',
                    engine='python', names=['word', 'repl'])

mapper = (words
          .assign(word=words['word'].str.split(r',\s*'))
          .explode('word')
          .drop_duplicates('word')
          .set_index('word')['repl']
          )

regex = '|'.join(map(re.escape, mapper.index))
# '5th|av|ave|avn|91st|ny|nrth'

df['address'] = df['address'].str.replace(regex, lambda m: mapper.get(m.group()), regex=True)
output:
address index
0 row: 0 fifth with ninety first avenue north, ... 0
1 row: 1 fifth with ninety first avenue north, ... 1
2 row: 2 fifth with ninety first avenue north, ... 2
3 row: 3 fifth with ninety first avenue north, ... 3
...
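One caveat with a plain alternation: branches are tried left to right, so if the mapping ever contains overlapping keys (e.g. av and ave) the shorter one can win, and nothing prevents matches inside longer words. A slightly more defensive variant of the same idea, sorting the keys by length and adding word boundaries (using the same mapper as above), would be:

# longest keys first, wrapped in word boundaries so partial words don't match
regex = r'\b(?:' + '|'.join(
    map(re.escape, sorted(mapper.index, key=len, reverse=True))
) + r')\b'
df['address'] = df['address'].str.replace(regex, lambda m: mapper.get(m.group()), regex=True)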