I have a file.txt whose content is something like:
5th => fifth
av, ave, avn => avenue
91st => ninety first
ny => new york
91st => ninety first
nrth => north
91st => ninety first
nrth => northwest
It has approximately 1,500 lines. There are repetitions: multiple transformations for the same word. I don't care much which one we choose, as long as we choose consistently.
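For reference, here is a minimal sketch (not my actual code) of how such a file can be parsed into a dict; the load_substitutions helper name and the comment handling are just illustrative:

import re

# Illustrative helper: parse file.txt into a word -> replacement dict.
def load_substitutions(path="file.txt"):
    mapping = {}
    with open(path) as f:
        for line in f:
            if "=>" not in line:  # skip blank lines and comments
                continue
            left, right = line.split("=>", 1)
            for word in left.split(","):
                # setdefault keeps the first mapping seen, so repeated words stay consistent
                mapping.setdefault(word.strip(), right.strip())
    return mapping

# For the file above this gives:
# {'5th': 'fifth', 'av': 'avenue', 'ave': 'avenue', 'avn': 'avenue',
#  '91st': 'ninety first', 'ny': 'new york', 'nrth': 'north'}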
My data frame will have a column with strings. The objective is, for each string in that column, to transform the string using the information in the file above. For example:
"5th with 91st av nrth, seattle, ny,united states"
transforms into
"fifth with ninety first avenue north, seattle, new york,united states"
Here's a way to create a minimal working example (MWE) for the data frame:
import pandas as pd

size = 60000
df = pd.DataFrame({
    "address": [f"row: {i} 5th with 91st av nrth, seattle, ny,united states" for i in range(size)],
    "index": [i for i in range(size)],
})
I've tried two solutions.
The first one:
import re

def string_substitution(df: pd.Series):
    with open("file.txt") as f:
        file_substitutions = f.read().splitlines()
    word_regex = re.compile(r"\w+")
    string_list = []
    for row in df.index:
        string = df.loc[row]
        words = [match.group() for match in word_regex.finditer(string)]
        substitution_set = set()
        # look up each word in the txt file
        for word in words:
            df_regex = re.compile(r"\b" + word + r"\b")
            substitution_regex = re.compile(r"(?==>(.*))")
            for line in file_substitutions:
                if df_regex.search(line) is not None:
                    # print(f"line: {line} ------------------ \n")
                    substitution_string = substitution_regex.findall(line)
                    if substitution_string != []:
                        substitution_string = substitution_string[0]
                    else:
                        # line from file_substitutions is a comment, so we break
                        break
                    # print(f"word: {word}, sub: {substitution_string} \n")
                    substitution_string = substitution_string.lstrip()
                    substitution_set.add((word, substitution_string))
                    # with this break we stop on the first match
                    break
        # print(substitution_set)
        # print(string)
        for word, substitution in substitution_set:
            df_regex = re.compile(r"\b" + word + r"\b")
            string = df_regex.sub(substitution, string)
        string_list.append(string)
    return string_list
This function would be called like: df["address"] = string_substitution(df["address"]).
It takes more than a minute for a 60,000-row dataframe.
In my second solution, I tried to divide the dataframe into smaller subsets and pass them to string_substitution, using:
from concurrent.futures import ThreadPoolExecutor

# this snippet lives inside a class, hence the self.* attributes
with open("file.txt") as f:
    file_substitutions = f.read().splitlines()  # we only open the file once

buckets = [
    df.iloc[
        pos:pos + self.bucket_size,
        df.columns.get_loc(self.target_column)
    ] for pos in range(0, df.shape[0], self.bucket_size)
]
df_list = []
with ThreadPoolExecutor() as pool:
    for results in pool.map(self._synonym_substitution_3, buckets):
        df_list.append(results)
df[self.target_column] = pd.concat(df_list, ignore_index=True)
This is even worse...
My objective is a solution that runs on the example dataframe within a few seconds (less than 10 seconds if possible), not in over a minute as it currently does...
CodePudding user response:
Here is a regex solution; it runs in ~800 ms for 60k rows and 7 replacement values:
import re

words = pd.read_csv('file.txt', sep=r'\s*=>\s*',
                    engine='python', names=['word', 'repl'])

mapper = (words
          .assign(word=words['word'].str.split(r',\s*'))
          .explode('word')
          .drop_duplicates('word')
          .set_index('word')['repl']
          )

regex = '|'.join(map(re.escape, mapper.index))
# '5th|av|ave|avn|91st|ny|nrth'

df['address'] = df['address'].str.replace(regex, lambda m: mapper.get(m.group()), regex=True)
output:
address index
0 row: 0 fifth with ninety first avenue north, ... 0
1 row: 1 fifth with ninety first avenue north, ... 1
2 row: 2 fifth with ninety first avenue north, ... 2
3 row: 3 fifth with ninety first avenue north, ... 3
...
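One caveat with a plain alternation: branches are tried left to right, so if the mapping ever contains overlapping keys (e.g. av and ave) the shorter one can win, and nothing prevents matches inside longer words. A slightly more defensive variant of the same idea, sorting the keys by length and adding word boundaries (using the same mapper as above), would be:

# longest keys first, wrapped in word boundaries so partial words don't match
regex = r'\b(?:' + '|'.join(
    map(re.escape, sorted(mapper.index, key=len, reverse=True))
) + r')\b'
df['address'] = df['address'].str.replace(regex, lambda m: mapper.get(m.group()), regex=True)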