Specifying word boundaries for multiple string replacement with regex?

I'm trying to mask city names in a list of texts using 'PAddress' tags. To do this, I borrowed thejonny's solution for performing multiple regex substitutions using a dictionary with regexes as keys (linked in the code comments below). In my implementation, the cities are the keys and the values are tags that mirror the exact format of the keys (this is important because the format must be preserved down the line). E.g., {East-Barrington: PAddress-PAddress}, so East-Barrington would be replaced by PAddress-PAddress: one tag per word, with punctuation and spacing preserved. Below is my code; sub_mult_regex() is the helper function called by mask_multiword_cities().
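
The key-to-value mapping described above can be illustrated with a small sketch. `make_tag_value` is a hypothetical helper name used only for this illustration; it replaces each run of word characters with the tag and leaves punctuation and spacing alone.

```python
import re


def make_tag_value(key, tag="PAddress"):
    # Replace every run of word characters with the tag; hyphens,
    # periods and spaces between the words are left untouched.
    return re.sub(r"\w+", tag, key)


print(make_tag_value("East-Barrington"))  # PAddress-PAddress
print(make_tag_value("68 Oak St."))       # PAddress PAddress PAddress.
```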

import re
import difflib

def sub_mult_regex(text, keys, tag_type):
    '''
    Replaces/masks multiple words at once
    Parameters:
        text: TIU note
        keys: a list of words to be replaced by the regex
        tag_type: string you want the words to be replaced with
    Creates a replacement dictionary of keys and values 
    (values are the length of the key, preserving formatting).
    Eg., {68 Oak St., PAddress PAddress PAddress.,}
    Returns text with relevant text masked
    '''
    # Creating a list of values to correspond with keys (see key:value example in docstring)

    add_vals = []
    for val in keys:
        add_vals.append(re.sub(r'\w{1,100}', tag_type, val)) # To preserve the precise punctuation, etc. formatting of the keys, only replacing word matches with tags

    # Zipping keys and values together as dictionary
    add_dict = dict(zip(keys, add_vals))

    # Compiling the keys together (regex)
    add_subs = re.compile("|".join("(" + key + ")" for key in add_dict), re.IGNORECASE)

    # This is where the multiple substitutions are happening
    # Taken from: https://stackoverflow.com/questions/66270091/multiple-regex-substitutions-using-a-dict-with-regex-expressions-as-keys
    group_index = 1
    indexed_subs = {}
    for target, sub in add_dict.items():
        indexed_subs[group_index] = sub
        group_index += re.compile(target).groups + 1
    if len(indexed_subs) > 0:
        text_sub = re.sub(add_subs, lambda match: indexed_subs[match.lastindex], text) # text_sub is masked 
    else:
        text_sub = text # Not all texts have names, so text_sub would've been NoneType and broken funct otherwise

    # Information on what words were changed pre and post masking (eg., would return 'ANN ARBOR' if that city was masked here)

    case_a = text
    case_b = text_sub

    diff_list = [li for li in difflib.ndiff(case_a.split(), case_b.split()) if li[0] != ' ']
    diff_list = [re.sub(r'[-,]', "", term.strip()) for term in diff_list if '-' in term]

    return text_sub, diff_list 

 

def mask_multiword_cities(text_string):
    multi_word_cities = list(set([city for city in us_cities_all if len(city.split(' ')) > 1 and len(city) > 3 and "Mc" not in city and "State" not in city and city != 'Mary D']))
    return sub_mult_regex(text_string, multi_word_cities, "PAddress")

The problem is that the keys in the regex dictionary don't have word boundaries specified. Only exact matches should be tagged (case-insensitively), but a phrase like 'around others' gets tagged because the city 'Round O' is technically a substring of it. Take this example text, run through the mask_multiword_cities function:

add_string = "The cities are Round O NJ , and around others"

mask_multiword_cities(add_string)

#(output): ('The cities are PAddress PAddress NJ , and aPAddress PAddressthers', [' Round', ' O', ' around', ' others'])

The output should only be ('The cities are PAddress PAddress NJ , and around others', [' Round', ' O']). I've tried converting each key to a regex like r"\b(?=\w)key\b(?!\w)" at various points in the sub_mult_regex function (lines 26 and 37), but that didn't work as expected.
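
The substring problem and the effect of a word boundary can be reproduced in isolation. This is only a minimal sketch with a hard-coded pattern and a made-up sample sentence, not the full dictionary-based function:

```python
import re

text = "Round O , NJ and around others"

# Without boundaries the city also matches inside "around others".
plain = re.compile(r"Round O", re.IGNORECASE)
# With \b on both sides, only the standalone city name matches.
bounded = re.compile(r"\bRound O\b", re.IGNORECASE)

print([m.group() for m in plain.finditer(text)])    # ['Round O', 'round o']
print([m.group() for m in bounded.finditer(text)])  # ['Round O']
```

Note that `\b` only helps when the key starts and ends with a word character; a key such as "Oak St." ends in punctuation, so a trailing `\b` would not behave as intended there.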

For testing, assume that: us_cities_all = ['Great Barrington', 'Round O', 'East Orange'].

Also, if anyone can help make this run faster, that would be great! Right now it takes about 30 seconds to run on a 1,000-word note, likely because us_cities_all contains 5,000 cities. Let me know if it would be more helpful to post the cities list directly; I wasn't sure how to do so.
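
One possible direction for both the boundary issue and the speed concern, sketched under assumptions: build a single alternation once, escape each city with `re.escape`, and compute the replacement from the matched text itself so that no group-index bookkeeping is needed. `build_city_pattern` and `mask_cities` are hypothetical names, and this has not been benchmarked against the 5,000-city list.

```python
import re


def build_city_pattern(cities):
    # Longest names first so that a longer city wins over a shorter
    # one that happens to be its prefix.
    ordered = sorted(set(cities), key=len, reverse=True)
    alternation = "|".join(re.escape(city) for city in ordered)
    # (?<!\w) and (?!\w) act like \b but also work when a name starts
    # or ends with punctuation.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


def mask_cities(text, pattern, tag="PAddress"):
    masked_terms = []

    def replace(match):
        masked_terms.append(match.group())
        # One tag per word; punctuation and spacing are kept.
        return re.sub(r"\w+", tag, match.group())

    return pattern.sub(replace, text), masked_terms


us_cities_all = ["Great Barrington", "Round O", "East Orange"]
pattern = build_city_pattern(us_cities_all)  # compile once, reuse for every note
print(mask_cities("The cities are Round O NJ , and around others", pattern))
# ('The cities are PAddress PAddress NJ , and around others', ['Round O'])
```

Compiling the pattern once outside the per-note call avoids recompiling every key on each invocation, which the original helper does inside its loop.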

CodePudding user response：

You can extract the matching words separately and recombine them afterwards. I have added example code based on your cases. Note that it will fail if the words in add_string are not separated by spaces.

example code:

import re


# replace the string
def replacer(string, noise_list):
    for v in noise_list:
        string = string.replace(v, "PAddress")
    return string


def multi_mask(multi_word_cities, add_string):
    for city in multi_word_cities:
        if city in add_string:
            city_data = city.split()
            add_string_split = add_string.split()
            matched_city_data = [i for i in add_string_split if any((j == i) for j in city_data)]
            city_index = add_string_split.index(matched_city_data[1])
            new_string = ' '.join(add_string_split[:city_index + 1])
            replaced_data = replacer(new_string, matched_city_data)
            capital_string = ''.join(re.findall(r'[A-Z]{2}', add_string))
            index_of_and = add_string_split.index("and")
            text_after_and = ' '.join(add_string_split[index_of_and:])
            return replaced_data + ' ' + capital_string, text_after_and, matched_city_data


us_cities_all = ['Great Barrington', 'Round O', 'East Orange']
multi_word_cities = list(set([city for city in us_cities_all if len(city.split(' ')) > 1 and len(
    city) > 3 and "Mc" not in city and "State" not in city and city != 'Mary D']))
add_string = "The hospital is in East Orange and around o"

print(multi_mask(multi_word_cities, add_string))

>>> ('The hospital is in PAddress PAddress ', 'and around o', ['East', 'Orange'])

CodePudding user response：

I figured out a word-boundary based solution that would handle multiple cities, in case anyone might find it helpful in a similar situation:

def sub_mult_regex(text, keys, tag_type, city):
    '''
    Replaces/masks multiple words at once
    Parameters:
        text: TIU note
        keys: a list of words to be replaced by the regex
        tag_type: string you want the words to be replaced with
        city: bool, True if replacing cities, False if replacing anything else

    Creates a replacement dictionary of keys and values 
    (values are the length of the key, preserving formatting).

    Eg., {68 Oak St, PAddress PAddress PAddress}

    Returns text with relevant text masked
    '''

    # Creating a list of values to correspond with keys (see key:value example in docstring)

    if city:
        # If we're masking a city, handle word boundaries
        # This step of only including keys if they show up in the text speeds the code up by a lot, since it's not cross-referencing against thousands of cities, only the ones present
        keys = [r"\b" + key + r"\b" for key in keys if key in text or key.upper() in text] # add word boundaries for each key in list
        add_vals = []
        for val in keys:
            # Create dictionary of city word:PAddress by splitting the city on the '\\b' char that remains and then adding one tag per word
            # Ex: '\\bDeer Island\\b' --> split('\\b') --> ['', 'Deer Island', ''] --> ''.join --> (key) Deer Island : (value) PAddress PAddress
            add_vals.append(re.sub(r'\w{1,100}', tag_type, ''.join(val.split('\\b')))) # To preserve the precise punctuation, etc. formatting of the keys, only replacing word matches with tags
        add_vals = [re.sub(r'\\b', "", val) for val in add_vals]

    elif not city:
        # If we're not masking a city, we don't do the word boundary step
        add_vals = []
        for val in keys:
            add_vals.append(re.sub(r'\w{1,100}', tag_type, val)) # To preserve the precise punctuation, etc. formatting of the keys, only replacing word matches with tags

    # Zipping keys and values together as dictionary
    add_dict = dict(zip(keys, add_vals))
    print("add_dict:", add_dict)

    # Compiling the keys together (regex)
    add_subs = re.compile("|".join("(" + key + ")" for key in add_dict), re.IGNORECASE)

    # This is where the multiple substitutions are happening
    # Taken from: https://stackoverflow.com/questions/66270091/multiple-regex-substitutions-using-a-dict-with-regex-expressions-as-keys

    group_index = 1
    indexed_subs = {}
    for target, sub in add_dict.items():
        indexed_subs[group_index] = sub
        group_index += re.compile(target).groups + 1
    if len(indexed_subs) > 0:
        text_sub = re.sub(add_subs, lambda match: indexed_subs[match.lastindex], text) # text_sub is masked text

    else:
        text_sub = text # Not all texts have names, so text_sub would've been NoneType and broken funct otherwise

    # Information on what words were changed pre and post masking (eg., would return 'ANN ARBOR' if that city was masked here)

    case_a = text
    case_b = text_sub

    diff_list = [li for li in difflib.ndiff(case_a.split(), case_b.split()) if li[0] != ' ']
    diff_list = [re.sub(r'[-,]', "", term.strip()) for term in diff_list if '-' in term]

 
    return text_sub, diff_list

# sample call:
add_string = 'The cities are Round O NJ, around others and East Orange'
mask_multiword_cities(add_string) # same as before, except that it must now pass city=True to sub_mult_regex

# output:
# add_dict: {'\\bEast Orange\\b': 'PAddress PAddress', '\\bRound O\\b': 'PAddress PAddress'}
# ('The cities are PAddress PAddress NJ, around others and PAddress PAddress', [' Round', ' O', ' East', ' Orange'])
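
A caveat on the pre-filter in the answer above: `key in text or key.upper() in text` only catches the original casing and all-caps, while the compiled regex is case-insensitive, so a city written as "round o" would be filtered out before the regex ever sees it. A minimal sketch of a case-insensitive pre-filter follows; `cities_present` is a hypothetical helper name, not part of the posted code.

```python
def cities_present(text, cities):
    # Lowercase the note once, then test each city in lowercase too,
    # so any mix of upper and lower case passes the filter.
    lowered = text.lower()
    return [city for city in cities if city.lower() in lowered]


us_cities_all = ["Great Barrington", "Round O", "East Orange"]
print(cities_present("we drove through round o and East Orange", us_cities_all))
# ['Round O', 'East Orange']
```

This is still a plain substring test, so "around others" would also let 'Round O' through the filter; the word-boundary regex applied afterwards is what rejects it.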