Replacing acronyms with their full forms in Python

I have an acronym dictionary whose keys are acronyms and whose values are the corresponding full forms.

I want to replace the acronyms found in text_list with their full forms to arrive at output_list:

acronym_dict = {
    'QUO': 'Quotation',
    'IN': 'India',
    'SW': 'Software',
    'RE': 'Regular Expression'
}

text_list = [
    'The status QUO has changed', 'I SWEAR, This is part of IN_SW',
    'The update does not belong to the SW, version, branch',
    'This is a RE_Text'
]

output_list = [
    'The status Quotation has changed',
    'I SWEAR, This is part of India_Software',
    'The update does not belong to the Software, version, branch',
    'This is a Regular Expression_Text'
]

I wrote a method to do that:

import string
def remove_punctuations(text):
    punct_str = string.punctuation  # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    for punctuation in punct_str:
        text = text.replace(punctuation, ' ')
    return text.strip()

def replace_single_acronym(text, acronym, fullform):
    words = text.split()
    return_words = []
    for w in words:
        if remove_punctuations(w).lower() == acronym.lower():
            return_words.append(w.replace(acronym, fullform))
        else:
            return_words.append(w)
    return " ".join(return_words)

my_op_list = []
for text in text_list:
    for acronym in acronym_dict.keys():
        text = replace_single_acronym(text, acronym, acronym_dict[acronym])
    my_op_list.append(text)

Ideally, output_list and my_op_list should be identical, but the code produces the result below (failing in 2 instances):

['The status Quotation has changed',
 'I SWEAR, This is part of IN_SW',
 'The update does not belong to the Software, version, branch',
 'This is a RE_Text']

Also, replace_single_acronym is very slow on a corpus of 1000 text_list items.

Can someone help me adjust the method so that it is both correct and efficient?

CodePudding user response:

You can use re.sub for this task, passing a function as its second argument, as follows:

import re
acronym_dict = {
    'QUO': 'Quotation',
    'IN': 'India',
    'SW': 'Software',
    'RE': 'Regular Expression'
}

text_list = [
    'The status QUO has changed', 'I SWEAR, This is part of IN_SW',
    'The update does not belong to the SW, version, branch',
    'This is a RE_Text'
]
def get_full_name(m):
    return acronym_dict.get(m.group(1),m.group(1))
def replace_acronyms(text):
    return re.sub(r'(?<![A-Z])([A-Z]+)(?![A-Z])', get_full_name, text)
output_list = [replace_acronyms(i) for i in text_list]
print(output_list)

output:

['The status Quotation has changed', 'I SWEAR, This is part of India_Software', 'The update does not belong to the Software, version, branch', 'This is a Regular Expression_Text']

Explanation: the pattern consists of two zero-length assertions and one capturing group. It finds one or more uppercase ASCII letters that are not preceded by an uppercase ASCII letter (negative lookbehind) and not followed by an uppercase ASCII letter (negative lookahead). get_full_name is passed as the second argument of re.sub, so it accepts a single argument, the match object. m.group(1) is the content of the sole capturing group in the pattern, i.e. the candidate acronym. I used the dict's .get method so that if the acronym is present among the dict keys, the corresponding value is used; otherwise the matched text is returned unchanged.
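
For a larger corpus, a possible variation on the same idea is to build the pattern from the dictionary keys and compile it once, so each text is scanned in a single pass instead of once per acronym. The sketch below is only an illustration under some assumptions: the names ACRONYM_RE and expand_acronyms are hypothetical, matching is case-sensitive, and an acronym is assumed to count only when it is not directly adjacent to an ASCII letter or digit (so the underscore in IN_SW acts as a separator, unlike with \b).

```python
import re

acronym_dict = {
    'QUO': 'Quotation',
    'IN': 'India',
    'SW': 'Software',
    'RE': 'Regular Expression',
}

# Longest keys first, so that if one acronym were a prefix of another,
# the longer alternative is tried first.
_keys = sorted(acronym_dict, key=len, reverse=True)

# Compiled once and reused for every text. The lookarounds reject matches
# that touch an ASCII letter or digit; '_' is deliberately not excluded.
ACRONYM_RE = re.compile(
    r'(?<![A-Za-z0-9])('
    + '|'.join(re.escape(k) for k in _keys)
    + r')(?![A-Za-z0-9])'
)


def expand_acronyms(text):
    """Replace every known acronym in text with its full form."""
    # Only dictionary keys can match, so a direct lookup is safe here.
    return ACRONYM_RE.sub(lambda m: acronym_dict[m.group(1)], text)
```

It would be used the same way as replace_acronyms, e.g. [expand_acronyms(t) for t in text_list]. Because only known acronyms can match, words such as SWEAR are never even considered, which may matter if the dictionary is small relative to the text.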