I have an acronym dictionary whose keys are acronyms and whose values are the full forms.
I want to replace the acronyms found in text_list with their full forms to arrive at output_list:
acronym_dict = {
    'QUO': 'Quotation',
    'IN': 'India',
    'SW': 'Software',
    'RE': 'Regular Expression'
}
text_list = [
    'The status QUO has changed', 'I SWEAR, This is part of IN_SW',
    'The update does not belong to the SW, version, branch',
    'This is a RE_Text'
]
output_list = [
    'The status Quotation has changed',
    'I SWEAR, This is part of India_Software',
    'The update does not belong to the Software, version, branch',
    'This is a Regular Expression_Text'
]
I wrote a method to do that:
import string

def remove_punctuations(text):
    punct_str = string.punctuation  # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    for punctuation in punct_str:
        text = text.replace(punctuation, ' ')
    return text.strip()

def replace_single_acronym(text, acronym, fullform):
    words = text.split()
    return_words = []
    for w in words:
        if remove_punctuations(w).lower() == acronym.lower():
            return_words.append(w.replace(acronym, fullform))
        else:
            return_words.append(w)
    return " ".join(return_words)

my_op_list = []
for text in text_list:
    for acronym in acronym_dict.keys():
        text = replace_single_acronym(text, acronym, acronym_dict[acronym])
    my_op_list.append(text)
Ideally, output_list and my_op_list should be identical, but my code prints the result below (it fails in two cases):
['The status Quotation has changed',
'I SWEAR, This is part of IN_SW',
'The update does not belong to the Software, version, branch',
'This is a RE_Text']
Also, the method replace_single_acronym is very slow on a corpus of 1000 text_list items.
Can someone help me adjust the method so that it is both correct and efficient?
Answer:
You can use re.sub for this task, passing a function as the second argument, as follows:
import re

acronym_dict = {
    'QUO': 'Quotation',
    'IN': 'India',
    'SW': 'Software',
    'RE': 'Regular Expression'
}
text_list = [
    'The status QUO has changed', 'I SWEAR, This is part of IN_SW',
    'The update does not belong to the SW, version, branch',
    'This is a RE_Text'
]

def get_full_name(m):
    # Return the full form if the matched run is a known acronym, else the run itself
    return acronym_dict.get(m.group(1), m.group(1))

def replace_acronyms(text):
    # One or more uppercase letters, not adjacent to another uppercase letter
    return re.sub(r'(?<![A-Z])([A-Z]+)(?![A-Z])', get_full_name, text)

output_list = [replace_acronyms(i) for i in text_list]
print(output_list)
output:
['The status Quotation has changed', 'I SWEAR, This is part of India_Software', 'The update does not belong to the Software, version, branch', 'This is a Regular Expression_Text']
Explanation: the pattern contains two zero-length assertions and one capturing group. It matches one or more uppercase ASCII letters that are not preceded by an uppercase ASCII letter (negative lookbehind) and not followed by one (negative lookahead). get_full_name is the function passed as the second argument of re.sub, so it accepts a single argument, the match object. m.group(1) is the content of the sole capturing group in the pattern, i.e. the candidate acronym. I used the dict's .get method so that if the acronym is among the dictionary keys, the corresponding value is used; otherwise the matched text is returned unchanged.
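As a possible variant for a larger corpus, the pattern can be built from the dictionary keys and compiled once, so that only known acronyms are ever matched. This is a minimal sketch, assuming the acronyms are uppercase ASCII and matching should be case-sensitive; the names ACRONYM_RE and expand_acronyms are made up for illustration. Note that \b would not help here, because the underscore counts as a word character and IN_SW would be treated as a single word.

```python
import re

acronym_dict = {
    'QUO': 'Quotation',
    'IN': 'India',
    'SW': 'Software',
    'RE': 'Regular Expression',
}

# Build one alternation from the keys; longer keys go first so that a longer
# acronym is tried before a shorter one that happens to be its prefix.
alternation = '|'.join(
    re.escape(key) for key in sorted(acronym_dict, key=len, reverse=True)
)

# Compile once. The lookarounds reject acronyms that are part of a longer
# run of uppercase letters (for example the SW inside SWEAR).
ACRONYM_RE = re.compile(rf'(?<![A-Z])(?:{alternation})(?![A-Z])')

def expand_acronyms(text):
    # Every match is guaranteed to be a dictionary key, so plain indexing is safe.
    return ACRONYM_RE.sub(lambda m: acronym_dict[m.group(0)], text)

text_list = [
    'The status QUO has changed', 'I SWEAR, This is part of IN_SW',
    'The update does not belong to the SW, version, branch',
    'This is a RE_Text',
]
print([expand_acronyms(text) for text in text_list])
```

Whether this is faster than the generic [A-Z]+ pattern depends on the size of the dictionary and the texts, so it is worth timing both on the real corpus.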