How can I correct vowels that were replaced by a special character in a string?

I have a string (typo_text) in which certain characters in certain words have been replaced with others. For example, the correct text would be: "USA, Germany, the European Commission, Japan, and Canada to fund the development and equitable rollout of the tests." Instead, I'm given: "XSX, Gxrmxny, the European Commission, Jxpxn, and Cxnxdx to fund the development and equitable rollout of the tests, treatments and vaccines needed to end the acute phase of the COVID-19 pandemic." I have been writing a script that uses for loops to correct the typos:

def corrected_text(text):
    newstring=""
    for i in text:
        if i not in "aeiouAEIOU":
            newstring = newstring + i
    text=newstring
    return text

I know that when I run this, it only removes all vowels from the text. However, it seems to be a step in the right direction toward correcting the typos, and it helps me get a feel for a for-loop-based approach.

I have two lists of words that have this issue:

name_G7_countries = ['Canada', 'France', 'Germany', 'Italy', 'Japan', 'UK', 'USA']
mistake =  ['Cxnxdx', 'Frxncx', 'Gxrmxny', 'Xtxly', 'Jxpxn','XK', 'XSX']

I know something like 'Jxpxn'.replace('x', 'a') works for that one word; however, it will not work for words whose missing vowels differ, so I'm not sure how to proceed from here.
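To make the concern concrete, here is a quick check using two words from the mistake list. The variable names are only for illustration:

```python
# A single fixed replacement only works when every missing vowel is the same letter.
japan_guess = 'Jxpxn'.replace('x', 'a')
germany_guess = 'Gxrmxny'.replace('x', 'a')

print(japan_guess)    # Japan
print(germany_guess)  # Garmany, not Germany
```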

CodePudding user response：

Create a lookup table, where the keys are the names of the countries with the vowels removed, and the values are the actual names of the countries.

Then, you can look up countries in this table by removing the Xs in the corrupted data.

name_G7_countries = ['Canada', 'France', 'Germany', 'Italy', 'Japan', 'UK', 'USA']
mistake =  ['Cxnxdx', 'Frxncx', 'Gxrmxny', 'Xtxly', 'Jxpxn','XK', 'XSX']

def remove_letters(s, letters_to_remove):
    return ''.join(filter(lambda x: x not in letters_to_remove, s))

country_lookup_table = {remove_letters(s, "AEIOUaeiou"): s 
    for s in name_G7_countries}

result = [country_lookup_table[remove_letters(mistake_country, 'Xx')] 
    for mistake_country in mistake]

# Prints ['Canada', 'France', 'Germany', 'Italy', 'Japan', 'UK', 'USA']
print(result)

Note that we use ''.join() rather than repeated concatenation for the remove_letters() function for efficiency reasons.
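The lookup corrects a list of isolated names, while the question starts from a full sentence. Below is a minimal sketch of one way to bridge that gap. It assumes that words are runs of ASCII letters and that only words containing x or X are candidates for correction. The names correct_sentence and fix_word are hypothetical and not part of the answer above.

```python
import re

name_G7_countries = ['Canada', 'France', 'Germany', 'Italy', 'Japan', 'UK', 'USA']

def remove_letters(s, letters_to_remove):
    return ''.join(ch for ch in s if ch not in letters_to_remove)

# Keys are the country names with vowels stripped, e.g. 'Grmny' -> 'Germany'.
country_lookup_table = {remove_letters(s, "AEIOUaeiou"): s
                        for s in name_G7_countries}

def fix_word(match):
    word = match.group(0)
    # Leave words without the placeholder untouched.
    if 'x' not in word and 'X' not in word:
        return word
    # Fall back to the original word when the stripped form is unknown.
    return country_lookup_table.get(remove_letters(word, 'Xx'), word)

def correct_sentence(text):
    # Assumption: a "word" is a run of ASCII letters.
    return re.sub(r'[A-Za-z]+', fix_word, text)

typo_text = "XSX, Gxrmxny, the European Commission, Jxpxn, and Cxnxdx to fund the tests."
print(correct_sentence(typo_text))
```

Words that contain an x but are not in the table (for example "next") are returned unchanged, because the lookup falls back to the original word.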

CodePudding user response：

You can try doing that using regular expressions.
We iterate over the two lists in parallel and turn each corrupted name into a pattern: every x becomes [aeioux] and every X becomes [AEIOUX] (meaning "match any one lowercase vowel or the placeholder x", and likewise for uppercase). Every match of that pattern in the text is then replaced with the original name. Including the placeholder in the character class is what lets the pattern match the corrupted word itself, and it also covers words in which only some vowels were replaced.

import re

name_G7_countries = ['Canada', 'France', 'Germany', 'Italy', 'Japan', 'UK', 'USA']
mistake =  ['Cxnxdx', 'Frxncx', 'Gxrmxny', 'Xtxly', 'Jxpxn','XK', 'XSX']

def corrected_text(text):
    for original, m in zip(name_G7_countries, mistake):
        # Build a pattern for the corrupted name, e.g. 'Jxpxn' -> 'J[aeioux]p[aeioux]n'
        pattern = m.replace('x', '[aeioux]').replace('X', '[AEIOUX]')
        text = re.sub(pattern, original, text)
    return text

So for example, 'Xtxly' becomes '[AEIOUX]t[aeioux]ly', which matches an uppercase vowel or X, then a 't', then a lowercase vowel or x, then 'ly'. Each match is replaced with 'Italy'.
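One caveat with pattern-based replacement is that a short pattern such as the one built from 'XK' can match inside a longer word. The sketch below is a hedged variant that wraps each pattern in \b word boundaries. It assumes the placeholder is always x or X, so the character classes include the placeholder and the pattern matches the corrupted word as it appears in the text. The name corrected_text_bounded is hypothetical.

```python
import re

name_G7_countries = ['Canada', 'France', 'Germany', 'Italy', 'Japan', 'UK', 'USA']
mistake = ['Cxnxdx', 'Frxncx', 'Gxrmxny', 'Xtxly', 'Jxpxn', 'XK', 'XSX']

def corrected_text_bounded(text):
    for original, m in zip(name_G7_countries, mistake):
        # Each placeholder may stand for any vowel of the same case (or itself).
        pattern = m.replace('x', '[aeioux]').replace('X', '[AEIOUX]')
        # \b on both sides restricts matches to whole words.
        text = re.sub(r'\b' + pattern + r'\b', original, text)
    return text

typo_text = "XSX, Gxrmxny, the European Commission, Jxpxn, and Cxnxdx to fund the tests."
print(corrected_text_bounded(typo_text))
```

Word boundaries prevent a match inside a word such as "TOKYO", but a standalone word that fits a pattern (for example "OK" against the 'XK' pattern) would still be rewritten, so very short names remain ambiguous with this approach.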