I have a list of strings with lots of unnecessary punctuation marks. How can I remove all punctuation and digits from these strings in a more Pythonic way?
medicinal_chemicals = ['(RS)-2-(4-(2-methylpropyl)phenyl)propanoic acid', 'emtricitabine / tenofovir (Stribild®)', 'Interleukin-2 (Aldesleukin)']
CodePudding user response:
You can use the following code. Each cleaning step is its own function, so you can apply only the ones you need.
If you want to keep non-ASCII characters (such as ®), just skip the last function, which drops them.
import re
import string

pattern = r'[0-9]'
# Translation table that maps every ASCII punctuation character to a space.
punct_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def removePunct(s):
    # Replace punctuation with spaces, then collapse runs of whitespace.
    return " ".join(s.translate(punct_table).split())

def removeDigits(s):
    # Replace every digit with a space.
    return re.sub(pattern, ' ', s)

def removeSpaceTrailsString(s):
    # Collapse repeated whitespace and trim both ends.
    return " ".join(s.split())

def clean_unicode(s):
    # Drop every character that cannot be encoded as ASCII (for example ®).
    s_encode = s.encode("ascii", "ignore")
    return s_encode.decode()

medicinal_chemicals = ['(RS)-2-(4-(2-methylpropyl)phenyl)propanoic acid', 'emtricitabine / tenofovir (Stribild®)', 'Interleukin-2 (Aldesleukin)']
punctCleaned = list(map(removePunct, medicinal_chemicals))
digitsCleaned = list(map(removeDigits, punctCleaned))
spacesCleaned = list(map(removeSpaceTrailsString, digitsCleaned))
unicodeCleaned = list(map(clean_unicode, spacesCleaned))
print(unicodeCleaned)
['RS methylpropyl phenyl propanoic acid', 'emtricitabine tenofovir Stribild', 'Interleukin Aldesleukin']
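If only ASCII letters should survive, the separate steps can probably be folded into a single regular expression. The sketch below is an alternative, not part of the answer above: `clean_name` is a hypothetical helper name, and it assumes that everything other than A-Z and a-z (punctuation, digits, whitespace, and non-ASCII characters) should be treated as a separator.

```python
import re

def clean_name(s):
    # Hypothetical helper: keep runs of ASCII letters and join them with single spaces.
    # Anything else (punctuation, digits, whitespace, non-ASCII) acts as a separator.
    return " ".join(re.findall(r"[A-Za-z]+", s))

medicinal_chemicals = [
    '(RS)-2-(4-(2-methylpropyl)phenyl)propanoic acid',
    'emtricitabine / tenofovir (Stribild®)',
    'Interleukin-2 (Aldesleukin)',
]

cleaned = [clean_name(s) for s in medicinal_chemicals]
print(cleaned)
```

For the three sample strings this gives the same list as the output above. One difference to keep in mind: a non-ASCII letter in the middle of a word splits that word here, whereas the encode/decode approach simply deletes the character and joins the neighbouring letters.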