I have a very large list of strings in which each word is unnormalized, for instance:
word_list = ["Alzheimer", "Alzheimer's", "Alzheimer.", "Alzheimer?","Cognition.", "Cognition's", "Cognitions", "Cognition"] # and the list goes on
As you can see, there are many identical terms in the list, but some of them contain unnecessary punctuation (e.g. a dot or an apostrophe). How can I normalize all the words (e.g. "Alzheimers." -> "Alzheimers", "Cognition's" -> "Cognition")?
Thank you in advance!
I expect a function that filters out the unnecessary punctuation. I tried the following function, but it did not work well:
def word_normalizer(word):
    # Remove unnecessary punctuation and turn the word into lower case
    punctuations = ["'", '"', ";", ":", ",", ".", "&", "(", ")"]
    new_word = ""
    for punc in punctuations:
        if punc in word:
            new_word = word.strip(punc)
    return new_word.lower()
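As far as I can tell, there are two problems: new_word stays empty when none of the listed characters occur in the word, and str.strip only trims the ends, so the apostrophe inside "Alzheimer's" is never removed. A patched version of the same loop that seems to work:

def word_normalizer(word):
    punctuations = ["'", '"', ";", ":", ",", ".", "&", "(", ")"]
    new_word = word
    for punc in punctuations:
        # delete the character everywhere, keeping earlier deletions
        new_word = new_word.replace(punc, "")
    return new_word.lower()

But I suspect there is a cleaner, more idiomatic way to do this.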
CodePudding user response:
You can try this function. I tried it; it removes all special characters (all non-letters/non-digits, including punctuation) and turns the string into lower case:
import re

def word_normalizer(word):
    # drop every character that is not a letter or a digit
    word = re.sub('[^A-Za-z0-9]+', '', word)
    return word.lower()
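A quick check against the sample list (note that deleting the apostrophe turns "Alzheimer's" into "alzheimers", which is still distinct from "alzheimer"):

word_list = ["Alzheimer", "Alzheimer's", "Alzheimer.", "Alzheimer?",
             "Cognition.", "Cognition's", "Cognitions", "Cognition"]
print({word_normalizer(word) for word in word_list})
# {'alzheimer', 'alzheimers', 'cognition', 'cognitions'} (set order may vary)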
CodePudding user response:
The standard string module has a useful punctuation value (which may or may not be suitable for your purposes), and the re module handles the replacements conveniently.
Note that the following code removes a trailing 's', which may not be desirable in all cases: just because a word ends with 's' doesn't necessarily mean that it's plural.
Some replacements will result in duplicates, so use a set.
import string
import re

# character class matching any punctuation character (escaped so none of
# string.punctuation is misread as regex syntax)
punc = re.compile(f'[{re.escape(string.punctuation)}]')
word_list = ["Alzheimer", "Alzheimer's", "Alzheimer.", "Alzheimer?",
             "Cognition.", "Cognition's", "Cognitions", "Cognition"]
# strip punctuation, drop a trailing 's', and deduplicate with a set comprehension
new_word_set = {punc.sub('', word).rstrip('s') for word in word_list}
print(new_word_set)
Output:
{'Alzheimer', 'Cognition'}
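If you also want the lowercase output the question asks for, add .lower() inside the comprehension:

new_word_set = {punc.sub('', word).rstrip('s').lower() for word in word_list}
# {'alzheimer', 'cognition'}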
CodePudding user response:
You can use functional techniques to do this pretty concisely:
from functools import reduce

punctuation = ["'", '"', ";", ":", ",", ".", "&", "(", ")"]
words = ["Alzheimer", "Alzheimer's", "Alzheimer.", "Alzheimer?",
         "Cognition.", "Cognition's", "Cognitions", "Cognition"]

# remove a character from a string
def strip_punc(word: str, character_to_strip: str) -> str:
    return word.replace(character_to_strip, "")

# run the strip_punc function for each item in the punctuation list
def clean_word(word: str) -> str:
    return reduce(strip_punc, punctuation, word)

# run the clean_word function on each word in the word list;
# use set to remove dupes
normalized = set(map(clean_word, words))
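Checking the result (the sample punctuation list doesn't include "?", so "Alzheimer?" passes through unchanged, and no trailing 's' is stripped here):

print(normalized)
# {'Alzheimer', 'Alzheimers', 'Alzheimer?', 'Cognition', 'Cognitions'} (order may vary)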
You can use lambda functions to make this even more concise:
from functools import reduce

punctuation = ["'", '"', ";", ":", ",", ".", "&", "(", ")"]
words = ["Alzheimer", "Alzheimer's", "Alzheimer.", "Alzheimer?",
         "Cognition.", "Cognition's", "Cognitions", "Cognition"]

normalized = set(
    map(
        # fold the punctuation list over each word, deleting every listed character
        lambda word: reduce(lambda w, p: w.replace(p, ""), punctuation, word),
        words,
    )
)
CodePudding user response:
You can also use str.translate and str.maketrans with punctuation from the string module to remove punctuation:
from string import punctuation

word_list = ["Alzheimer", "Alzheimer's", "Alzheimer.", "Alzheimer?",
             "Cognition.", "Cognition's", "Cognitions", "Cognition"]

def word_normalizer(word):
    # delete all punctuation, drop a trailing 's', and lowercase
    return word.translate(str.maketrans('', '', punctuation)).rstrip('s').lower()

new_word_list = [*{word_normalizer(word) for word in word_list}]
print(*new_word_list)
# alzheimer cognition
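Since the list is very large, one small tweak is to build the translation table once instead of on every call; a minimal variant of the same idea (the _PUNC_TABLE name is just illustrative):

from string import punctuation

# build the deletion table a single time and reuse it for every word
_PUNC_TABLE = str.maketrans('', '', punctuation)

def word_normalizer(word):
    return word.translate(_PUNC_TABLE).rstrip('s').lower()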