Normalizing a list of strings in Python


I have a very large list of strings in which each word is unnormalized, for instance:

word_list = ["Alzheimer", "Alzheimer's", "Alzheimer.", "Alzheimer?","Cognition.", "Cognition's", "Cognitions", "Cognition"] # and the list goes on

As you can see, there are many identical terms in the list, but some of them contain unnecessary punctuation (e.g. a dot or an apostrophe). How can I normalize all the words (e.g.: "Alzheimers." -> "Alzheimers", "Cognition's" -> "Cognition")?

Thank you in advance!

I expect a function that filters out the unnecessary punctuation. I tried the following function, but it did not work well:

def word_normalizer(word): # Remove unnecessary single puntuations and turn all words into lower case
    puntuations = ["'", '"', ";", ":", ",", ".", "&", "(", ")"]
    new_word =""
    for punc in puntuations:
        if punc in word:
            new_word = word.strip(punc)
            
        return new_word.lower()

CodePudding user response:

You can try this function. It removes all the special characters (anything that is not a letter or a digit, including punctuation) and lower-cases the string:

import re

def word_normalizer(word):
    # remove every run of non-alphanumeric characters, then lower-case
    word = re.sub('[^A-Za-z0-9]+', '', word)
    return word.lower()
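
For example, applying it to the question's list and deduplicating with a set (set ordering varies between runs):

word_list = ["Alzheimer", "Alzheimer's", "Alzheimer.", "Alzheimer?", "Cognition.", "Cognition's", "Cognitions", "Cognition"]

print({word_normalizer(word) for word in word_list})
# {'alzheimer', 'alzheimers', 'cognition', 'cognitions'}

Note that this keeps the trailing s left over from the possessive forms.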

CodePudding user response:

The standard string module has a useful punctuation value (which may or may not be suitable for your purposes).
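
For reference, string.punctuation is just a string of the ASCII punctuation characters:

import string

print(string.punctuation)
# !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~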

You could conveniently use the re module to handle the replacements.

The following code removes a trailing 's', which may not be desirable in all cases - i.e., just because a word ends with 's' doesn't necessarily mean that it's plural.

Some replacements will produce duplicates, so use a set.

import string
import re

punc = re.compile(f'[{string.punctuation}]')
word_list = ["Alzheimer", "Alzheimer's", "Alzheimer.", "Alzheimer?","Cognition.", "Cognition's", "Cognitions", "Cognition"]
new_word_set = {punc.sub('', word).rstrip('s') for word in word_list}

print(new_word_set)

Output:

{'Alzheimer', 'Cognition'}
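
If you also want the lower-casing from the question's attempt, the same set comprehension can chain .lower() (a small variation on the code above):

new_word_set = {punc.sub('', word).rstrip('s').lower() for word in word_list}
print(new_word_set)
# {'alzheimer', 'cognition'}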

CodePudding user response:

You can use functional techniques to do this pretty concisely.

from functools import reduce

punctuation = ["'", '"', ";", ":", ",", ".", "&", "(", ")"]
words = ["Alzheimer", "Alzheimer's", "Alzheimer.", "Alzheimer?", "Cognition.", "Cognition's", "Cognitions", "Cognition"]

# remove a character from a string
def strip_punc(word: str, character_to_strip: str) -> str:
    return word.replace(character_to_strip, "")

# run the strip_punc function for each character in the punctuation list
def clean_word(word: str) -> str:
    return reduce(strip_punc, punctuation, word)

# run the clean_word function on each word in the word list
# use a set to remove dupes
print(set(map(clean_word, words)))

You can use lambda functions to make this even more concise.

from functools import reduce

punctuation = ["'", '"', ";", ":", ",", ".", "&", "(", ")"]
words = ["Alzheimer", "Alzheimer's", "Alzheimer.", "Alzheimer?", "Cognition.", "Cognition's", "Cognitions", "Cognition"]

# for each word, fold over the punctuation list, replacing each character with ""
print(set(
    map(
        lambda word: reduce(lambda acc, p: acc.replace(p, ""), punctuation, word),
        words,
    )
))

CodePudding user response:

You can also use str.translate and str.maketrans with punctuation from the string module to remove punctuation:

word_list = ["Alzheimer", "Alzheimer's", "Alzheimer.", "Alzheimer?","Cognition.", "Cognition's", "Cognitions", "Cognition"]

def word_normalizer(word):
    from string import punctuation
    return word.translate(str.maketrans('', '', punctuation)).rstrip('s').lower()


new_word_list = [*{word_normalizer(word) for word in word_list}]

print(*new_word_list)

# alzheimer cognition
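
As a brief aside on how this works: str.maketrans('', '', punctuation) builds a table mapping each punctuation character to None, and str.translate drops every character mapped to None. A minimal illustration:

from string import punctuation

table = str.maketrans('', '', punctuation)  # each punctuation code point -> None
print("Alzheimer's?".translate(table))
# Alzheimers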