Revert string to initial casing and punctuation

Is there a way to modify this code so that it keeps its logic but restores the original casing and punctuation of the strings?

import pandas as pd

data = {'duplicate_column':["Adidas Women's Womens A004 Snow Boot", 'Amul Milk, 100ml, 100ML', 'L-OCCITANE L´Occitane CREMA MANI', 'Corneto Ice Cream Ice, 300 ml -300ml', 'Béaba BÉABA, Set di 6 Contenitori,set']}
df = pd.DataFrame(data)
punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~´'
transtab = str.maketrans(dict.fromkeys(punct, ''))

# Strip punctuation, lowercase, then drop repeated words while keeping their order
df['new_column'] = [
    ' '.join(dict.fromkeys(s.translate(transtab).lower().split()))
    for s in df['duplicate_column']
]

This code removes duplicate words from the column 'duplicate_column' and stores the result in a new column. I need the resulting strings to keep their original casing and punctuation.
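To see where the casing and punctuation are lost, the one-liner can be unrolled step by step. This is only a sketch: `string.punctuation + "´"` is an assumed stand-in for `punct`, and the intermediate variable names are illustrative.

```python
import string

# Assumption: string.punctuation plus the acute accent stands in for `punct`.
punct = string.punctuation + "´"
transtab = str.maketrans(dict.fromkeys(punct, ""))

s = "Amul Milk, 100ml, 100ML"
stripped = s.translate(transtab)   # punctuation is removed here
lowered = stripped.lower()         # original casing is discarded here
unique_words = list(dict.fromkeys(lowered.split()))  # order-preserving dedupe
result = " ".join(unique_words)
print(result)
```

Because the deduplication runs on the already normalized text, nothing of the original form is left to restore afterwards.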

Duplicate column (initial data):

Adidas Women's Womens A004 Snow Boot
Amul Milk, 100ml, 100ML
L-OCCITANE L´Occitane CREMA MANI
Corneto Ice Cream Ice, 300 ml -300ml
Béaba BÉABA, Set di 6 Contenitori,set

New column created with the code above:

adidas womens a004 snow boot
amul milk 100ml
loccitane crema mani
corneto ice cream 300 ml 300ml
béaba set di 6 contenitori set

Desired output:

Adidas Women's A004 Snow Boot
Amul Milk, 100ml
L-OCCITANE CREMA MANI
Corneto Ice Cream, 300 ml
Béaba, Set di 6 Contenitori

CodePudding user response：

Another version:

import re

# Punctuation ignored when comparing words ("+-." is a range covering + , - .)
remove_punct = re.compile("""[!"#$%&'()*+-./:;<=>?@[\\]^_`{}~´]""")
# A number followed by whitespace and "ml", e.g. "300 ml"
millilitres = re.compile(r"(\d+)\s+(ml)", flags=re.I)


def remove_duplicates(x):
    # basic preprocessing: commas become spaces, "300 ml" becomes "300ml"
    x = x.replace(",", " ")
    x = millilitres.sub(r"\1\2", x)

    # original words and their normalized counterparts, paired by position
    words = x.split()
    words_without_punct = remove_punct.sub("", x).lower().split()
    dupl, out = set(), []
    for w, wwp in zip(words, words_without_punct):
        if wwp not in dupl:
            out.append(w)
            dupl.add(wwp)
    return " ".join(out)


df["new_column"] = df["duplicate_column"].apply(remove_duplicates)
print(df)

Prints:

                        duplicate_column                     new_column
0   Adidas Women's Womens A004 Snow Boot  Adidas Women's A004 Snow Boot
1                Amul Milk, 100ml, 100ML                Amul Milk 100ml
2       L-OCCITANE L´Occitane CREMA MANI          L-OCCITANE CREMA MANI
3   Corneto Ice Cream Ice, 300 ml -300ml        Corneto Ice Cream 300ml
4  Béaba BÉABA, Set di 6 Contenitori,set     Béaba Set di 6 Contenitori
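Two details of this version are easy to check in isolation. The sketch below assumes the unit pattern is `(\d+)\s+(ml)` and uses `string.punctuation + "´"` as a stand-in for the answer's `remove_punct` character class; the sample strings are made up for illustration.

```python
import re
import string

# Assumption: a stand-in for the answer's `remove_punct` pattern.
remove_punct = re.compile("[%s]" % re.escape(string.punctuation + "´"))
millilitres = re.compile(r"(\d+)\s+(ml)", flags=re.I)

# The substitution joins a number to a following "ml" unit.
joined = millilitres.sub(r"\1\2", "Ice  300 ml -300ml")
print(joined)

# A token made only of punctuation vanishes from one list but not the other,
# so pairing the two lists by position would no longer line up.
x = "Milk - 100ml"
words = x.split()
words_without_punct = remove_punct.sub("", x).lower().split()
print(len(words), len(words_without_punct))
```

The second check suggests the `zip`-based pairing is only safe when no token consists purely of punctuation, which holds for the sample data.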

CodePudding user response：

Rather than trying to restore punctuation and case afterwards, don't drop them in the first place. You could use a custom function based on sets:

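import re

# `punct` is the punctuation string defined in the question
regex = re.compile('[%s]' % re.escape(punct))

def remove_dup(s):
    seen = set()   # normalized words already encountered
    keep = []      # original words, in order of first appearance
    for w in s.split():
        w2 = regex.sub('', w.lower())   # comparison key: lowercase, no punctuation
        if w2 in seen:
            continue
        seen.add(w2)
        keep.append(w)
    return ' '.join(keep).strip(punct)

df['new_column'] = list(map(remove_dup, df['duplicate_column']))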

Output:

                        duplicate_column                       new_column
0   Adidas Women's Womens A004 Snow Boot    Adidas Women's A004 Snow Boot
1                Amul Milk, 100ml, 100ML                 Amul Milk, 100ml
2       L-OCCITANE L´Occitane CREMA MANI            L-OCCITANE CREMA MANI
3   Corneto Ice Cream Ice, 300 ml -300ml  Corneto Ice Cream 300 ml -300ml
4  Béaba BÉABA, Set di 6 Contenitori,set   Béaba Set di 6 Contenitori,set

Alternative:

import re
regex = re.compile('[%s]' % re.escape(punct))
def remove_dup(s):
    seen = set()
    keep = []
    for w in s.split():
        w2 = regex.sub('', w.lower())
        if w2 in seen:
            continue 
        seen.add(w2)
        keep.append(w)
    return ' '.join(keep).strip(punct)
        
df['new_column'] = list(map(remove_dup, df['duplicate_column']))

Output:

                        duplicate_column                      new_column
0   Adidas Women's Womens A004 Snow Boot  Adidas Women's  A004 Snow Boot
1                Amul Milk, 100ml, 100ML                 Amul Milk,100ml
2       L-OCCITANE L´Occitane CREMA MANI        L-OCCITANE L´ CREMA MANI
3   Corneto Ice Cream Ice, 300 ml -300ml       Corneto Ice Cream ,300 ml
4  Béaba BÉABA, Set di 6 Contenitori,set     Béaba ,Set di 6 Contenitori
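None of the outputs above matches the desired output in the question exactly. The following is a rough sketch of how the set-based function could be extended to get closer, under three assumptions that are not in the original posts: a comma glued to the next word separates two tokens, a comma attached to a dropped duplicate is moved to the previous kept word, and a number followed by a unit (`300 ml`) also counts as the joined form (`300ml`). The name `dedupe_keep_format` is hypothetical.

```python
import re

PUNCT = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~´'
_strip_punct = re.compile('[%s]' % re.escape(PUNCT))


def dedupe_keep_format(s):
    # Assumption: "Contenitori,set" should be read as two tokens.
    tokens = re.sub(r',(?=\S)', ', ', s).split()
    seen, keep = set(), []
    prev_key = ''
    for tok in tokens:
        key = _strip_punct.sub('', tok).lower()
        if not key:
            continue  # token was punctuation only
        if key in seen:
            # Move the duplicate's trailing comma to the last kept word.
            if tok.endswith(',') and keep and not keep[-1].endswith(','):
                keep[-1] += ','
            continue
        # Assumption: "300" followed by "ml" also covers the joined "300ml".
        if prev_key.isdigit():
            seen.add(prev_key + key)
        seen.add(key)
        keep.append(tok)
        prev_key = key
    return ' '.join(keep).strip(PUNCT + ' ')


samples = [
    "Adidas Women's Womens A004 Snow Boot",
    'Amul Milk, 100ml, 100ML',
    'L-OCCITANE L´Occitane CREMA MANI',
    'Corneto Ice Cream Ice, 300 ml -300ml',
    'Béaba BÉABA, Set di 6 Contenitori,set',
]
for s in samples:
    print(dedupe_keep_format(s))
```

On the five sample strings this reproduces the desired output from the question. With the DataFrame it could be applied as `df['new_column'] = df['duplicate_column'].map(dedupe_keep_format)`. The comma and unit rules are heuristics tuned to this data and would likely need adjusting for other inputs.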