Is there a way I can modify this code to keep its logic but revert the strings to initial casing and punctuation?
import pandas as pd

data = {'duplicate_column':["Adidas Women's Womens A004 Snow Boot", 'Amul Milk, 100ml, 100ML', 'L-OCCITANE L´Occitane CREMA MANI', 'Corneto Ice Cream Ice, 300 ml -300ml', 'Béaba BÉABA, Set di 6 Contenitori,set']}
df = pd.DataFrame(data)
# characters that are deleted before words are compared
punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~´'
transtab = str.maketrans(dict.fromkeys(punct, ''))
# dict.fromkeys keeps the first occurrence of each word, in order
df['new_column'] = [
    ' '.join(dict.fromkeys(s.translate(transtab).lower().split()))
    for s in df['duplicate_column']
]
This code removes duplicate words from the column 'duplicate_column' and stores the result in a new column. However, the result is lowercased and stripped of punctuation; I need the remaining words to keep their initial casing and punctuation.
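To show where the original form gets lost, here is the same pipeline applied step by step to one sample string. This is only an illustrative sketch; the intermediate names `normalized`, `tokens` and `unique` are not part of the code above.

```python
punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~´'
transtab = str.maketrans(dict.fromkeys(punct, ''))

s = "Amul Milk, 100ml, 100ML"

# Step 1: delete punctuation and lowercase. The original form is gone from here on.
normalized = s.translate(transtab).lower()   # 'amul milk 100ml 100ml'

# Step 2: split into words.
tokens = normalized.split()                  # ['amul', 'milk', '100ml', '100ml']

# Step 3: dict.fromkeys drops repeats and keeps first-seen order.
unique = list(dict.fromkeys(tokens))         # ['amul', 'milk', '100ml']

result = ' '.join(unique)
print(result)                                # amul milk 100ml
```

Because the deduplication runs on the already normalized words, there is nothing left to map back to "Milk," or "100ml".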
Duplicate column (initial data):
Adidas Women's Womens A004 Snow Boot
Amul Milk, 100ml, 100ML
L-OCCITANE L´Occitane CREMA MANI
Corneto Ice Cream Ice, 300 ml -300ml
Béaba BÉABA, Set di 6 Contenitori,set
New column created with the code above:
adidas womens a004 snow boot
amul milk 100ml
loccitane crema mani
corneto ice cream 300 ml 300ml
béaba set di 6 contenitori set
Desired output:
Adidas Women's A004 Snow Boot
Amul Milk, 100ml
L-OCCITANE CREMA MANI
Corneto Ice Cream, 300 ml
Béaba, Set di 6 Contenitori
CodePudding user response:
Another version:
import re

# punctuation to ignore when comparing words
remove_punct = re.compile(r"""[!"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{}~´]""")
# a number followed by whitespace and "ml", in any case
millilitres = re.compile(r"(\d+)\s+(ml)", flags=re.I)

def remove_duplicates(x):
    # do some basic preprocessing
    x = x.replace(",", " ")
    x = millilitres.sub(r"\1\2", x)  # "300 ml" -> "300ml"
    words = x.split()
    words_without_punct = remove_punct.sub("", x).lower().split()
    dupl, out = set(), []
    # pair every original word with its normalized form
    for w, wwp in zip(words, words_without_punct):
        if wwp not in dupl:
            out.append(w)
            dupl.add(wwp)
    return " ".join(out)

df["new_column"] = df["duplicate_column"].apply(remove_duplicates)
print(df)
Prints:
   duplicate_column                       new_column
0  Adidas Women's Womens A004 Snow Boot   Adidas Women's A004 Snow Boot
1  Amul Milk, 100ml, 100ML                Amul Milk 100ml
2  L-OCCITANE L´Occitane CREMA MANI       L-OCCITANE CREMA MANI
3  Corneto Ice Cream Ice, 300 ml -300ml   Corneto Ice Cream 300ml
4  Béaba BÉABA, Set di 6 Contenitori,set  Béaba Set di 6 Contenitori
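One thing to watch with the zip-based pairing: it relies on both lists having the same length. If a token consists only of punctuation (a standalone "-", for example), it disappears from the normalized list and every later pair is shifted by one. A possible variant, sketched here under the assumption that the same two regexes are used, normalizes each token on its own so that no pairing is needed. The name `remove_duplicates_per_token` is made up for this example.

```python
import re

remove_punct = re.compile(r"""[!"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{}~´]""")
millilitres = re.compile(r"(\d+)\s+(ml)", flags=re.I)


def remove_duplicates_per_token(x):
    # same preprocessing as in remove_duplicates
    x = x.replace(",", " ")
    x = millilitres.sub(r"\1\2", x)
    seen, out = set(), []
    for w in x.split():
        # normalize this token only, so original and key always belong together
        key = remove_punct.sub("", w).lower()
        if key and key in seen:
            continue
        if key:
            seen.add(key)
        # tokens made only of punctuation have an empty key and are kept as they are
        out.append(w)
    return " ".join(out)


print(remove_duplicates_per_token("Snow - Boot Boot"))
print(remove_duplicates_per_token("Corneto Ice Cream Ice, 300 ml -300ml"))
```

On the five sample rows this should give the same result as the zip version; the difference only shows up on inputs with punctuation-only tokens.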
CodePudding user response:
Rather than trying to restore the punctuation and case afterwards, don't drop them in the first place. You could use a custom function based on a set:
import re

# punct is the string defined in the question
regex = re.compile('[%s]' % re.escape(punct))

def remove_dup(s):
    seen = set()
    keep = []
    for w in s.split():
        # compare on the lowercased word without punctuation
        w2 = regex.sub('', w.lower())
        if w2 in seen:
            continue
        seen.add(w2)
        # but keep the word as originally written
        keep.append(w)
    return ' '.join(keep).strip(punct)

df['new_column'] = list(map(remove_dup, df['duplicate_column']))
Output:
   duplicate_column                       new_column
0  Adidas Women's Womens A004 Snow Boot   Adidas Women's A004 Snow Boot
1  Amul Milk, 100ml, 100ML                Amul Milk, 100ml
2  L-OCCITANE L´Occitane CREMA MANI       L-OCCITANE CREMA MANI
3  Corneto Ice Cream Ice, 300 ml -300ml   Corneto Ice Cream 300 ml -300ml
4  Béaba BÉABA, Set di 6 Contenitori,set  Béaba Set di 6 Contenitori,set
alternative
Output:
   duplicate_column                       new_column
0  Adidas Women's Womens A004 Snow Boot   Adidas Women's A004 Snow Boot
1  Amul Milk, 100ml, 100ML                Amul Milk,100ml
2  L-OCCITANE L´Occitane CREMA MANI       L-OCCITANE L´ CREMA MANI
3  Corneto Ice Cream Ice, 300 ml -300ml   Corneto Ice Cream ,300 ml
4  Béaba BÉABA, Set di 6 Contenitori,set  Béaba ,Set di 6 Contenitori
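For rows 3 and 4 the outputs shown in this thread still differ from the desired output in the question ("Corneto Ice Cream, 300 ml" and "Béaba, Set di 6 Contenitori"). Below is a rough sketch that gets there on the five sample rows. It is not taken from any of the answers and rests on three assumptions of mine: a comma always ends a word, a dropped duplicate hands its trailing comma to the previous kept word, and a number followed by a purely alphabetic word ("300 ml") also counts as the joined form ("300ml"). The names `PUNCT`, `STRIP_PUNCT` and `dedupe_keep_form` are hypothetical.

```python
import re

# Assumed punctuation set, modelled on the punct string from the question.
PUNCT = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~´'
STRIP_PUNCT = re.compile('[%s]' % re.escape(PUNCT))


def dedupe_keep_form(s):
    """Drop repeated words (ignoring case and punctuation), keep original spelling."""
    # Assumption 1: a comma ends a word, so "Contenitori,set" -> "Contenitori, set".
    s = re.sub(r',(?=\S)', ', ', s)
    seen, kept = set(), []
    prev_key = ''
    for token in s.split():
        key = STRIP_PUNCT.sub('', token.lower())
        if not key:
            # punctuation-only token: nothing to compare, keep it
            kept.append(token)
            continue
        if key in seen:
            # Assumption 2: pass the duplicate's trailing comma to the last kept word.
            if token.endswith(',') and kept and not kept[-1].endswith(','):
                kept[-1] += ','
            continue
        seen.add(key)
        # Assumption 3: "300" followed by "ml" also marks "300ml" as seen.
        if prev_key.isdigit() and key.isalpha():
            seen.add(prev_key + key)
        kept.append(token)
        prev_key = key
    # remove punctuation left dangling at the end, e.g. "100ml,"
    return ' '.join(kept).rstrip(PUNCT + ' ')


samples = [
    "Adidas Women's Womens A004 Snow Boot",
    "Amul Milk, 100ml, 100ML",
    "L-OCCITANE L´Occitane CREMA MANI",
    "Corneto Ice Cream Ice, 300 ml -300ml",
    "Béaba BÉABA, Set di 6 Contenitori,set",
]
for text in samples:
    print(dedupe_keep_form(text))
```

With a DataFrame this would be applied as `df['new_column'] = df['duplicate_column'].map(dedupe_keep_form)`. The rules are tuned to these five rows, so other data will probably need adjustments, for instance a stricter unit check than `key.isalpha()`.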