Duplicate substring removal from list

Time:09-09

I have a dataframe with a product_type column that has duplicate substrings within strings:

df_1

product_type
bag,bag
tote bag,bag

handbag,handbag

I'm using this line to remove the duplicate substrings and store the result in a new column, "unique_type":

df_1['unique_type'] = [set(sub.split(',')) for sub in df_1["product_type"]]

This is what the new dataframe looks like

current output

product_type         unique_type
bag,bag              {'bag'}
tote bag,bag         {'tote bag', 'bag'}
                     {''}
handbag,handbag      {'handbag'}

The problem is that the values in the new column unique_type are sets, so they are displayed with curly brackets and quotation marks. I would like to produce a column of plain strings without the curly brackets and quotation marks, like so:

desired output

product_type         unique_type
bag,bag              bag
tote bag,bag         tote bag, bag

handbag,handbag      handbag

CodePudding user response:

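Add join to turn each set back into a string (note that a set does not keep the original order of the values):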

df_1['unique_type'] = [', '.join(set(sub.split(','))) for sub in df_1["product_type"]]

Or, if you need to keep the original order of the values, use the dict.fromkeys trick:

df_1['unique_type1'] = [', '.join(dict.fromkeys(sub.split(',')))
                        for sub in df_1["product_type"]]


print(df_1)
      product_type    unique_type   unique_type1
0          bag,bag            bag            bag
1     tote bag,bag  bag, tote bag  tote bag, bag
2                                               
3  handbag,handbag        handbag        handbag
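
If the real data has stray spaces after the commas (for example "tote bag, bag"), splitting on ',' alone would treat 'bag' and ' bag' as different values. Here is a minimal sketch of an order-preserving variant that strips each part first. The helper name dedupe_substrings and the sample list are made up for illustration, they are not from the question:

```python
def dedupe_substrings(value, sep=","):
    """Remove repeated separator-delimited parts, keeping first-seen order."""
    # Strip whitespace around each part so "bag" and " bag" count as the same item.
    parts = [part.strip() for part in value.split(sep)]
    # dict.fromkeys keeps the first occurrence of each key in insertion order
    # (guaranteed since Python 3.7); empty parts are dropped.
    unique_parts = dict.fromkeys(part for part in parts if part)
    return ", ".join(unique_parts)


# Hypothetical sample values mirroring the product_type column.
product_types = ["bag,bag", "tote bag, bag", "", "handbag,handbag"]
unique_types = [dedupe_substrings(value) for value in product_types]
print(unique_types)  # ['bag', 'tote bag, bag', '', 'handbag']
```

With a dataframe, the same helper should work the same way in the list comprehension, e.g. [dedupe_substrings(sub) for sub in df_1["product_type"]], assuming the column holds only strings (no NaN).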