Home > Blockchain >  Removing duplicate entries from a comma separated string column in pandas
Removing duplicate entries from a comma separated string column in pandas

Time:02-17

I have multiple columns in which I merged all the values from previous columns containing strings and separated them with a comma. The only issue is that by doing so it apparently "kept" any duplicate values even though I removed the duplicates based on the column that the merge was done so for the value "5" on column Col_A the corresponding value on Col_B is James,Maria,Harrison.

Unfortunately the output looks something like: James,Maria,James,Harrison,Maria,Marina,Harrison which is what I've been trying to get rid of with various methods such as apply and split.

Here is my latest attempt:

Import pandas as pd
Import numpy as np

df1['Col_B'] = ','.join(set(df1['Col_B'].split(',')))
df1

Original output:

  Index         Col_A                              Col_B
     0            5               Maria,Harrison,James,Harrison,Maria,Maria
     1            2               Maria,Jimmy,Emma,Jim,Jim,Maria,Jimmy,Jimmy
     2            46                                 ...
     3            184                                ...
     4            31                                 ...
     5            81                                 ...

Desired output:

  Index         Col_A        Col_B
     0            5    Maria,Harrison,James
     1            2    Maria,Jimmy,Emma,Jim
     2            46           ...
     3            184          ...
     4            31           ...
     5            81           ...

Any suggestions would be much appreciated it. Looked for some time online but wasn't able to locate a solution for this exact application.

CodePudding user response:

Lots of ways to do this,

if the original order is of no consequence then

df['Col_B'].str.split(',').apply(lambda x : ','.join(set(x)))

0     Maria,Harrison,James,Harrison,Maria,Maria
1    Maria,Jimmy,Emma,Jim,Jim,Maria,Jimmy,Jimmy
2                                           ...
3                                           ...
4                                           ...
5                                           ...

or

df['Col_B'].str.split(',').apply(pd.unique)

0     [Maria, Harrison, James]
1    [Maria, Jimmy, Emma, Jim]
2                        [...]
3                        [...]
4                        [...]
5                        [...]
  • Related