I have multiple columns in which I merged all the values from previous columns containing strings and separated them with a comma. The only issue is that by doing so it apparently "kept" any duplicate values even though I removed the duplicates based on the column that the merge was done so for the value "5" on column Col_A the corresponding value on Col_B is James,Maria,Harrison.
Unfortunately the output looks something like: James,Maria,James,Harrison,Maria,Marina,Harrison which is what I've been trying to get rid of with various methods such as apply and split.
Here is my latest attempt:
Import pandas as pd
Import numpy as np
df1['Col_B'] = ','.join(set(df1['Col_B'].split(',')))
df1
Original output:
Index Col_A Col_B
0 5 Maria,Harrison,James,Harrison,Maria,Maria
1 2 Maria,Jimmy,Emma,Jim,Jim,Maria,Jimmy,Jimmy
2 46 ...
3 184 ...
4 31 ...
5 81 ...
Desired output:
Index Col_A Col_B
0 5 Maria,Harrison,James
1 2 Maria,Jimmy,Emma,Jim
2 46 ...
3 184 ...
4 31 ...
5 81 ...
Any suggestions would be much appreciated it. Looked for some time online but wasn't able to locate a solution for this exact application.
CodePudding user response:
Lots of ways to do this,
if the original order is of no consequence then
df['Col_B'].str.split(',').apply(lambda x : ','.join(set(x)))
0 Maria,Harrison,James,Harrison,Maria,Maria
1 Maria,Jimmy,Emma,Jim,Jim,Maria,Jimmy,Jimmy
2 ...
3 ...
4 ...
5 ...
or
df['Col_B'].str.split(',').apply(pd.unique)
0 [Maria, Harrison, James]
1 [Maria, Jimmy, Emma, Jim]
2 [...]
3 [...]
4 [...]
5 [...]