How can I remove duplicate parts of a string, within a column and sort the values-CodePudding

I have a dataframe like this (assuming one column):

column
[A,C,B,A]
[HELLO,HELLO,ha]
[test/1, test/1, test2]

The type of the column above is: dtype('O')

I would like to remove the duplicates here, resulting in:

column
[A,C,B]                  # - A
[HELLO, ha]               # removing 1 hello
[test/1, test2]  # removing 1 test/1

Then, I would like to sort the data

column
[A,B,C]                  
[ha, HELLO]             
[test2, test/1]  # assuming that number comes before /

I am struggling getting this done in a proper way. Hope anyone has nice ideas (would it make sense to transform to small lists?)

CodePudding user response：

A set is a datatype which is like a list but it only includes unique values. This means that you can convert your list to a set and then back to a list (if you want a list of those unique values).

You would then write something like this:

list(set(name_of_the_list))

CodePudding user response：

Nested list inside of a dataframe is not a good habits to use, maybe for some really specific problems. I strongly encourage you to read how to stick to tidy paradigm (I know it's from a R point of view, but it can be easily transcript into python).

Nonetheless, using Numpy functions as unique and sort can do the job for your example, assuming df is your dataframe :

Using a for loop on your dataframe

df = pd.DataFrame({"col" : 
                   [["A","C","B","A"],
                    ["HELLO","HELLO","ha"],
                    ["test/1", "test/1", "test2"]]})
    
for n, value in df.iterrows():
    df.loc[n, "col"] = np.sort(np.unique(value.loc["col"]))

df
Out[77]: 
               col
0        [A, B, C]
1      [HELLO, ha]
2  [test/1, test2]

Using applymap

df2 = pd.DataFrame({"col" : 
                   [["A","C","B","A"],
                    ["HELLO","HELLO","ha"],
                    ["test/1", "test/1", "test2"]]})
    
df2 = df2.applymap(lambda x : np.sort(np.unique(x)))

df2
Out[75]: 
               col
0        [A, B, C]
1      [HELLO, ha]
2  [test/1, test2]

CodePudding user response：

Assuming that you have lists in the column, use a list comprehension.

If you want to maintain order:

df['column_keep_order'] = [list(dict.fromkeys(x)) for x in df['column']]

If you want to sort the items:

df['column_sorted'] = [sorted(set(x)) for x in df['column']]

output:

                    column column_keep_order    column_sorted
0             [A, C, B, A]         [A, C, B]        [A, B, C]
1       [HELLO, HELLO, ha]       [HELLO, ha]      [HELLO, ha]
2  [test/1, test/1, test2]   [test/1, test2]  [test/1, test2]

reproducible input:

df = pd.DataFrame({'column': [['A','C','B','A'],
                              ['HELLO','HELLO','ha'],
                              ['test/1', 'test/1', 'test2']]})

CodePudding user response：

Leverage set.

If already a list

df['column2']=[list(set(x)) for x in df.column.to_list()]

       

     column          column2
0             [A, C, B, A]        [A, C, B]
1       [HELLO, HELLO, ha]      [HELLO, ha]
2  [test/1, test/1, test2]  [test2, test/1]

otherwise

df['column2']=df['column'].str.replace('\]|\[','',regex=True).str.split(',').map(set).map(list)

df['column2']=df['column'].str.replace('\]|\[','',regex=True).apply(lambda x:list(set(x.split(','))))


       

                  column          column2
0             [A, C, B, A]        [A, C, B]
1       [HELLO, HELLO, ha]      [HELLO, ha]
2  [test/1, test/1, test2]  [test2, test/1]