I have a dataframe like this (assuming one column):
column
[A,C,B,A]
[HELLO,HELLO,ha]
[test/1, test/1, test2]
The type of the column above is: dtype('O')
I would like to remove the duplicates here, resulting in:
column
[A,C,B] # - A
[HELLO, ha] # removing 1 hello
[test/1, test2] # removing 1 test/1
Then, I would like to sort the data
column
[A,B,C]
[ha, HELLO]
[test2, test/1] # assuming that number comes before /
I am struggling getting this done in a proper way. Hope anyone has nice ideas (would it make sense to transform to small lists?)
CodePudding user response:
A set is a datatype which is like a list but it only includes unique values. This means that you can convert your list to a set and then back to a list (if you want a list of those unique values).
You would then write something like this:
list(set(name_of_the_list))
CodePudding user response:
Nested list inside of a dataframe is not a good habits to use, maybe for some really specific problems. I strongly encourage you to read how to stick to tidy paradigm (I know it's from a R point of view, but it can be easily transcript into python).
Nonetheless, using Numpy
functions as unique
and sort
can do the job for your example, assuming df
is your dataframe :
Using a for loop on your dataframe
df = pd.DataFrame({"col" :
[["A","C","B","A"],
["HELLO","HELLO","ha"],
["test/1", "test/1", "test2"]]})
for n, value in df.iterrows():
df.loc[n, "col"] = np.sort(np.unique(value.loc["col"]))
df
Out[77]:
col
0 [A, B, C]
1 [HELLO, ha]
2 [test/1, test2]
Using applymap
df2 = pd.DataFrame({"col" :
[["A","C","B","A"],
["HELLO","HELLO","ha"],
["test/1", "test/1", "test2"]]})
df2 = df2.applymap(lambda x : np.sort(np.unique(x)))
df2
Out[75]:
col
0 [A, B, C]
1 [HELLO, ha]
2 [test/1, test2]
CodePudding user response:
Assuming that you have lists in the column, use a list comprehension.
If you want to maintain order:
df['column_keep_order'] = [list(dict.fromkeys(x)) for x in df['column']]
If you want to sort the items:
df['column_sorted'] = [sorted(set(x)) for x in df['column']]
output:
column column_keep_order column_sorted
0 [A, C, B, A] [A, C, B] [A, B, C]
1 [HELLO, HELLO, ha] [HELLO, ha] [HELLO, ha]
2 [test/1, test/1, test2] [test/1, test2] [test/1, test2]
reproducible input:
df = pd.DataFrame({'column': [['A','C','B','A'],
['HELLO','HELLO','ha'],
['test/1', 'test/1', 'test2']]})
CodePudding user response:
Leverage set.
If already a list
df['column2']=[list(set(x)) for x in df.column.to_list()]
column column2
0 [A, C, B, A] [A, C, B]
1 [HELLO, HELLO, ha] [HELLO, ha]
2 [test/1, test/1, test2] [test2, test/1]
otherwise
df['column2']=df['column'].str.replace('\]|\[','',regex=True).str.split(',').map(set).map(list)
or
df['column2']=df['column'].str.replace('\]|\[','',regex=True).apply(lambda x:list(set(x.split(','))))
column column2
0 [A, C, B, A] [A, C, B]
1 [HELLO, HELLO, ha] [HELLO, ha]
2 [test/1, test/1, test2] [test2, test/1]