Hi, I have a column of lists and I want to return the rows whose lists are identical, i.e. the same elements in the same order.
import pandas as pd

d = {'id': [1, 2, 3], 'lst': [['GG', 'PP', 'DD'], ['DD', 'PP', 'GG'], ['GG', 'PP', 'DD']]}
dd = pd.DataFrame(d)
print(dd)
   id           lst
0   1  [GG, PP, DD]
1   2  [DD, PP, GG]
2   3  [GG, PP, DD]
I tried this, but I get the wrong output:
dd[dd.apply(lambda row: row.lst==row.lst, axis=1)]
   id           lst
0   1  [GG, PP, DD]
1   2  [DD, PP, GG]
2   3  [GG, PP, DD]
My desired output is this:
   id           lst
0   1  [GG, PP, DD]
2   3  [GG, PP, DD]
CodePudding user response:
Use Series.duplicated with keep=False, after converting the lists to tuples (lists are not hashable, tuples are):
df = dd[dd['lst'].apply(tuple).duplicated(keep=False)]
print (df)
   id           lst
0   1  [GG, PP, DD]
2   3  [GG, PP, DD]
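A minimal sketch of why the tuple conversion helps, and why the attempt in the question returns every row. The names self_mask, as_tuples, dup_mask and result are illustrative only and do not appear in the original posts.

```python
import pandas as pd

dd = pd.DataFrame({
    'id': [1, 2, 3],
    'lst': [['GG', 'PP', 'DD'], ['DD', 'PP', 'GG'], ['GG', 'PP', 'DD']],
})

# The attempt in the question compares each row's list with itself,
# so the mask is True for every row and nothing is filtered out.
self_mask = dd.apply(lambda row: row.lst == row.lst, axis=1)

# Tuples are hashable and compare element by element, in order,
# so ('GG', 'PP', 'DD') and ('DD', 'PP', 'GG') count as different values.
as_tuples = dd['lst'].apply(tuple)

# keep=False marks every member of a duplicate group, not only the repeats.
dup_mask = as_tuples.duplicated(keep=False)
result = dd[dup_mask]
print(result)
```

The original lst column is left untouched; the tuples are only used to build the boolean mask.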
CodePudding user response:
Interesting: when lst holds lists, DataFrame.duplicated complains that the values are not hashable. I think pandas should let duplicated handle a column of lists by converting the lists to tuples, as the accepted answer does. With hashable values, such as strings, it works directly:
import pandas as pd

def find_duplicate(df, col):
    """
    df: dataframe
    col: column name
    """
    # keep=False flags every row that has a duplicate, including the first one
    df_dup = df[df.duplicated(subset=col, keep=False)]
    return df_dup

d = {'id': [1, 2, 3], 'lst': ['c', 'b', 'c']}
dd = pd.DataFrame(d)
print(dd)
print(find_duplicate(dd, 'lst'))
output:
   id lst
0   1   c
2   3   c
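Until pandas does something like that itself, the helper above could be adapted along these lines. This is only a sketch: find_duplicate_lists is a hypothetical name, and it assumes each cell is either a flat list of hashable items or an already hashable value (a list containing further lists would still fail).

```python
import pandas as pd

def find_duplicate_lists(df, col):
    """Return the rows of df whose value in col occurs more than once.

    List values are compared as tuples, so order matters.
    """
    # Convert list cells to tuples so they can be hashed; leave other values alone.
    keys = df[col].apply(lambda v: tuple(v) if isinstance(v, list) else v)
    return df[keys.duplicated(keep=False)]

lists_df = pd.DataFrame({
    'id': [1, 2, 3],
    'lst': [['GG', 'PP', 'DD'], ['DD', 'PP', 'GG'], ['GG', 'PP', 'DD']],
})
strings_df = pd.DataFrame({'id': [1, 2, 3], 'lst': ['c', 'b', 'c']})

dup_lists = find_duplicate_lists(lists_df, 'lst')
dup_strings = find_duplicate_lists(strings_df, 'lst')
print(dup_lists)
print(dup_strings)
```

Both calls should return the rows with id 1 and 3, for the list column as well as the string column.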