I have a list that contains duplicate rows.
I want to remove the duplicate rows that do not have the value 'sh'.
The example below works on a small list, but on a large list this approach does not remove the right rows.
Is there another way to remove the duplicates?
import pandas as pd

union_list = [
    ['10', 'robot_1', 'sh'],
    ['10', 'robot_1', ' '],
    ['11', 'robot_2', 'sh'],
    ['11', 'robot_2', ''],
    ['12', 'robot_3', ''],
]
df = pd.DataFrame(union_list)
# Keep only the first row for each value in column 0.
df1 = df.drop_duplicates(0)
print(df1)
The result I want to get:
    0        1   2
0  10  robot_1  sh
2  11  robot_2  sh
4  12  robot_3
CodePudding user response:
If column 2 only contains empty (or blank) strings and 'sh', and you want to keep the 'sh' row when there are duplicates, you can sort by all columns, which places the empty strings before 'sh' within each group, then call drop_duplicates keeping the last value:
df.sort_values(by=[0,1,2]).drop_duplicates(0, keep='last')
Alternatively, to always prioritize 'sh', whatever the other values in column 2 are:
df.sort_values(by=2, key=lambda x: x=='sh').drop_duplicates(0, keep='last')
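Output of the first command (the second keeps the same rows, but they may come back in a different order unless you append .sort_index()):
    0        1   2
0  10  robot_1  sh
2  11  robot_2  sh
4  12  robot_3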
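For reference, here is a minimal end-to-end sketch of the second approach. It likely also shows why the original code misbehaves on a larger list: drop_duplicates keeps the first row per key by default, so it only gives the wanted result when the 'sh' row happens to come first. The sample rows below are reordered on purpose (an assumption for illustration, not the original data) so that one 'sh' row comes second. The kind='stable' and .sort_index() calls are optional additions, not part of the answer above; they keep ties in their original order and restore the original row order.

```python
import pandas as pd

# Illustrative data: for key '11' the 'sh' row is NOT the first one.
rows = [
    ['10', 'robot_1', 'sh'],
    ['10', 'robot_1', ' '],
    ['11', 'robot_2', ''],
    ['11', 'robot_2', 'sh'],
    ['12', 'robot_3', ''],
]
df = pd.DataFrame(rows)

# Default keep='first': the 'sh' row for key '11' is dropped.
naive = df.drop_duplicates(0)

# Sort on "is this row 'sh'?" (False sorts before True), keep the last
# row per key so 'sh' wins, then restore the original row order.
result = (
    df.sort_values(by=2, key=lambda s: s == 'sh', kind='stable')
      .drop_duplicates(0, keep='last')
      .sort_index()
)
print(result)
```

With this data, naive keeps the empty-string row for key '11', while result keeps its 'sh' row.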