I need to delete duplicates in a large database, where the row that is kept must be chosen according to a priority hierarchy, using either SQLite or Python pandas. Is there an efficient way to do this? Preferably with a pandas DataFrame, but SQLite is also fine. For example, this:
ID | Text | Category |
---|---|---|
1 | text | Priority 3 |
2 | text | Priority 1 |
3 | text | Priority 2 |
4 | text 2 | Priority 3 |
5 | text 2 | Priority 2 |
should turn into this:
ID | Text | Category |
---|---|---|
2 | text | Priority 1 |
5 | text 2 | Priority 2 |
CodePudding user response:
Try this:
# sort so that, within each Text, the highest priority ('Priority 1') comes first
df = df.sort_values(by=['Text', 'Category'], ascending=[True, True])
# keep the first (highest-priority) Category per Text
df.groupby('Text')['Category'].first().reset_index()
Output:
index | Text | Category |
---|---|---|
0 | text | Priority 1 |
1 | text 2 | Priority 2 |
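Note that this drops the ID column shown in the desired output. If you need it as well, one option (a small variation, not part of the original answer) is to group the whole sorted frame instead of a single column:
# keep every column, including ID, by grouping the full frame;
# assumes there are no NaN values, so .first() returns the first row per group
df.sort_values(['Text', 'Category']).groupby('Text', as_index=False).first()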
CodePudding user response:
Very similar approach to @Drakax, but using drop_duplicates instead of groupby and first:
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Text': ['text', 'text', 'text', 'text 2', 'text 2'],
    'Category': ['Priority 3', 'Priority 1', 'Priority 2', 'Priority 3', 'Priority 2'],
})

# sort so the highest priority comes first within each Text,
# then keep the first (highest-priority) row per Text
df.sort_values(['Text', 'Category']).drop_duplicates('Text')
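Both answers rely on the priority labels sorting lexically ('Priority 1' < 'Priority 2' < 'Priority 3'); for an arbitrary hierarchy you would first map Category to a numeric rank and sort on that instead.
Since the question also allows SQLite, here is a minimal sketch of the same idea using Python's sqlite3 and a window function (requires SQLite 3.25+; the database file, table, and column names are assumptions, adjust to your schema):
import sqlite3

con = sqlite3.connect("my_database.db")  # hypothetical database file
con.execute("""
    DELETE FROM items                 -- 'items' is an assumed table name
    WHERE ID NOT IN (
        SELECT ID FROM (
            SELECT ID,
                   ROW_NUMBER() OVER (
                       PARTITION BY Text
                       ORDER BY Category   -- 'Priority 1' sorts before 'Priority 2', etc.
                   ) AS rn
            FROM items
        )
        WHERE rn = 1                  -- keep only the highest-priority row per Text
    )
""")
con.commit()
con.close()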