Home > Enterprise >  Deleting duplicates with hierarchy condition SQL Python Pandas
Deleting duplicates with hierarchy condition SQL Python Pandas

Time:06-11

I need to delete duplicates in a large database, but the rows to be deleted must be based on a hierarchy using either SQLite or Python Pandas. Is there a efficient way to relize this? preferably using python pandas dataframe but SQLite is also fine.

ID Text Category
1 text Priority 3
2 text Priority 1
3 text Priority 2
4 text 2 Priority 3
5 text 2 Priority 2

should turn to this:

ID Text Category
2 text Priority 1
5 text 2 Priority 2

CodePudding user response:

Try this:

df = df.sort_values(by=['Text','Category'], ascending=[True,True])
df.groupby('Text')['Category'].first().reset_index()

Output:

index Text Category
0 text Priority 1
1 text 2 Priority 2

CodePudding user response:

Very similar approach to @Drakax but using drop_duplicates instead of groupby and first

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Text': ['text', 'text', 'text', 'text 2', 'text 2'],
    'Category': ['Priority 3', 'Priority 1', 'Priority 2', 'Priority 3', 'Priority 2'],
})

df.sort_values(['Text','Category']).drop_duplicates('Text')
  • Related