Is there a way to customize drop_duplicates so that it drops the "kind of" duplicates?
Example: pandas df
Year | Name | ID | City |
---|---|---|---|
2011 | Superman | 101 | Metropolis |
2011 | Batman | 102 | Gotham |
2012 | The Batman | 102 | Gotham |
2011 | Noobmaster69 | 103 | Online |
2011 | Noobmaster69 | 103 | Online |
I tried drop_duplicates, which only removed the exact duplicate row, so I got this:
Year | Name | ID | City |
---|---|---|---|
2011 | Superman | 101 | Metropolis |
2011 | Batman | 102 | Gotham |
2012 | The Batman | 102 | Gotham |
2011 | Noobmaster69 | 103 | Online |
I want to reduce it further: for ID 102, only the "The Batman" row should stay in the data frame, because it is the newer info (2012 > 2011). I'm expecting something like this:
Year | Name | ID | City |
---|---|---|---|
2011 | Superman | 101 | Metropolis |
2012 | The Batman | 102 | Gotham |
2011 | Noobmaster69 | 103 | Online |
CodePudding user response:
# Try this: duplicates can be dropped using the ID column alone.
import pandas as pd
# read your table data (read_csv already returns a DataFrame)
df = pd.read_csv("your_filename.csv")
# keep only the last row for each ID
df = df.drop_duplicates(subset='ID', keep='last')
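subset="specific_col" restricts the duplicate check to that column, so rows count as duplicates when they share the same ID even if Year or Name differ. keep="last" keeps the last occurrence within each group of duplicates and drops the earlier ones. This keeps the 2012 row here only because it comes after the 2011 row in the data.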
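If the rows are not guaranteed to be ordered by Year, one option is to sort first so that the newest row for each ID really is the last one. The following is a minimal sketch under that assumption, not the only way to do it: it rebuilds the sample table from the question inline instead of reading a CSV, and the variable name `result` is just for illustration.

```python
import pandas as pd

# Sample data copied from the question, including the exact duplicate row.
df = pd.DataFrame(
    {
        "Year": [2011, 2011, 2012, 2011, 2011],
        "Name": ["Superman", "Batman", "The Batman", "Noobmaster69", "Noobmaster69"],
        "ID": [101, 102, 102, 103, 103],
        "City": ["Metropolis", "Gotham", "Gotham", "Online", "Online"],
    }
)

# Sort by Year so the newest row for each ID comes last, then keep that row.
# kind="stable" preserves the original order among rows with the same Year.
result = (
    df.sort_values("Year", kind="stable")
      .drop_duplicates(subset="ID", keep="last")
      .sort_values("ID")          # restore a readable order by ID
      .reset_index(drop=True)
)

print(result)
```

On the sample data this leaves one row per ID, with "The Batman" (2012) kept for ID 102. The final sort by ID is only for display and can be dropped if the order does not matter.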