Home > OS >  How to drop duplicates ignoring one column
How to drop duplicates ignoring one column

Time:06-05

I have a DataFrame with multiple columns and the last column is timestamp which I want Python to ignore. I've used drop_columns(subset=...) but does not work as it returns literally the same DataFrame.

This is what the DataFrame looks like:

id name features timestamp
1 34233 Bob athletics 04-06-2022
2 23423 John mathematics 03-06-2022
3 34233 Bob english_literature 06-06-2022
4 23423 John mathematics 10-06-2022
... ... ... ... ...

And this is are the data types when doing df.dtypes:

id int64
name object
features object
timestamp object

Lastly, this is the piece of code I used:

df.drop_duplicates(subset=df.columns.tolist().remove("timestamp"), keep="first").reset_index(drop=True)

The idea is to keep track of changes based on a timestamp IF there are changes to the other columns. For instance, I don't want to keep row 4 because nothing has changed with John, however, I want to keep Bob as it has changed from athletics to english_literature. Does that make sense?

CodePudding user response:

The remove method of a list returns None. That's why the returned dataframe is similar. You can do as follows:

  1. Create the list of columns for the subset: col_subset = df.columns.tolist()
  2. Remove timestamp: col_subset.remove('timestamp')
  3. Use the col_subset list in the drop_duplicates() function: df.drop_duplicates(subset=col_subset, keep="first").reset_index(drop=True)

CodePudding user response:

You can do that using the method drop.

here is an working example: https://abstra.show/724UMzdRXx

CodePudding user response:

Try this:

df.drop_duplicates(subset=[x for x in df.columns if x != "timestamp"]).reset_index(drop=True)
  • Related