How to drop duplicates ignoring one column-CodePudding

I have a DataFrame with multiple columns and the last column is timestamp which I want Python to ignore. I've used drop_columns(subset=...) but does not work as it returns literally the same DataFrame.

This is what the DataFrame looks like:

	id	name	features	timestamp
1	34233	Bob	athletics	04-06-2022
2	23423	John	mathematics	03-06-2022
3	34233	Bob	english_literature	06-06-2022
4	23423	John	mathematics	10-06-2022
...	...	...	...	...

And this is are the data types when doing df.dtypes:


id	int64
name	object
features	object
timestamp	object

Lastly, this is the piece of code I used:

df.drop_duplicates(subset=df.columns.tolist().remove("timestamp"), keep="first").reset_index(drop=True)

The idea is to keep track of changes based on a timestamp IF there are changes to the other columns. For instance, I don't want to keep row 4 because nothing has changed with John, however, I want to keep Bob as it has changed from athletics to english_literature. Does that make sense?

CodePudding user response：

The remove method of a list returns None. That's why the returned dataframe is similar. You can do as follows:

Create the list of columns for the subset: col_subset = df.columns.tolist()
Remove timestamp: col_subset.remove('timestamp')
Use the col_subset list in the drop_duplicates() function: df.drop_duplicates(subset=col_subset, keep="first").reset_index(drop=True)

CodePudding user response：

You can do that using the method drop.

here is an working example: https://abstra.show/724UMzdRXx

CodePudding user response：

Try this:

df.drop_duplicates(subset=[x for x in df.columns if x != "timestamp"]).reset_index(drop=True)