I have a DataFrame
with multiple columns and the last column is timestamp
which I want Python to ignore. I've used drop_columns(subset=...)
but does not work as it returns literally the same DataFrame
.
This is what the DataFrame
looks like:
id | name | features | timestamp | |
---|---|---|---|---|
1 | 34233 | Bob | athletics | 04-06-2022 |
2 | 23423 | John | mathematics | 03-06-2022 |
3 | 34233 | Bob | english_literature | 06-06-2022 |
4 | 23423 | John | mathematics | 10-06-2022 |
... | ... | ... | ... | ... |
And this is are the data types when doing df.dtypes
:
id | int64 |
name | object |
features | object |
timestamp | object |
Lastly, this is the piece of code I used:
df.drop_duplicates(subset=df.columns.tolist().remove("timestamp"), keep="first").reset_index(drop=True)
The idea is to keep track of changes based on a timestamp IF there are changes to the other columns. For instance, I don't want to keep row 4 because nothing has changed with John, however, I want to keep Bob as it has changed from athletics
to english_literature
. Does that make sense?
CodePudding user response:
The remove
method of a list returns None
. That's why the returned dataframe is similar. You can do as follows:
- Create the list of columns for the subset:
col_subset = df.columns.tolist()
- Remove timestamp:
col_subset.remove('timestamp')
- Use the col_subset list in the
drop_duplicates()
function: df.drop_duplicates(subset=col_subset, keep="first").reset_index(drop=True)
CodePudding user response:
You can do that using the method drop.
here is an working example: https://abstra.show/724UMzdRXx
CodePudding user response:
Try this:
df.drop_duplicates(subset=[x for x in df.columns if x != "timestamp"]).reset_index(drop=True)