Situation
I have a dataframe similar to the one below (I've removed many of the rows for this example, as the gaps in the 'index' column show):
df
index | id | name | last_updated |
---|---|---|---|
0 | 1518 | Maker | 2022-12-31T03:02:00.000Z |
1 | 1518 | Maker | 2022-12-31T02:02:00.000Z |
2 | 1518 | Maker | 2022-12-31T14:02:00.000Z |
3 | 1518 | Maker | 2022-12-31T16:02:00.000Z |
23 | 1518 | Maker | 2022-12-31T17:02:00.000Z |
24 | 2280 | Filecoin | 2022-12-31T01:02:00.000Z |
25 | 2280 | Filecoin | 2022-12-31T03:01:00.000Z |
26 | 2280 | Filecoin | 2022-12-31T02:01:00.000Z |
27 | 2280 | Filecoin | 2022-12-31T00:02:00.000Z |
47 | 2280 | Filecoin | 2022-12-31T08:02:00.000Z |
48 | 4558 | Flow | 2022-12-31T01:02:00.000Z |
49 | 4558 | Flow | 2022-12-31T02:01:00.000Z |
71 | 4558 | Flow | 2022-12-31T05:02:00.000Z |
72 | 5026 | Orchid | 2022-12-31T01:02:00.000Z |
73 | 5026 | Orchid | 2022-12-31T03:02:00.000Z |
74 | 5026 | Orchid | 2022-12-31T02:01:00.000Z |
75 | 5026 | Orchid | 2022-12-31T00:02:00.000Z |
I want a version of the above dataframe with only one row for each id value, keeping the last instance.
This is my code:
df.drop_duplicates(subset=['id'], keep='last')
Expectation
That the new df would retain only 4 rows: the 'last' instance for each 'id' value in dataframe df.
Result
After running the drop_duplicates command, df is exactly the same dataframe, with the same shape as before my drop_duplicates attempt.
I've been trying to use this post to sort it out, but obvs there's something I'm not getting right:
pandas select rows with no duplicate
I'd appreciate any input on why the rows with duplicate 'id' values are not being dropped.
CodePudding user response:
You should add inplace=True: df.drop_duplicates(subset=['id'], keep='last', inplace=True). Without it, drop_duplicates returns a new dataframe and leaves df unchanged. With inplace=True, df itself is modified.
See documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
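As a minimal sketch of the difference, the snippet below uses a small made-up frame (not the asker's data; only the column names are borrowed from the question) to show what each form of the call returns:

```python
import pandas as pd

# Small hypothetical frame; the values are for illustration only.
df = pd.DataFrame({
    "id": [1518, 1518, 2280, 2280],
    "name": ["Maker", "Maker", "Filecoin", "Filecoin"],
})

# Without inplace=True the call returns a new frame and leaves df untouched.
deduped = df.drop_duplicates(subset=["id"], keep="last")
rows_before = len(df)       # df still has all 4 rows
rows_after = len(deduped)   # one row per id

# With inplace=True the call returns None and df itself is modified.
result = df.drop_duplicates(subset=["id"], keep="last", inplace=True)
```

This is also why writing df = df.drop_duplicates(..., inplace=True) is a mistake: it would rebind df to None.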
Hope this helps!
CodePudding user response:
Either reassign the variable:
from io import StringIO

import pandas as pd

# Named "data" rather than "str" to avoid shadowing the built-in str type.
data = """
index id name last_updated
0 1518 Maker 2022-12-31T03:02:00.000Z
1 1518 Maker 2022-12-31T02:02:00.000Z
2 1518 Maker 2022-12-31T14:02:00.000Z
3 1518 Maker 2022-12-31T16:02:00.000Z
23 1518 Maker 2022-12-31T17:02:00.000Z
24 2280 Filecoin 2022-12-31T01:02:00.000Z
25 2280 Filecoin 2022-12-31T03:01:00.000Z
26 2280 Filecoin 2022-12-31T02:01:00.000Z
27 2280 Filecoin 2022-12-31T00:02:00.000Z
47 2280 Filecoin 2022-12-31T08:02:00.000Z
48 4558 Flow 2022-12-31T01:02:00.000Z
49 4558 Flow 2022-12-31T02:01:00.000Z
71 4558 Flow 2022-12-31T05:02:00.000Z
72 5026 Orchid 2022-12-31T01:02:00.000Z
73 5026 Orchid 2022-12-31T03:02:00.000Z
74 5026 Orchid 2022-12-31T02:01:00.000Z
75 5026 Orchid 2022-12-31T00:02:00.000Z
"""
# sep=r"\s+" splits on any whitespace, so it works for tabs or spaces.
df = pd.read_csv(StringIO(data), sep=r"\s+")
# drop_duplicates returns a new dataframe, so assign it back to df.
df = df.drop_duplicates(subset=['id'], keep='last')
print(df) # 4 rows
Or set inplace to True:
df.drop_duplicates(subset=['id'], keep='last', inplace=True)
print(df) # 4 rows
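One caveat worth checking: keep='last' keeps the last row in the dataframe's current row order, not the row with the most recent last_updated. If "last" is meant as the latest timestamp (an assumption; the question does not say so), one option is to sort by last_updated before dropping duplicates. A rough sketch using the four Orchid rows from the question:

```python
import pandas as pd

# The four Orchid rows from the question, in their original order.
df = pd.DataFrame({
    "id": [5026, 5026, 5026, 5026],
    "name": ["Orchid"] * 4,
    "last_updated": [
        "2022-12-31T01:02:00.000Z",
        "2022-12-31T03:02:00.000Z",
        "2022-12-31T02:01:00.000Z",
        "2022-12-31T00:02:00.000Z",
    ],
})

# keep='last' on the unsorted frame keeps the final row by position.
by_position = df.drop_duplicates(subset=["id"], keep="last")

# Parse the timestamps, sort by them, then keep the last row per id,
# which is now the most recent one.
df["last_updated"] = pd.to_datetime(df["last_updated"])
by_time = df.sort_values("last_updated").drop_duplicates(subset=["id"], keep="last")
```

Here by_position holds the 00:02 row (the last one listed), while by_time holds the 03:02 row (the most recent one).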