I am trying to find duplicates on a subset of a Pandas Dataframe. Where there is duplication I need to update a value with the previous row's column entry.
import pandas as pd
dct = {'day': ['Mon', 'Tue', 'Wed', 'Wed', 'Thur', 'Fri', 'Fri', 'Sat', 'Sun'],
'id': ['2021-01', '2021-01', '2021-02', 'xyz', '2021-09',
'2021-09', 'abc', '2021-08', '2021-01'],
'p1': [1,1,1,2,3,9,6,12,3]}
df = pd.DataFrame(dct)
day id p1
0 Mon 2021-01 1
1 Tue 2021-01 1
2 Wed 2021-02 1
3 Wed xyz 2
4 Thu 2021-09 3
5 Fri 2021-09 9
6 Fri abc 6
7 Sat 2021-08 12
8 Sun 2021-01 3
In the above simplified example I wish to find duplicate "day" entries then update the last entry "id" value with the fist entry "id" value.
For Example index 3 "id" becomes index 2 "id" etc.:
day id p1
0 Mon 2021-01 1
1 Tue 2021-01 1
2 Wed 2021-02 1
3 Wed 2021-02 2
4 Thu 2021-09 3
5 Fri 2021-09 9
6 Fri 2021-09 6
7 Sat 2021-08 12
8 Sun 2021-01 3
I can do this using df.iterrows()
but feel this is not the best approach as I have considerably large datasets.
I have attempted using a df.duplicated()
and the first and last "keep" property to set the new values but get warnings:
A value is trying to be set on a copy of a slice from a DataFrame
This code produces the warning:
df.loc[df.duplicated(subset=['day'],keep='last')]['id'] = df.loc[df.duplicated(subset=['day'],keep='first')]['id']
How can I achieve this result without encountering this type of warning whilst not using iterrows?
CodePudding user response:
Use a mask
and ffill
. For duplicated
you need to keep the first (i.e. mark the other ones as duplicated).
m = df['day'].duplicated()
df['id'] = df['id'].mask(m).ffill()
output:
day id p1
0 Mon 2021-01 1
1 Tue 2021-01 1
2 Wed 2021-02 1
3 Wed 2021-02 2
4 Thur 2021-09 3
5 Fri 2021-09 9
6 Fri 2021-09 6
7 Sat 2021-08 12
8 Sun 2021-01 3