I'm trying to write a script that finds duplicate rows in a spreadsheet. I'm using the Pandas library. This is the initial dataframe:
import pandas as pd
df = pd.DataFrame({'title': [1, 2, 3, 4, 5, 6, 7, 8],
'val1': [1.1, 1.1, 2.1, 8.8, 1.1, 1.1, 8.8, 8.8],
'val2': [2.2, 3.3, 5.5, 6.2, 2.2, 3.3, 6.2, 6.2],
'val3': [3.4, 4.4, 5.5, 8.4, 0.5, 3.4, 1.9, 3.7]
})
print(df)
title val1 val2 val3
1 1.1 2.2 3.4
2 1.1 3.3 4.4
3 2.1 5.5 5.5
4 8.8 6.2 8.4
5 1.1 2.2 0.5
6 1.1 3.3 3.4
7 8.8 6.2 1.9
8 8.8 6.2 3.7
I have found all duplicate rows using the duplicated method based on the indicated columns and marked them by adding a new column e.g.
df['duplicate'] = df.duplicated(keep=False, subset=['val1', 'val2'])
print(df)
title val1 val2 duplicated
1 1.1 2.2 true
2 1.1 3.3 true
3 2.1 5.5 false
4 8.8 6.2 true
5 1.1 2.2 true
6 1.1 3.3 true
7 8.8 6.2 true
8 8.8 6.2 true
In the last step, I would like to mark all duplicate rows by adding information with the title of the first occurrence. This way I want to make it easier to sort and group them later. This is what the result would look like:
title val1 val2 first_occurence
1 1.1 2.2 true
2 1.1 3.3 true
3 2.1 5.5 false
4 8.8 6.2 true
5 1.1 2.2 title1
6 1.1 3.3 title2
7 8.8 6.2 title4
8 8.8 6.2 title4
I tried to find a similar topic, but was unsuccessful. Does anyone have an idea how to do it?
CodePudding user response:
You can use two variants of duplicated
(with and without keep=False
) and a merge
:
cols = ['val1', 'val2']
# duplicates, except first
m1 = df.duplicated(cols)
# duplicates including first
m2 = df.duplicated(cols, keep=False)
df['first_occurrence'] = (
df[cols].merge(df.loc[~m1, cols ['title']],
how='left')['title']
.astype(str).radd('title')
.where(m1, m2)
)
Output:
title val1 val2 val3 first_occurrence
0 1 1.1 2.2 3.4 True
1 2 1.1 3.3 4.4 True
2 3 2.1 5.5 5.5 False
3 4 8.8 6.2 8.4 True
4 5 1.1 2.2 0.5 title1
5 6 1.1 3.3 3.4 title2
6 7 8.8 6.2 1.9 title4
7 8 8.8 6.2 3.7 title4
CodePudding user response:
You can't do in Pandas. That's a possible solution:
- Get a list of duplicate rows
- Iterate this list and generate a new row with a new column like "duplicate_index" and insert in this column the title number of the first equal row for each duplicated rows
- Insert all rows (original with empty value in "duplicate_index") in a new df
- Save the new df