Good day,
I need a way to check each row of a dataframe and drop the row if all the values in that row (across the score columns) are the same. The person_id may differ.
Here is a part of the dataset:
In:
import pandas as pd

data = [[7, 10, 10, 10, 10], [17, 10, 10, 10, 10], [18, 8, 10, 10, 10], [20, 10, 10, 9, 9], [25, 9, 8, 8, 7]]
df = pd.DataFrame(data, columns=['person_id', 'score_1', 'score_2', 'score_3', 'score_4'])
df
Out:
person_id score_1 score_2 score_3 score_4
0 7 10 10 10 10
1 17 10 10 10 10
2 18 8 10 10 10
3 20 10 10 9 9
4 25 9 8 8 7
The desired output would be:
person_id score_1 score_2 score_3 score_4
2 18 8 10 10 10
3 20 10 10 9 9
4 25 9 8 8 7
Rows 0 (person_id 7) and 1 (person_id 17) should be dropped because their scores are all the same.
The number of columns will also change as more score columns are added, so I cannot hard-code the comparison:
df_no_duplicates = df.loc[(df.score_1 != df.score_2) | (df.score_2 != df.score_3) | (df.score_3 != df.score_4)]
CodePudding user response:
A cheeky way to omit the person_id column from the process is to set it as the index temporarily:
result = df.set_index('person_id').drop_duplicates(keep=False).reset_index()
You could also use the subset argument of drop_duplicates:
result = df.drop_duplicates(subset=df.columns.difference(['person_id']), keep=False)
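Both variants can be checked against the sample data from the question; a minimal sketch (variable names `r1`/`r2` are illustrative, not from the answer):

```python
import pandas as pd

data = [[7, 10, 10, 10, 10], [17, 10, 10, 10, 10], [18, 8, 10, 10, 10],
        [20, 10, 10, 9, 9], [25, 9, 8, 8, 7]]
df = pd.DataFrame(data, columns=['person_id', 'score_1', 'score_2',
                                 'score_3', 'score_4'])

# Variant 1: hide person_id in the index, drop fully duplicated rows
r1 = df.set_index('person_id').drop_duplicates(keep=False).reset_index()

# Variant 2: restrict duplicate detection to every column except person_id
r2 = df.drop_duplicates(subset=df.columns.difference(['person_id']),
                        keep=False)

print(r1)
print(r2)
```

One practical difference: the subset variant keeps the original row labels (index 2, 3, 4), while set_index/reset_index renumbers the index from 0.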
CodePudding user response:
You can try nunique over the score columns, keeping rows that contain more than one distinct value:
out = df[df.filter(like='score').nunique(axis=1) > 1].copy()
Out[208]:
person_id score_1 score_2 score_3 score_4
2 18 8 10 10 10
3 20 10 10 9 9
4 25 9 8 8 7
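Since `filter(like='score')` selects columns by name, this approach adapts automatically as score columns are added. A sketch with a hypothetical extra score_5 column (not in the original data) to illustrate:

```python
import pandas as pd

# Sample data extended with a hypothetical score_5 column
data = [[7, 10, 10, 10, 10, 10], [17, 10, 10, 10, 10, 10],
        [18, 8, 10, 10, 10, 10], [20, 10, 10, 9, 9, 9],
        [25, 9, 8, 8, 7, 7]]
cols = ['person_id', 'score_1', 'score_2', 'score_3', 'score_4', 'score_5']
df = pd.DataFrame(data, columns=cols)

# nunique(axis=1) counts distinct values per row; rows where every
# score is identical have a count of 1 and are filtered out
mask = df.filter(like='score').nunique(axis=1) > 1
out = df[mask].copy()
print(out)
```

No code changes are needed when score_6, score_7, etc. appear later, as long as the new columns follow the same naming pattern.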