Home > Back-end >  Pandas drop duplicates across columns
Pandas drop duplicates across columns

Time:02-10

Good day,

I need a way to check each row of a dataframe and drop the row if all the values in that row (across the score columns) are the same. The person_id may differ.

Here is a part of the dataset:

In:

data = [[7, 10, 10, 10, 10], [17, 10, 10, 10, 10], [18, 8, 10, 10, 10], [20, 10, 10, 9, 9], [25, 9, 8, 8, 7]]
df = pd.DataFrame(data, columns = ['person_id', 'score_1', 'score_2', 'score_3', 'score_4'])
df

Out:

    person_id   score_1 score_2 score_3 score_4
0   7           10      10      10      10
1   17          10      10      10      10
2   18          8       10      10      10
3   20          10      10      9       9
4   25          9       8       8       7

The desired output would be:

    person_id   score_1 score_2 score_3 score_4
2   18          8       10      10      10
3   20          10      10      9       9
4   25          9       8       8       7

Since row 0 (person_id 7) and row 1 (person_id 17) have the same scores. The number of columns will also change, adding more score columns - thus, I cannot use

df_no_duplicates = df.loc[(df.score_1 != df.score_2) | (df.score_2 != df.score_3)| (df.score_3 != df.score_4)]

CodePudding user response:

A cheeky way to omit the person_id column from the process is to set it as the index temporarily.

result = df.set_index('person_id').drop_duplicates(keep=False).reset_index()

You could also use the subset argument of drop_duplicates:

result = df.drop_duplicates(subset=df.columns.difference(['person_id']), keep=False)

CodePudding user response:

You can try nunique

out = df[df.filter(like='score').nunique(1)>1].copy()
Out[208]: 
   person_id  score_1  score_2  score_3  score_4
2         18        8       10       10       10
3         20       10       10        9        9
4         25        9        8        8        7
  • Related