Pandas keep rows where column values change at least twice

Time:02-11

Good day,

I have a large dataset with columns that keep track of the scores each person obtains. Here is a sample of the dataset:

In:

data = [[7, 10, 10, 10, 10], [17, 10, 10, 10, 10], [18, 8, 10, 10, 10], [20, 10, 10, 9, 9], [25, 9, 8, 8, 7]]
df = pd.DataFrame(data, columns = ['person_id', 'score_1', 'score_2', 'score_3', 'score_4'])
df

Out:

    person_id   score_1 score_2 score_3 score_4
0   7           10      10      10      10
1   17          10      10      10      10
2   18          8       10      10      10
3   20          10      10      9       9
4   25          9       8       8       7

I need to find the rows where the column values change at least twice. For example:

  • Row 0: All values are the same, no change
  • Row 1: All values are the same, no change
  • Row 2: the 1st and 2nd values differ, 1 change
  • Row 3: the 2nd and 3rd values differ, 1 change
  • Row 4: the values change from 9 to 8 and again from 8 to 7, 2 changes

That means that only row 4 meets my requirements.

Thus, the desired output would be:

    person_id   score_1 score_2 score_3 score_4
4   25          9       8       8       7

All help greatly appreciated!

CodePudding user response:

IIUC, you want to count the number of unique values per row, limited to the "score*" columns.

You can use nunique on the rows after selecting the score columns with filter, then use the result as a boolean mask to slice:

df[df.filter(like='score').nunique(axis=1).gt(2)]
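To see the mask in action, here is a self-contained run on the sample data from the question (more than 2 distinct scores implies the value changed at least twice):

```python
import pandas as pd

# Sample data from the question
data = [[7, 10, 10, 10, 10], [17, 10, 10, 10, 10], [18, 8, 10, 10, 10],
        [20, 10, 10, 9, 9], [25, 9, 8, 8, 7]]
df = pd.DataFrame(data, columns=['person_id', 'score_1', 'score_2',
                                 'score_3', 'score_4'])

# nunique(axis=1) counts distinct values per row; gt(2) keeps rows
# with at least 3 distinct scores, i.e. at least two changes in value
mask = df.filter(like='score').nunique(axis=1).gt(2)
result = df[mask]
print(result['person_id'].tolist())  # [25]
```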

If you really want the changes from left to right, so that A->B->A->B counts as 3 changes:

df[df.filter(like='score').diff(axis=1).ne(0).sum(axis=1).gt(2)]

output:

   person_id  score_1  score_2  score_3  score_4
4         25        9        8        8        7
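One subtlety worth knowing about the diff-based mask: diff(axis=1) leaves NaN in the first column, and NaN.ne(0) evaluates to True, so the row sum is always one more than the real number of changes (which is why the comparison is gt(2) rather than ge(2)). A small check with an illustrative A->B->A->B row (the person_id 30 row is made up for this demo):

```python
import pandas as pd

df = pd.DataFrame({'person_id': [30], 'score_1': [10], 'score_2': [9],
                   'score_3': [10], 'score_4': [9]})

scores = df.filter(like='score')
# diff for this row is [NaN, -1, 1, -1]; ne(0) marks all four as True,
# so the count is 4: three real changes plus the NaN first column
counts = scores.diff(axis=1).ne(0).sum(axis=1)
print(counts.iloc[0])  # 4
```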

CodePudding user response:

Mozway's answer for the case where A->B->A->B counts as 3 changes is brilliant:

df[df.filter(like='score').diff(axis=1).ne(0).sum(axis=1).gt(2)]

A plainer version, based on Mozway's, could be:

df[ # the following lines build a boolean mask
    df.filter(like='score')     # keep only the score columns
      .diff(axis=1).iloc[:, 1:] # diff the columns and drop the first one, whose values are all NaN
      .ne(0).sum(axis=1)        # count the values with a change: they differ from 0
      .ge(2)                    # select the requested rows: count >= 2
  ]
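The two masks select the same rows: dropping the NaN column and comparing with ge(2) is equivalent to keeping it and comparing with gt(2). A quick check on the sample data from the question:

```python
import pandas as pd

data = [[7, 10, 10, 10, 10], [17, 10, 10, 10, 10], [18, 8, 10, 10, 10],
        [20, 10, 10, 9, 9], [25, 9, 8, 8, 7]]
df = pd.DataFrame(data, columns=['person_id', 'score_1', 'score_2',
                                 'score_3', 'score_4'])

scores = df.filter(like='score')
# Mozway's version: the NaN column inflates each count by one, hence gt(2)
mask_a = scores.diff(axis=1).ne(0).sum(axis=1).gt(2)
# Explicit version: drop the NaN column first, then compare with ge(2)
mask_b = scores.diff(axis=1).iloc[:, 1:].ne(0).sum(axis=1).ge(2)

assert mask_a.equals(mask_b)
print(df[mask_b]['person_id'].tolist())  # [25]
```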