Good day,
I have a large dataset with columns that keep track of the scores each person obtains. Here is a sample of the dataset:
In:
import pandas as pd

data = [[7, 10, 10, 10, 10], [17, 10, 10, 10, 10], [18, 8, 10, 10, 10], [20, 10, 10, 9, 9], [25, 9, 8, 8, 7]]
df = pd.DataFrame(data, columns=['person_id', 'score_1', 'score_2', 'score_3', 'score_4'])
df
Out:
   person_id  score_1  score_2  score_3  score_4
0          7       10       10       10       10
1         17       10       10       10       10
2         18        8       10       10       10
3         20       10       10        9        9
4         25        9        8        8        7
I need to find the rows where the score values change at least twice from one column to the next. For example:
- Row 0: all scores are the same, 0 changes
- Row 1: all scores are the same, 0 changes
- Row 2: score_1 differs from score_2, 1 change
- Row 3: score_2 differs from score_3, 1 change
- Row 4: score_1 differs from score_2, and score_3 differs from score_4, 2 changes
That means only row 4 meets my requirement (see the sketch below).
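Put differently, a change is any pair of adjacent score columns whose values differ. A plain-Python sketch of the rule I mean (count_changes is just an illustrative name):

def count_changes(scores):
    # number of adjacent pairs whose values differ
    return sum(a != b for a, b in zip(scores, scores[1:]))

count_changes([10, 10, 10, 10])  # 0 -> excluded
count_changes([8, 10, 10, 10])   # 1 -> excluded
count_changes([9, 8, 8, 7])      # 2 -> included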
Thus, the desired output would be:
   person_id  score_1  score_2  score_3  score_4
4         25        9        8        8        7
All help greatly appreciated!
CodePudding user response:
IIUC, you want to count the number of unique values per row, limited to the "score*" columns. You can use nunique on the rows after selecting those columns with filter, then slice:
df[df.filter(like='score').nunique(axis=1).gt(2)]
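To see the intermediate step (output shown for illustration):

df.filter(like='score').nunique(axis=1)
# 0    1
# 1    1
# 2    2
# 3    2
# 4    3
# dtype: int64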
If you really want left-to-right changes, so that A->B->A->B counts as 3 changes:
df[df.filter(like='score').diff(axis=1).ne(0).sum(axis=1).gt(2)]
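For instance, on a hypothetical row 9, 8, 9, 8 the two versions disagree: nunique sees only 2 distinct values, while diff sees 3 changes:

row = pd.DataFrame([[9, 8, 9, 8]], columns=['score_1', 'score_2', 'score_3', 'score_4'])
row.nunique(axis=1)                 # 2 -> not selected by the nunique version
row.diff(axis=1).ne(0).sum(axis=1)  # 4 -> selected by .gt(2)

(The sum is 4 rather than 3 because diff leaves NaN in the first column and NaN.ne(0) is True; the .gt(2) threshold accounts for that extra count.)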
Output (either version, on the sample df):

   person_id  score_1  score_2  score_3  score_4
4         25        9        8        8        7
CodePudding user response:
Mozway's answer for the case where A->B->A->B counts as 3 changes is brilliant:
df[df.filter(like='score').diff(axis=1).ne(0).sum(axis=1).gt(2)]
A plainer version, based on Mozway's, could be:
df[ # the following lines build a boolean mask
    df.filter(like='score')   # keep only the score columns
    .diff(axis=1).iloc[:, 1:] # diff adjacent columns, then drop the first one, whose values are all NaN
    .ne(0).sum(axis=1)        # count the values that changed, i.e. differ from 0
    .ge(2)                    # select the requested rows: count >= 2
]
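The iloc[:, 1:] step is what lets us test .ge(2) directly: diff leaves NaN in the first column, and NaN.ne(0) is True, so keeping that column would inflate every row's count by one (which is why the one-liner above uses .gt(2) instead). A quick check on the sample data:

d = df.filter(like='score').diff(axis=1)
d.ne(0).sum(axis=1)              # NaN column included: changes + 1 per row
d.iloc[:, 1:].ne(0).sum(axis=1)  # actual number of changes per row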