How to keep a cumulative count of changes across row elements, ignoring NaNs, and creating a separat...

I have a data frame that looks like this:

Identification	Date (day/month/year)	X	Y
123	01/01/2022	NaN	abc
123	02/01/2022	200	acb
123	03/01/2022	200	ary
124	01/01/2022	200	abc
124	02/01/2022	NaN	abc
124	03/01/2022	NaN	NaN

I am trying to create two 'change' columns, one for X and one for Y, each keeping a running count of how many times the value has changed over time within a given Identification. I would like my output to look something like this, where NaN ---> NaN is not counted as a change but NaN ---> some element (or some element ---> NaN) is counted:

Identification	Date (day/month/year)	X	Y	Change X	Change Y
123	01/01/2022	NaN	abc	0	0
123	02/01/2022	200	acb	1	1
123	03/01/2022	200	ary	1	2
124	01/01/2022	200	abc	0	0
124	02/01/2022	NaN	abc	1	0
124	03/01/2022	NaN	NaN	1	1

Thanks :)

CodePudding user response:

You can compare each value with the previous one in its group (obtained with groupby.shift) and accumulate the differences with groupby.cumsum. However, NaN never compares equal to NaN, so a NaN followed by another NaN would be counted as a change. To avoid this, first fillna with a placeholder that cannot occur in the data. Here I chose the built-in object; it could be -1 if your data is strictly positive.

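def change(s):
    # Replace NaN with a sentinel so that NaN -> NaN compares as equal.
    s = s.fillna(object)
    # Flag rows that differ from the previous row of the same Identification,
    # then take the per-group running total. The first row of each group is
    # always flagged (its shifted value is NaN), hence the final sub(1).
    return (s.ne(s.groupby(df['Identification']).shift())
            .groupby(df['Identification']).cumsum().sub(1)
            )

out = df.join(df[['X', 'Y']].apply(change).add_prefix('Change '))

print(out)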

Output:

   Identification Date (day/month/year)      X    Y  Change X  Change Y
0             123            01/01/2022    NaN  abc         0         0
1             123            02/01/2022  200.0  acb         1         1
2             123            03/01/2022  200.0  ary         1         2
3             124            01/01/2022  200.0  abc         0         0
4             124            02/01/2022    NaN  abc         1         0
5             124            03/01/2022    NaN  NaN         1         1
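
If you would rather not pick a placeholder value, the same idea can be sketched with an explicit mask instead of fillna. This is only a minimal, self-contained variant under a few assumptions: the frame is rebuilt by hand from the sample in the question, rows are already sorted by date within each Identification, and change_no_sentinel is a hypothetical helper name, not part of pandas.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Identification": [123, 123, 123, 124, 124, 124],
    "Date (day/month/year)": ["01/01/2022", "02/01/2022", "03/01/2022"] * 2,
    "X": [np.nan, 200, 200, 200, np.nan, np.nan],
    "Y": ["abc", "acb", "ary", "abc", "abc", np.nan],
})


def change_no_sentinel(s, keys):
    """Running per-group change count where NaN -> NaN is not a change."""
    prev = s.groupby(keys).shift()        # previous value within each group
    both_nan = s.isna() & prev.isna()     # NaN followed by NaN
    first = ~keys.duplicated()            # first row of each group
    # A change is a difference from the previous row, excluding the
    # NaN -> NaN case and the first row of each group (nothing precedes it).
    changed = s.ne(prev) & ~both_nan & ~first
    return changed.astype(int).groupby(keys).cumsum()


out = df.join(
    df[["X", "Y"]]
    .apply(lambda s: change_no_sentinel(s, df["Identification"]))
    .add_prefix("Change ")
)
print(out)
```

On the sample data this gives the same Change X and Change Y columns as the output above. The explicit first-row mask replaces the sub(1) step, and no value has to be reserved as a placeholder.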