I have a data frame that looks like this:
Identification | Date (day/month/year) | X | Y |
---|---|---|---|
123 | 01/01/2022 | NaN | abc |
123 | 02/01/2022 | 200 | acb |
123 | 03/01/2022 | 200 | ary |
124 | 01/01/2022 | 200 | abc |
124 | 02/01/2022 | NaN | abc |
124 | 03/01/2022 | NaN | NaN |
I am trying to create two separate 'change' columns, one for X and one for Y, each keeping a running count of how many times the value has changed over time within each Identification. I would like my output to look something like this, where NaN ---> NaN is not counted as a change but NaN ---> some element (or some element ---> NaN) is counted:
Identification | Date (day/month/year) | X | Y | Change X | Change Y |
---|---|---|---|---|---|
123 | 01/01/2022 | NaN | abc | 0 | 0 |
123 | 02/01/2022 | 200 | acb | 1 | 1 |
123 | 03/01/2022 | 200 | ary | 1 | 2 |
124 | 01/01/2022 | 200 | abc | 0 | 0 |
124 | 02/01/2022 | NaN | abc | 1 | 0 |
124 | 03/01/2022 | NaN | NaN | 1 | 1 |
Thanks :)
Answer:
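You can use a classical comparison with the previous item (obtained with groupby.shift) combined with a groupby.cumsum. However, a NaN compared with another NaN yields False (they are never considered equal), so NaN ---> NaN would be counted as a change. To overcome this, we can first fillna with a value that is not part of the dataset. Here I chose object; it could be -1 if your data is strictly positive. The first row of each group always differs from its (missing) shifted value, which is why 1 is subtracted from the cumulative sum.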
def change(s):
    # replace NaN with a sentinel so that NaN ---> NaN compares as equal
    s = s.fillna(object)
    return (s.ne(s.groupby(df['Identification']).shift())
             .groupby(df['Identification']).cumsum().sub(1)
           )

out = df.join(df[['X', 'Y']].apply(change).add_prefix('Change '))
print(out)
Output:
   Identification  Date (day/month/year)      X    Y  Change X  Change Y
0             123             01/01/2022    NaN  abc         0         0
1             123             02/01/2022  200.0  acb         1         1
2             123             03/01/2022  200.0  ary         1         2
3             124             01/01/2022  200.0  abc         0         0
4             124             02/01/2022    NaN  abc         1         0
5             124             03/01/2022    NaN  NaN         1         1
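If you would rather not pick a sentinel value, a variant of the same idea is to mask out the NaN ---> NaN case explicitly. This is only a sketch under a few assumptions: the data frame is rebuilt by hand from the table in the question, the rows are already sorted by date within each Identification, and the helper name change_count is made up for this example.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (dates kept as plain strings).
df = pd.DataFrame({
    'Identification': [123, 123, 123, 124, 124, 124],
    'Date (day/month/year)': ['01/01/2022', '02/01/2022', '03/01/2022'] * 2,
    'X': [np.nan, 200, 200, 200, np.nan, np.nan],
    'Y': ['abc', 'acb', 'ary', 'abc', 'abc', np.nan],
})

def change_count(s, keys):
    """Running count of value changes per group (hypothetical helper)."""
    prev = s.groupby(keys).shift()        # previous value within each group
    both_nan = s.isna() & prev.isna()     # NaN ---> NaN is not a change
    first = ~keys.duplicated()            # first row of each group has no predecessor
    changed = s.ne(prev) & ~both_nan & ~first
    return changed.astype(int).groupby(keys).cumsum()

keys = df['Identification']
out = df.join(
    df[['X', 'Y']].apply(change_count, keys=keys).add_prefix('Change ')
)
print(out)
```

On the sample data this should give the same Change X and Change Y columns as the expected output in the question. The trade-off is a couple of extra boolean masks in exchange for not having to choose a fill value that is guaranteed to be absent from the data.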