How to calculate diff() based on a value condition? Python

Time:04-29

I have a pandas df, like this:

    ID  date        value
0   10  2022-01-01  100
1   10  2022-01-02  150
2   10  2022-01-03  0
3   10  2022-01-04  0
4   10  2022-01-05  200
5   10  2022-01-06  0
6   10  2022-01-07  150
7   10  2022-01-08  0
8   10  2022-01-09  0
9   10  2022-01-10  0
10  10  2022-01-11  0
11  10  2022-01-12  100
12  23  2022-02-01  490
13  23  2022-02-02  0
14  23  2022-02-03  350
15  23  2022-02-04  333
16  23  2022-02-05  0
17  23  2022-02-06  0
18  23  2022-02-07  0
19  23  2022-02-08  211
20  23  2022-02-09  100

I would like to calculate the number of days since the last non-zero value, as in the example below. How can I use diff() for this? The calculation should restart for each ID.

Output:

    ID  date        value  days_last_value
0   10  2022-01-01  100    0
1   10  2022-01-02  150    1
2   10  2022-01-03  0
3   10  2022-01-04  0
4   10  2022-01-05  200    3
5   10  2022-01-06  0
6   10  2022-01-07  150    2
7   10  2022-01-08  0
8   10  2022-01-09  0
9   10  2022-01-10  0
10  10  2022-01-11  0
11  10  2022-01-12  100    5
12  23  2022-02-01  490    0
13  23  2022-02-02  0
14  23  2022-02-03  350    2
15  23  2022-02-04  333    1
16  23  2022-02-05  0
17  23  2022-02-06  0
18  23  2022-02-07  0
19  23  2022-02-08  211    4
20  23  2022-02-09  100    1

CodePudding user response:

Explanation below.

import pandas as pd

df = pd.DataFrame({'ID': 12 * [10] + 9 * [23],
                   'value': [100, 150, 0, 0, 200, 0, 150, 0, 0, 0, 0, 100, 490, 0, 350, 333, 0, 0, 0, 211, 100]})

days = df.groupby(['ID', (df['value'] != 0).cumsum()]).size().groupby('ID').shift(fill_value=0)
days.index = df.index[df['value'] != 0]
df['days_last_value'] = days
df
    ID  value  days_last_value
0   10    100              0.0
1   10    150              1.0
2   10      0              NaN
3   10      0              NaN
4   10    200              3.0
5   10      0              NaN
6   10    150              2.0
7   10      0              NaN
8   10      0              NaN
9   10      0              NaN
10  10      0              NaN
11  10    100              5.0
12  23    490              0.0
13  23      0              NaN
14  23    350              2.0
15  23    333              1.0
16  23      0              NaN
17  23      0              NaN
18  23      0              NaN
19  23    211              4.0
20  23    100              1.0

First, we have to group by 'ID'. We also create a group for each block of days, by building a True/False series that marks where value is not 0 and then taking its cumulative sum. That is the part (df['value'] != 0).cumsum(), which results in

0      1
1      2
2      2
3      2
4      3
5      3
6      4
7      4
8      4
9      4
10     4
11     5
12     6
13     6
14     7
15     8
16     8
17     8
18     8
19     9
20    10

We can also group on the values in this series; combined with the 'ID' grouping, this gives the individual blocks of days. This is the df.groupby(['ID', (df['value'] != 0).cumsum()]) part.
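To see what that grouping produces, here is a minimal sketch on a smaller, made-up frame (the data and the names small, block and sizes are just for illustration, not part of the answer's code):

```python
import pandas as pd

# Small illustrative frame (hypothetical data, not the full example above)
small = pd.DataFrame({
    'ID':    [10, 10, 10, 10, 23, 23, 23],
    'value': [100, 0, 0, 200, 490, 0, 350],
})

# A new block starts at every non-zero value
block = (small['value'] != 0).cumsum().rename('block')

# Number of rows in each (ID, block) pair
sizes = small.groupby(['ID', block]).size()
print(sizes)
```

Each entry of sizes is the row count of one block: the non-zero row plus the zero rows that follow it within the same ID.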

Now, for each block, we get its size. A block starts at a non-zero value and runs until just before the next one, so its size is the number of rows (days, given one row per day) between that non-zero value and the next. That interval belongs to the next non-zero row, so we shift the sizes down by one and fill the first entry with 0. The shift has to happen per ID, so we group by ID again before shifting (the grouping is gone after .size()).
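The per-ID shift can be sketched in isolation like this. The sizes series below is hand-written, hypothetical data shaped like the result of .size(); the level name 'block' is an assumption for readability:

```python
import pandas as pd

# Hypothetical block sizes, indexed by (ID, block) as .size() would return them
sizes = pd.Series(
    [1, 3, 2, 1, 2, 1],
    index=pd.MultiIndex.from_tuples(
        [(10, 1), (10, 2), (10, 3), (23, 4), (23, 5), (23, 6)],
        names=['ID', 'block'],
    ),
)

# Move every size down by one position within its ID;
# the first block of each ID has nothing before it and is filled with 0
gaps = sizes.groupby(level='ID').shift(fill_value=0)
print(gaps)
```

Without the second groupby, the last size of ID 10 would leak into the first row of ID 23.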

Now, this new series needs to be assigned back to the dataframe, but it is shorter, and its index no longer matches the dataframe's (it now holds the group keys), so we can't simply assign it (not with df['days_last_value'], df.loc[...] or df.iloc).

Instead, we select the index values of the original dataframe where value is not zero, and set the index of days to those. Now it's an easy step to assign days directly to the new column in the dataframe: pandas will match the indices, and rows without a match get NaN.
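Since the question asks about diff(), here is an alternative sketch that uses the date column instead of counting rows. It is not the approach explained above, and it assumes date can be parsed with pd.to_datetime. The sample data is made up and deliberately skips a calendar day (2022-01-04 and 2022-02-03 are missing) to show that the result is a real date difference:

```python
import pandas as pd

# Hypothetical sample with missing calendar days
df = pd.DataFrame({
    'ID': 4 * [10] + 3 * [23],
    'date': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-05',
                            '2022-02-01', '2022-02-02', '2022-02-04']),
    'value': [100, 150, 0, 200, 490, 0, 350],
})

# Keep only rows with a non-zero value; their original index labels are preserved
nonzero = df[df['value'] != 0]

# Date difference to the previous non-zero row within each ID, in days;
# the first non-zero row of each ID has no predecessor, so it is filled with 0
days = nonzero.groupby('ID')['date'].diff().dt.days.fillna(0)

# Index alignment leaves NaN on the rows where value is 0
df['days_last_value'] = days
print(df)
```

Because the filtered rows keep their original index labels, no manual re-indexing is needed before the assignment.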
