Home > Software design >  Summing values up to a column value change in pandas dataframe
Summing values up to a column value change in pandas dataframe

Time:07-05

I have a pandas data frame that looks like this:

            Count  Status
Date                                                                       
2021-01-01  11     1
2021-01-02  13     1
2021-01-03  14     1
2021-01-04  8      0
2021-01-05  8      0
2021-01-06  5      0
2021-01-07  2      0
2021-01-08  6      1
2021-01-09  8      1
2021-01-10  10     0

I want to calculate the difference between the initial and final value of the "Count" column before the "Status" column changes from 0 to 1 or vice-versa (for every cycle) and make a new dataframe out of these values.

The output for this example would be:

Cycle  Difference
1       3
2      -6
3       2

CodePudding user response:

Use GroupBy.agg by consecutive groups created by comapre shifted values with cumulative sum, last subtract last and first value:

df = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum().rename('Cycle'))['Count']
        .agg(['first','last'])
        .eval('last - first')
        .reset_index(name='Difference'))
print (df)
   Cycle  Difference
0      1           3
1      2          -6
2      3           2
3      4           0

If need filter out groups rows with 1 row is possible add aggregation GroupBy.size and then filter oupt rows by DataFrame.loc:

df = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum().rename('Cycle'))['Count']
        .agg(['first','last', 'size'])
        .loc[lambda x: x['size'] > 1]
        .eval('last - first')
        .reset_index(name='Difference'))
print (df)
   Cycle  Difference
0      1           3
1      2          -6
2      3           2

CodePudding user response:

You can use a GroupBy.agg on the groups formed of the consecutive values, then get the first minus last value (see below for variants):

out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())
         ['Count'].agg(lambda x: x.iloc[-1]-x.iloc[0])
       )

output:

Status
1    3
2   -6
3    2
4    0
Name: Count, dtype: int64

If you only want to do this for groups of more than one element:

out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())
         ['Count'].agg(lambda x: x.iloc[-1]-x.iloc[0] if len(x)>1 else pd.NA)
         .dropna()
       )

output:

Status
1     3
2    -6
3     2
Name: Count, dtype: object

output as DataFrame:

add .rename_axis('Cycle').reset_index(name='Difference'):

out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())
         ['Count'].agg(lambda x: x.iloc[-1]-x.iloc[0] if len(x)>1 else pd.NA)
         .dropna()
         .rename_axis('Cycle').reset_index(name='Difference')
       )

output:

   Cycle Difference
0      1          3
1      2         -6
2      3          2
  • Related