I have a pandas data frame that looks like this:
Count Status
Date
2021-01-01 11 1
2021-01-02 13 1
2021-01-03 14 1
2021-01-04 8 0
2021-01-05 8 0
2021-01-06 5 0
2021-01-07 2 0
2021-01-08 6 1
2021-01-09 8 1
2021-01-10 10 0
I want to calculate the difference between the initial and final value of the "Count" column before the "Status" column changes from 0 to 1 or vice-versa (for every cycle) and make a new dataframe out of these values.
The output for this example would be:
Cycle Difference
1 3
2 -6
3 2
CodePudding user response:
Use GroupBy.agg
by consecutive groups created by comapre shifted values with cumulative sum, last subtract last and first value:
df = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum().rename('Cycle'))['Count']
.agg(['first','last'])
.eval('last - first')
.reset_index(name='Difference'))
print (df)
Cycle Difference
0 1 3
1 2 -6
2 3 2
3 4 0
If need filter out groups rows with 1 row is possible add aggregation GroupBy.size
and then filter oupt rows by DataFrame.loc
:
df = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum().rename('Cycle'))['Count']
.agg(['first','last', 'size'])
.loc[lambda x: x['size'] > 1]
.eval('last - first')
.reset_index(name='Difference'))
print (df)
Cycle Difference
0 1 3
1 2 -6
2 3 2
CodePudding user response:
You can use a GroupBy.agg
on the groups formed of the consecutive values, then get the first minus last value (see below for variants):
out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())
['Count'].agg(lambda x: x.iloc[-1]-x.iloc[0])
)
output:
Status
1 3
2 -6
3 2
4 0
Name: Count, dtype: int64
If you only want to do this for groups of more than one element:
out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())
['Count'].agg(lambda x: x.iloc[-1]-x.iloc[0] if len(x)>1 else pd.NA)
.dropna()
)
output:
Status
1 3
2 -6
3 2
Name: Count, dtype: object
output as DataFrame:
add .rename_axis('Cycle').reset_index(name='Difference')
:
out = (df.groupby(df['Status'].ne(df['Status'].shift()).cumsum())
['Count'].agg(lambda x: x.iloc[-1]-x.iloc[0] if len(x)>1 else pd.NA)
.dropna()
.rename_axis('Cycle').reset_index(name='Difference')
)
output:
Cycle Difference
0 1 3
1 2 -6
2 3 2