Pandas: Calculate running difference based on condition from another column

I want to calculate the running difference of the column 'Values' based on a binary flag in another column, 'Condition'. If the condition is 0, the result should be the difference between the current row and the preceding row. If the condition is 1, the result should be the difference between the current row and the previous row where the condition was also 1, like so:


  Values    Condition   Desired_Output
0   5000            1              NaN
1   5500            0            500.0
2   6700            1           1700.0
3   7100            0            400.0
4   8500            0           1400.0
5   9000            0            500.0
6   10500           1           3800.0
7   15750           0           5250.0
8   18000           1           7500.0
9   22250           0           4250.0
10  26000           0           3750.0
11  29750           0           3750.0
12  33500           0           3750.0
13  37250           0           3750.0
14  41000           1          23000.0
15  44750           0           3750.0
16  48500           1           7500.0
17  52250           1           3750.0
18  56000           0           3750.0
19  59750           1           7500.0
20  63500           0           3750.0
21  67250           0           3750.0
22  71000           0           3750.0
23  74750           0           3750.0
24  78500           0           3750.0
25  82250           1          22500.0
26  86000           0           3750.0
27  89750           1           7500.0

I tried using groupby, with no luck:

df.insert(2, 'Difference', (df.groupby('Condition')['Values'].diff()))

When I filter the dataframe on the condition and calculate the difference, I get close to the desired output, but then I have to work with two columns. Is there a way to do this in a single column? I am fairly new to Python and would appreciate some help.

CodePudding user response:

First calculate the difference between successive rows for the whole column with diff. Then, for the rows where Condition is 1, overwrite that result with the diff of Values computed on only those rows. The assignment aligns on the index, so each value lands in the row it belongs to.

m = df['Condition'].astype(bool)
df['res'] = df['Values'].diff()
df.loc[m, 'res'] = df.loc[m, 'Values'].diff()
print(df)
#     Values  Condition  Desired_Output      res
# 0     5000          1             NaN      NaN
# 1     5500          0           500.0    500.0
# 2     6700          1          1700.0   1700.0
# 3     7100          0           400.0    400.0
# 4     8500          0          1400.0   1400.0
# 5     9000          0           500.0    500.0
# 6    10500          1          3800.0   3800.0
# 7    15750          0          5250.0   5250.0
# 8    18000          1          7500.0   7500.0
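
To make the index-alignment step concrete, here is a minimal sketch on the first nine rows of the question's data. The intermediate name cond_diff is only for illustration and is not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Values": [5000, 5500, 6700, 7100, 8500, 9000, 10500, 15750, 18000],
    "Condition": [1, 0, 1, 0, 0, 0, 1, 0, 1],
})
m = df["Condition"].astype(bool)

# Diff over only the Condition == 1 rows. The result keeps the
# original index labels of those rows (0, 2, 6, 8).
cond_diff = df.loc[m, "Values"].diff()
print(cond_diff)

# Start from the plain row-to-row diff, then assign through .loc.
# pandas aligns cond_diff on its index labels, so every value is
# written back to the row it was computed for.
df["res"] = df["Values"].diff()
df.loc[m, "res"] = cond_diff
print(df)
```

Because the filtered Series carries its original labels, no reindexing or merge is needed before assigning it back.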

If you want a one-liner, you can combine where, mask and ffill (m is already boolean, so it needs no further cast):

df['res2'] = df['Values'].diff().mask(m, df['Values'].where(m).ffill().diff())
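
If the one-liner is hard to read, it may help to see it split into steps. This is a sketch on the first nine rows of the question's data; the names only_ones, carried, cond_diff and expected are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Values": [5000, 5500, 6700, 7100, 8500, 9000, 10500, 15750, 18000],
    "Condition": [1, 0, 1, 0, 0, 0, 1, 0, 1],
})
m = df["Condition"].astype(bool)

# Keep Values only where Condition == 1; every other row becomes NaN.
only_ones = df["Values"].where(m)

# Carry the most recent Condition == 1 value forward over the gaps.
carried = only_ones.ffill()

# On a Condition == 1 row, the row above now holds the previous
# Condition == 1 value, so a plain diff gives the wanted difference.
cond_diff = carried.diff()

# Plain row-to-row diff, replaced by cond_diff wherever m is True.
df["res2"] = df["Values"].diff().mask(m, cond_diff)

# Cross-check against the two-step version; raises if they differ.
expected = df["Values"].diff()
expected.loc[m] = df.loc[m, "Values"].diff()
pd.testing.assert_series_equal(df["res2"], expected, check_names=False)
print(df)
```

The two-step version is probably easier to maintain; the one-liner mainly saves the temporary column.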

CodePudding user response:

It sounds like you can calculate the differences for every row, and then override them with the differences computed over only the Condition == 1 rows.

This would look like this:

import pandas as pd

Values = [5000, 5500, 6700, 7100, 8500, 9000, 10500, 15750, 18000, 22250, 26000]
Condition = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0]
df = pd.DataFrame(data={"Values": Values, "Condition": Condition})
# Plain row-to-row difference for every row
df["Desired_Output"] = df.Values.diff()
# Override the Condition == 1 rows with the diff taken over only those rows
df.loc[df.Condition == 1, "Desired_Output"] = df[df.Condition == 1].Values.diff()
print(df)

The output is:

    Values  Condition  Desired_Output
0     5000          1             NaN
1     5500          0           500.0
2     6700          1          1700.0
3     7100          0           400.0
4     8500          0          1400.0
5     9000          0           500.0
6    10500          1          3800.0
7    15750          0          5250.0
8    18000          1          7500.0
9    22250          0          4250.0
10   26000          0          3750.0
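
As for why the groupby attempt in the question gets close but does not match: groupby('Condition') also takes the diff within the Condition == 0 group, so a 0 row is compared with the previous 0 row rather than with the row directly above it. A small sketch on the same eleven rows shows the two side by side; the names grouped, wanted and comparison are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Values": [5000, 5500, 6700, 7100, 8500, 9000, 10500, 15750, 18000, 22250, 26000],
    "Condition": [1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0],
})

# The attempt from the question: diff within each Condition group.
grouped = df.groupby("Condition")["Values"].diff()

# The wanted result: plain diff, overridden on the Condition == 1 rows.
is_one = df["Condition"] == 1
wanted = df["Values"].diff()
wanted.loc[is_one] = df.loc[is_one, "Values"].diff()

comparison = pd.DataFrame({
    "Condition": df["Condition"],
    "groupby_diff": grouped,
    "wanted": wanted,
})
print(comparison)
```

The two columns agree on every Condition == 1 row, and on a Condition == 0 row only when the row above it is also 0. Row 3, for example, gets 7100 - 5500 from groupby but should be 7100 - 6700.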