I want to calculate the running difference of column ['Values'] based on a binary condition in another column ['Condition']. If the condition is 0, the result is the difference between the current row and the preceding row. If the condition is 1, the result is the difference between the current row and the previous row where the condition was also 1, like so:
Values Condition Desired_Output
0 5000 1 NaN
1 5500 0 500.0
2 6700 1 1700.0
3 7100 0 400.0
4 8500 0 1400.0
5 9000 0 500.0
6 10500 1 3800.0
7 15750 0 5250.0
8 18000 1 7500.0
9 22250 0 4250.0
10 26000 0 3750.0
11 29750 0 3750.0
12 33500 0 3750.0
13 37250 0 3750.0
14 41000 1 23000.0
15 44750 0 3750.0
16 48500 1 7500.0
17 52250 1 3750.0
18 56000 0 3750.0
19 59750 1 7500.0
20 63500 0 3750.0
21 67250 0 3750.0
22 71000 0 3750.0
23 74750 0 3750.0
24 78500 0 3750.0
25 82250 1 22500.0
26 86000 0 3750.0
27 89750 1 7500.0
I tried using the groupby function, with no luck:
df.insert(2, 'Difference', (df.groupby('Condition')['Values'].diff()))
When I filter the dataframe on the condition and calculate the difference, I get close to the desired output, but then I have to work with two columns. Is there a way to do this in a single column? I am fairly new to Python and would appreciate some help.
CodePudding user response:
First calculate the difference between successive rows for the whole column with diff. Then select only the rows where Condition is 1, take diff of the Values column on that subset, and assign the result back to those rows. The assignment works through index alignment.
m = df['Condition'].astype(bool)
df['res'] = df['Values'].diff()
df.loc[m, 'res'] = df.loc[m, 'Values'].diff()
print(df)
# Values Condition Desired_Output res
# 0 5000 1 NaN NaN
# 1 5500 0 500.0 500.0
# 2 6700 1 1700.0 1700.0
# 3 7100 0 400.0 400.0
# 4 8500 0 1400.0 1400.0
# 5 9000 0 500.0 500.0
# 6 10500 1 3800.0 3800.0
# 7 15750 0 5250.0 5250.0
# 8 18000 1 7500.0 7500.0
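To see why the index alignment works, it can help to look at the intermediate result on its own. The following is a minimal sketch using the first nine rows from the question; the name sub is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Values": [5000, 5500, 6700, 7100, 8500, 9000, 10500, 15750, 18000],
    "Condition": [1, 0, 1, 0, 0, 0, 1, 0, 1],
})
m = df["Condition"].astype(bool)

# diff() over only the Condition == 1 rows. The result keeps the
# original index labels (0, 2, 6, 8) instead of a fresh 0..3 range.
sub = df.loc[m, "Values"].diff()
print(sub)

# Assigning through .loc matches on those labels, so only rows
# 0, 2, 6 and 8 are overwritten; the other rows keep the plain diff.
df["res"] = df["Values"].diff()
df.loc[m, "res"] = sub
print(df)
```

Because the subset keeps its labels, the values do not need to be repositioned by hand before assigning them back.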
If you want a one-liner, you can do it using where, mask and ffill:
df['res2'] = df['Values'].diff().mask(m, df['Values'].where(m).ffill().diff())
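Broken into separate steps, the one-liner reads as follows. This is only a sketch on the first nine rows of the question's data, and the intermediate names kept, filled and cond_diff are made up for readability.

```python
import pandas as pd

df = pd.DataFrame({
    "Values": [5000, 5500, 6700, 7100, 8500, 9000, 10500, 15750, 18000],
    "Condition": [1, 0, 1, 0, 0, 0, 1, 0, 1],
})
m = df["Condition"].astype(bool)

# Keep Values only on Condition == 1 rows, NaN everywhere else
kept = df["Values"].where(m)

# Carry the last Condition == 1 value forward over the NaN gaps
filled = kept.ffill()

# On a Condition == 1 row this is: current value minus the previous
# Condition == 1 value. On the other rows it is 0 and gets discarded below.
cond_diff = filled.diff()

# Plain row-to-row diff, replaced by cond_diff wherever m is True
res2 = df["Values"].diff().mask(m, cond_diff)
print(res2)
```

The ffill step is what lets a plain diff reach back past the Condition == 0 rows without filtering the frame.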
CodePudding user response:
It sounds like you can calculate the differences for every row, and then override those with the differences computed over only the Condition == 1 rows.
That would look like this:
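import pandas as pd

Values = [5000, 5500, 6700, 7100, 8500, 9000, 10500, 15750, 18000, 22250, 26000]
Condition = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0]
df = pd.DataFrame(data={"Values": Values, "Condition": Condition})
# Difference between each row and the row directly before it
df["Desired_Output"] = df.Values.diff()
# For Condition == 1 rows, overwrite with the difference to the previous Condition == 1 row
df.loc[df.Condition == 1, "Desired_Output"] = df[df.Condition == 1].Values.diff()
print(df)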
The output is:
Values Condition Desired_Output
0 5000 1 NaN
1 5500 0 500.0
2 6700 1 1700.0
3 7100 0 400.0
4 8500 0 1400.0
5 9000 0 500.0
6 10500 1 3800.0
7 15750 0 5250.0
8 18000 1 7500.0
9 22250 0 4250.0
10 26000 0 3750.0
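The sample above covers only the first eleven rows. As a check, the same two steps can be wrapped in a small helper and run against all 28 rows from the question. The function name conditional_diff is hypothetical, not part of pandas.

```python
import pandas as pd

def conditional_diff(values: pd.Series, condition: pd.Series) -> pd.Series:
    """Row-to-row diff, except rows where condition is 1 diff against
    the previous row where condition was also 1."""
    m = condition.astype(bool)
    out = values.diff()
    # .loc assignment aligns on index labels, so only masked rows change
    out.loc[m] = values[m].diff()
    return out

df = pd.DataFrame({
    "Values": [5000, 5500, 6700, 7100, 8500, 9000, 10500, 15750, 18000,
               22250, 26000, 29750, 33500, 37250, 41000, 44750, 48500,
               52250, 56000, 59750, 63500, 67250, 71000, 74750, 78500,
               82250, 86000, 89750],
    "Condition": [1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1,
                  0, 1, 0, 0, 0, 0, 0, 1, 0, 1],
})
df["Difference"] = conditional_diff(df["Values"], df["Condition"])
print(df)
```

Consecutive Condition == 1 rows (indices 16 and 17 in the question's data) need no special handling, since the previous Condition == 1 row is then simply the preceding row.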