Pandas how to find consecutive value increase over time on time series data-CodePudding

I have a dataframe with three columns (tilted as name, value, date)as below:

Tool	Value	Date
A	52.14	1/1
A	51.5	1/7
A	52	1/10
A	52.9	2/1
B	53.1	1/5
B	51.7	1/10
B	51.9	1/21
B	52.4	1/22
B	53.0	2/1
B	51.5	2/15

I would like to find which tools have increased measure values on three measurement days.

Tool B's value has increased on 1/21 and then increased on 1/22 and then increased on 2/1. so the outcome will be as below:

Tool	Value	Date
B	51.7	1/10
B	51.9	1/21
B	52.4	1/22
B	53.0	2/1

I am wondering how can I define a function in pandas to give the desired result. Thanks.

CodePudding user response：

Here's an answer, but I bet there is a better one. I make new Series to keep track of whether a value is less than the next (increasing), increasing 'runs' as groups (increase_group) and then how many consecutive increases happen in that group (consec_increases). I've kept these as stand-alone Series, but if you add them as columns to your table you can see the reasoning. Getting the 2/1 row added is a bit hacked together because it's not part of the same increase_group and I can't figure that out in a more clever way than just adding one more index past the group max

  Tool  Value  Date  increasing  increase_group  consec_increases
0    A  52.14   1/1       False               1                 0
1    A  51.50   1/7        True               2                 2
2    A  52.00  1/10        True               2                 2
3    A  52.90   2/1       False               3                 0
4    B  53.10   1/5       False               4                 0
5    B  51.70  1/10        True               5                 3
6    B  51.90  1/21        True               5                 3
7    B  52.40  1/22        True               5                 3
8    B  53.00   2/1       False               6                 0
9    B  51.50  2/15       False               6                 0

Here's the actual code

df = pd.DataFrame({
    'Tool': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'Value': [52.14, 51.5, 52.0, 52.9, 53.1, 51.7, 51.9, 52.4, 53.0, 51.5],
    'Date': ['1/1','1/7','1/10','2/1','1/5','1/10','1/21','1/22','2/1','2/15'],
})

#Assuming your table is already sorted by Tool then by Date like your example

#Bool Series that is True if the prev Tool value is lt (<) the next (increasing) 
increasing = df['Value'].lt(df.groupby('Tool')['Value'].shift(-1))
#df['increasing'] = increasing #uncomment if you want to see what's happening

#Group the rows by increasing 'runs'
increase_group = increasing.groupby(df['Tool'], group_keys=False).apply(lambda v: v.ne(v.shift(1))).cumsum()
#df['increase_group'] = increase_group #uncomment if you want to see what's happening

#Count the number of consecutive increases per group
consec_increases = increasing.groupby(increase_group).transform(lambda v: v.cumsum().max())
#df['consec_increases'] = consec_increases #uncomment if you want to see what's happening

#print(df) #uncomment if you want to see what's happening

#get the row indices of each group w/ 3 or more consec increases 
inds_per_group = (
    increase_group[consec_increases >= 3] #this is where you can set the threshold
        .groupby(increase_group)
        .apply(lambda g: list(g.index) [max(g.index) 1]) #get all the row inds AND one more
)

#use the row indices to get the output table
out_df = pd.concat(df.loc[inds].assign(group=i) for i,inds in enumerate(inds_per_group))

With output:

  Tool  Value  Date  group
5    B   51.7  1/10      0
6    B   51.9  1/21      0
7    B   52.4  1/22      0
8    B   53.0   2/1      0

The reason I've added a group column is to help you distinguish when you have multiple consecutive runs of 3 or more. For example if your original table had 50.14 instead of 52.14 for the first value:

df = pd.DataFrame({
    'Tool': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'Value': [50.14, 51.5, 52.0, 52.9, 53.1, 51.7, 51.9, 52.4, 53.0, 51.5],
    'Date': ['1/1','1/7','1/10','2/1','1/5','1/10','1/21','1/22','2/1','2/15'],
})

Then the output is

  Tool  Value  Date  group
0    A  50.14   1/1      0
1    A  51.50   1/7      0
2    A  52.00  1/10      0
3    A  52.90   2/1      0
5    B  51.70  1/10      1
6    B  51.90  1/21      1
7    B  52.40  1/22      1
8    B  53.00   2/1      1

CodePudding user response：

You can first sort everything according to "tool" and "date"

df = df.sort_values(by=["tool", "date"])

Then, you can use the shift() on "Value" to add another column which contains the "Value" from the day before.

df["value_minus_1"] = df["Value"].shift()

This gives you a column which contains the measurement of the previous day. Then, you can use shift(2) to get the measurements of two days ago.

df["value_minus_2"] = df["Value"].shift(2)

Finally, you can drop the rows that have nas and filter the rows:

df = df.dropna()
df = df[df.val > df.value_minus_1]
df = df[df.value_minus_1 > df.value_minus_2]