I have a dataframe with three columns (tilted as name, value, date)as below:
Tool | Value | Date |
---|---|---|
A | 52.14 | 1/1 |
A | 51.5 | 1/7 |
A | 52 | 1/10 |
A | 52.9 | 2/1 |
B | 53.1 | 1/5 |
B | 51.7 | 1/10 |
B | 51.9 | 1/21 |
B | 52.4 | 1/22 |
B | 53.0 | 2/1 |
B | 51.5 | 2/15 |
I would like to find which tools have increased measure values on three measurement days.
Tool B's value has increased on 1/21 and then increased on 1/22 and then increased on 2/1. so the outcome will be as below:
Tool | Value | Date |
---|---|---|
B | 51.7 | 1/10 |
B | 51.9 | 1/21 |
B | 52.4 | 1/22 |
B | 53.0 | 2/1 |
I am wondering how can I define a function in pandas to give the desired result. Thanks.
CodePudding user response:
Here's an answer, but I bet there is a better one. I make new Series to keep track of whether a value is less than the next (increasing
), increasing 'runs' as groups (increase_group
) and then how many consecutive increases happen in that group (consec_increases
). I've kept these as stand-alone Series, but if you add them as columns to your table you can see the reasoning. Getting the 2/1
row added is a bit hacked together because it's not part of the same increase_group
and I can't figure that out in a more clever way than just adding one more index past the group max
Tool Value Date increasing increase_group consec_increases
0 A 52.14 1/1 False 1 0
1 A 51.50 1/7 True 2 2
2 A 52.00 1/10 True 2 2
3 A 52.90 2/1 False 3 0
4 B 53.10 1/5 False 4 0
5 B 51.70 1/10 True 5 3
6 B 51.90 1/21 True 5 3
7 B 52.40 1/22 True 5 3
8 B 53.00 2/1 False 6 0
9 B 51.50 2/15 False 6 0
Here's the actual code
df = pd.DataFrame({
'Tool': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
'Value': [52.14, 51.5, 52.0, 52.9, 53.1, 51.7, 51.9, 52.4, 53.0, 51.5],
'Date': ['1/1','1/7','1/10','2/1','1/5','1/10','1/21','1/22','2/1','2/15'],
})
#Assuming your table is already sorted by Tool then by Date like your example
#Bool Series that is True if the prev Tool value is lt (<) the next (increasing)
increasing = df['Value'].lt(df.groupby('Tool')['Value'].shift(-1))
#df['increasing'] = increasing #uncomment if you want to see what's happening
#Group the rows by increasing 'runs'
increase_group = increasing.groupby(df['Tool'], group_keys=False).apply(lambda v: v.ne(v.shift(1))).cumsum()
#df['increase_group'] = increase_group #uncomment if you want to see what's happening
#Count the number of consecutive increases per group
consec_increases = increasing.groupby(increase_group).transform(lambda v: v.cumsum().max())
#df['consec_increases'] = consec_increases #uncomment if you want to see what's happening
#print(df) #uncomment if you want to see what's happening
#get the row indices of each group w/ 3 or more consec increases
inds_per_group = (
increase_group[consec_increases >= 3] #this is where you can set the threshold
.groupby(increase_group)
.apply(lambda g: list(g.index) [max(g.index) 1]) #get all the row inds AND one more
)
#use the row indices to get the output table
out_df = pd.concat(df.loc[inds].assign(group=i) for i,inds in enumerate(inds_per_group))
With output:
Tool Value Date group
5 B 51.7 1/10 0
6 B 51.9 1/21 0
7 B 52.4 1/22 0
8 B 53.0 2/1 0
The reason I've added a group
column is to help you distinguish when you have multiple consecutive runs of 3 or more. For example if your original table had 50.14 instead of 52.14 for the first value:
df = pd.DataFrame({
'Tool': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
'Value': [50.14, 51.5, 52.0, 52.9, 53.1, 51.7, 51.9, 52.4, 53.0, 51.5],
'Date': ['1/1','1/7','1/10','2/1','1/5','1/10','1/21','1/22','2/1','2/15'],
})
Then the output is
Tool Value Date group
0 A 50.14 1/1 0
1 A 51.50 1/7 0
2 A 52.00 1/10 0
3 A 52.90 2/1 0
5 B 51.70 1/10 1
6 B 51.90 1/21 1
7 B 52.40 1/22 1
8 B 53.00 2/1 1
CodePudding user response:
You can first sort everything according to "tool" and "date"
df = df.sort_values(by=["tool", "date"])
Then, you can use the shift()
on "Value" to add another column which contains the "Value" from the day before.
df["value_minus_1"] = df["Value"].shift()
This gives you a column which contains the measurement of the previous day. Then, you can use shift(2)
to get the measurements of two days ago.
df["value_minus_2"] = df["Value"].shift(2)
Finally, you can drop the rows that have na
s and filter the rows:
df = df.dropna()
df = df[df.val > df.value_minus_1]
df = df[df.value_minus_1 > df.value_minus_2]