Let's say I have a Pandas Dataframe of the price and stock history of a product at 10 different points in time:
df = pd.DataFrame(index=[np.arange(10)])
df['price'] = 10,10,11,15,20,10,10,11,15,20
df['stock'] = 30,20,13,8,4,30,20,13,8,4
df
price stock
0 10 30
1 10 20
2 11 13
3 15 8
4 20 4
5 10 30
6 10 20
7 11 13
8 15 8
9 20 4
How do I perform operations between specific rows that meet certain criteria?
In my example row 0 and row 5 meet the criteria "stock over 25" and row 4 and row 9 meet the criteria "stock under 5".
I would like to calculate:
df['price'][4] - df['price'][0] and
df['price'][9] - df['price'][5]
but not
df['price'][9] - df['price'][0] or
df['price'][4] - df['price'][5].
In other words, I would like to calculate the price change between the most recent event where stock was under 5 vs the most recent event where stock was over 25; over the whole series.
Of course, I would like to do this over larger datasets where picking them manually is not good.
CodePudding user response:
First, set up data frame and add some calculations:
import pandas as pd
import numpy as np
df = pd.DataFrame(index=[np.arange(10)])
df['price'] = 10,10,11,15,20,10,10,11,15,20
df['stock'] = 30,20,13,8,4,30,20,13,8,4
df['stock_under_5'] = df['stock'] < 5
df['stock_over_25'] = df['stock'] > 25
df['cum_stock_under_5'] = df['stock_under_5'].cumsum()
df['change_stock_under_5'] = df['cum_stock_under_5'].diff()
df['change_stock_under_5'].iloc[0] = df['stock_under_5'].iloc[0]*1
df['next_row_change_stock_under_5'] = df['change_stock_under_5'].shift(-1)
df['cum_stock_over_25'] = df['stock_over_25'].cumsum()
df['change_stock_over_25'] = df['cum_stock_over_25'].diff()
df['change_stock_over_25'].iloc[0] = df['stock_over_25'].iloc[0]*1
df['next_row_change_stock_over_25'] = df['change_stock_over_25'].shift(-1)
df['row'] = np.arange(df.shape[0])
df['next_row'] = df['row'].shift(-1)
df['next_row_price'] = df['price'].shift(-1)
Next we find all windows where either the stock went over 25 or below 5 by grouping over the cumulative marker of those events.
changes = (
df.groupby(['cum_stock_under_5', 'cum_stock_over_25'])
.agg({'row':'first', 'next_row':'last', 'change_stock_under_5':'max', 'change_stock_over_25':'max',
'next_row_change_stock_under_5':'max', 'next_row_change_stock_over_25':'max',
'price':'first', 'next_row_price':'last'})
.assign(price_change = lambda x: x['next_row_price'] - x['price'])
.reset_index(drop=True)
)
For each window we find what happened at the beginning of the window: if change_stock_under_5 = 1
it means the window started with the stock going under 5, if change_stock_over_25 = 1
it started with the stock going over 25.
Same for the end of the window using the columns next_row_change_stock_under_5
and next_row_change_stock_over_25
Now, we can readily extract the stock price change in rows where the stock went from being over 25 to being under 5:
from_over_to_below = changes[(changes['change_stock_over_25']==1) & (changes['next_row_change_stock_under_5']==1)]
and the other way around:
from_below_to_over = changes[(changes['change_stock_under_5']==1) & (changes['next_row_change_stock_over_25']==1)]
You can for example calculate the average price change when the stock went from over 25 to below 5:
from_over_to_below.price_change.mean()
CodePudding user response:
In order to give a better explanation, will separate the approach by creating two different functions:
The first one will be the event detection, let's call it
detect_event
.The second one will calculate the the price between the current event and the previous one, in the list generated by the first function. We will call it
calculate_price_change
.
Starting with the first function, here it is key to understand very well the goals we want to reach or the constraints/conditions we want to satisfy.
Will leave two, of more, potential options, given the various interpretations of the question:
A. The initial is what I could get from my initial understanding
B. The second part will be one of the interpretations one could get from @Iyar Lyn comment (I can see more interpretations, but won't consider in this answer as the approach will be similar).
Within option A, we will create a function to detect where a stock is under 5 or 25
def detect_event(df):
# Create a list of the indexes of the events where stock was under 5 or over 25
events = []
# Loop through the dataframe
for i in range(len(df)):
# If stock is under 5, add the index to the list
if df['stock'][i] < 5:
events.append(i)
# If stock is over 25, add the index to the list
elif df['stock'][i] > 25:
events.append(i)
# Return the list of indexes of the events where stock was under 5 or over 25
return events
The comments make it self-explanatory, but, basically, this will return a list of indexes of the rows where stock is under 5 or over 25.
With OP's df this will return
events = detect_event(df)
[Out]:
[0, 4, 5, 9]
Within the option B, assuming one wants to know the events where the stock went from under 5 to over 25, and vice-versa, consecutively (there are more ways to interpret this), then one can use the following function
def detect_event(df):
# Create a list of the indexes of the events where we will store the elements in the conditions
events = []
for i, stock in enumerate(df['stock']):
# If the index is 0, add the index of the first event to the list of events
if i == 0:
events.append(i)
# If the index is not 0, check if the stock went from over 25 to under 5 or from under 5 to over 25
else:
# If the stock went from over 25 to under 5, add the index of the event to the list of events
if stock < 5 and df['stock'][i-1] > 25:
events.append(i)
# If the stock went from under 5 to over 25, add the index of the event to the list of events
elif stock > 25 and df['stock'][i-1] < 5:
events.append(i)
# Return the list of events
return events
With OP's df this will return
events = detect_event(df)
[Out]:
[0, 5]
Note that 0
is the element in the first position, that we are appending by default.
As for the second function, once the conditions are well defined, meaning we know clearly what we want, and adapted the first function, detect_event
, accordingly, we can now detect the changes in the prices.
In order to detect the price change between the events that satisfy the conditions we defined previously, one will use a different function: calculate_price_change
.
This function will take both the dataframe df
and the list events
generated by the previous function, and return a list with the prices diferences.
def calculate_price_change(df, events):
# Create a list to store the price change between the most recent event where stock was under 5 vs the most recent event where stock was over 25
price_change = []
# Loop through the list of indexes of the events
for i, event in enumerate(events):
# If the index is 0, the price change is 0
if i == 0:
price_change.append(0)
# If the index is not 0, calculate the price change between the current and past events
else:
price_change.append(df['price'][event] - df['price'][events[i-1]])
return price_change
Now we if one calls this last function using the df
and the list created with the first function detect_event
, one gets the following
price_change = calculate_price_change(df, events)
[Out]:
[0, 10, -10, 10]
Notes:
As it is, the question gives room for multiple interpretations. That's why my initial flag for "Needs details or clarity". For the future one might want to review: How do I ask a good question? and its hyperlinks.
I understand that sometimes we won't be able to specify everything that we want (as we might not even know - due to various reasons), so communication is key. Therefore, appreciate Iyar Lin's time and contributions as they helped improve this answer.