I have a pandas dataframe like
index start end label
0 2 5 0
1 3 8 1
2 4 8 0
3 5 9 1
4 6 10 0
5 7 10 1
6 8 11 1
7 9 12 0
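I want a new column 'mean' whose value is the mean of label over the current row and the previous rows,
restricted to rows that satisfy the condition df['start'] < df['end'], where start is taken from the current row and end from the earlier row.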
Example:
for index 1, df['mean'] = (df[0]['label'] + df[1]['label'])/2
for index 3, df['mean'] = (df[1]['label'] + df[2]['label'] + df[3]['label'])/3
Here index 0 is ignored because the condition df[3]['start'] < df[0]['end'] is not satisfied.
Similarly, for index 7, df['mean'] = (df[4]['label'] + df[5]['label'] + df[6]['label'] + df[7]['label'])/4,
because for i = 0, 1, 2, 3 the condition df[7]['start'] < df[i]['end'] is not satisfied.
So the final output would be
index start end label mean
0 2 5 0 0
1 3 8 1 1/2
2 4 8 0 1/3
3 5 9 1 2/3
4 6 10 0 2/4
5 7 10 1 3/5
6 8 11 1 3/4
7 9 12 0 2/4
I tried using cumsum, but I am not sure how to apply the condition.
CodePudding user response:
Here is a less performant solution (looping over each row should generally be avoided in pandas), but one that is hopefully accessible as a starting point that you can then optimize:
import pandas as pd

df = pd.DataFrame([
    [2,5,0],
    [3,8,1],
    [4,8,0],
    [5,9,1],
    [6,10,0],
    [7,10,1],
    [8,11,1],
    [9,12,0]],columns=['start','end','label'])

for index, row in df.iterrows():
    current_row_start = row['start']
    # All rows up to and including the current one (.loc slicing is inclusive);
    # .copy() avoids writing into a slice of df.
    previous_rows_as_df = df.loc[0:index].copy()
    previous_rows_as_df['include'] = False
    for p_index, p_row in previous_rows_as_df.iterrows():
        if current_row_start < p_row['end']:
            previous_rows_as_df.at[p_index, 'include'] = True
    # Row 0 needs no special case: its window contains only itself.
    df.at[index, 'cumulative_mean'] = previous_rows_as_df[previous_rows_as_df['include']]['label'].mean()
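As one possible intermediate step between the nested loop above and a fully vectorized version, the inner loop can be replaced by a boolean mask over the earlier rows. This is only a sketch: the function name running_overlap_mean is made up for illustration, rows are taken in their current order, and returning NaN when no row qualifies is an assumption (the question does not cover that case).

```python
import numpy as np
import pandas as pd


def running_overlap_mean(df: pd.DataFrame) -> pd.Series:
    """For each row i, average `label` over rows j <= i with start[i] < end[j]."""
    start = df["start"].to_numpy()
    end = df["end"].to_numpy()
    label = df["label"].to_numpy()
    out = np.empty(len(df), dtype=float)
    for i in range(len(df)):
        # Rows 0..i whose end lies after the current row's start.
        keep = start[i] < end[: i + 1]
        # Assumption: if no row qualifies (only possible when
        # start[i] >= end[i]), the result is NaN.
        out[i] = label[: i + 1][keep].mean() if keep.any() else np.nan
    return pd.Series(out, index=df.index, name="mean")


df = pd.DataFrame(
    [[2, 5, 0], [3, 8, 1], [4, 8, 0], [5, 9, 1],
     [6, 10, 0], [7, 10, 1], [8, 11, 1], [9, 12, 0]],
    columns=["start", "end", "label"],
)
df["mean"] = running_overlap_mean(df)
print(df)
```

This still loops over the rows in Python, but it needs only O(n) extra memory and no nested iterrows calls.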
CodePudding user response:
Here is a vectorized NumPy solution that produces your result:
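import numpy as np

# mask_matrix[i, j] is True when row j is not after row i and start[i] < end[j]
mask_matrix = (
    (df.start.to_numpy().reshape(1, -1).T < df.end.to_numpy())
    & (df.index.to_numpy() <= np.arange(0, len(df)).reshape(1, -1).T)
)
# Sum of the included labels per row, divided by the number of included rows
df_add = pd.DataFrame(
    np.matmul(mask_matrix, df.label.to_numpy()) / mask_matrix.sum(axis=-1),
    columns=["mean"],
)
df = pd.concat([df, df_add], axis=1)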
Creating the matrix uses O(n^2) additional space. Hopefully that is not a problem; otherwise you need a loop, which I personally don't like when vectorized computation is available.
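To get a feel for that O(n^2) cost: a NumPy boolean array stores one byte per element, so the mask needs roughly n * n bytes. The sketch below checks this on an 8 x 8 array and extrapolates without allocating anything; the helper name mask_bytes and the row counts are arbitrary choices for illustration.

```python
import numpy as np


def mask_bytes(n_rows: int) -> int:
    """Bytes needed for an n_rows x n_rows boolean mask (one byte per element)."""
    return n_rows * n_rows * np.dtype(bool).itemsize


small = np.zeros((8, 8), dtype=bool)
print(small.nbytes)  # 64, matching mask_bytes(8)

# Extrapolate without allocating anything.
for n_rows in (10_000, 100_000):
    print(n_rows, mask_bytes(n_rows) / 1e9, "GB")
```

At around a hundred thousand rows the mask alone is on the order of 10 GB, which is roughly where a row-by-row approach becomes the more practical choice.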
A few additional comments: df.start.to_numpy().reshape(1,-1).T < df.end.to_numpy() compares every row's start against every row's end, so entry (i, j) is True where the start of row i is below the end of row j. This is the result:
array([[ True, True, True, True, True, True, True, True, True],
[ True, True, True, True, True, True, True, True, True],
[ True, True, True, True, True, True, True, True, True],
[False, True, True, True, True, True, True, True, True],
[False, True, True, True, True, True, True, True, True],
[False, True, True, True, True, True, True, True, True],
[False, False, False, True, True, True, True, True, False],
[False, False, False, False, True, True, True, True, False],
[False, False, False, False, True, True, True, True, False]])
(df.index.to_numpy() <= np.arange(0,len(df)).reshape(1, -1).T) restricts the previous result to the current row and the rows that precede it. This mask looks like this:
array([[ True, False, False, False, False, False, False, False, False],
[ True, True, False, False, False, False, False, False, False],
[ True, True, True, False, False, False, False, False, False],
[ True, True, True, True, False, False, False, False, False],
[ True, True, True, True, True, False, False, False, False],
[ True, True, True, True, True, True, False, False, False],
[ True, True, True, True, True, True, True, False, False],
[ True, True, True, True, True, True, True, True, False],
[ True, True, True, True, True, True, True, True, True]])
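The same lower-triangular mask can also be built from positions alone, which avoids relying on the index being 0..n-1. A small sketch, assuming n is the number of rows (eight for the question's sample frame):

```python
import numpy as np

n = 8  # number of rows in the question's sample frame

# Entry (i, j) is True when column position j <= row position i.
by_comparison = np.arange(n) <= np.arange(n).reshape(-1, 1)
# np.tri puts ones on and below the main diagonal.
by_tri = np.tri(n, dtype=bool)

print(np.array_equal(by_comparison, by_tri))  # True
```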
The final mask_matrix (the elementwise AND of the previous two matrices) looks like this:
array([[ True, False, False, False, False, False, False, False, False],
[ True, True, False, False, False, False, False, False, False],
[ True, True, True, False, False, False, False, False, False],
[False, True, True, True, False, False, False, False, False],
[False, True, True, True, True, False, False, False, False],
[False, True, True, True, True, True, False, False, False],
[False, False, False, True, True, True, True, False, False],
[False, False, False, False, True, True, True, True, False],
[False, False, False, False, True, True, True, True, False]])
Multiplying this mask_matrix by the vector df.label gives the sum of the included labels for each row, which is almost what we need. We then just divide elementwise by the number of True values in each row of mask_matrix.
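Putting the steps together, here is a self-contained sketch that rebuilds both masks for the eight-row frame from the question (so each mask is 8 x 8) and checks the result against the expected fractions. The variable names overlaps, earlier, sums and counts are chosen here for readability and are not from the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[2, 5, 0], [3, 8, 1], [4, 8, 0], [5, 9, 1],
     [6, 10, 0], [7, 10, 1], [8, 11, 1], [9, 12, 0]],
    columns=["start", "end", "label"],
)

start = df["start"].to_numpy()
end = df["end"].to_numpy()
label = df["label"].to_numpy()

# overlaps[i, j]: the start of row i is before the end of row j.
overlaps = start[:, None] < end[None, :]
# earlier[i, j]: row j is row i itself or comes before it.
earlier = np.tri(len(df), dtype=bool)
mask = overlaps & earlier

sums = mask @ label          # sum of the included labels per row
counts = mask.sum(axis=1)    # number of included rows per row
# Note: counts is never zero here; it could be for a row with start >= end.
df["mean"] = sums / counts

expected = np.array([0, 1/2, 1/3, 2/3, 2/4, 3/5, 3/4, 2/4])
print(np.allclose(df["mean"], expected))
```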