I have a DataFrame that looks like this:
Answers,all_answers,Score
1.0,1,1
0.0,0,1
0.0,0,2
0.0,0,3
-1.0,1,1
0.0,0,1
0.0,0,2
1.0,1,1
-1.0,1,2
0.0,0,1
1.0,1,1
Each row is one second.
The Answers column gives the sign of the calculation made in that second: 1 means a positive number, -1 a negative number.
The all_answers column is the absolute value of Answers, i.e. with the negative sign removed. It exists to drive the third column, Score, which is a running count of how many times the value 0 or 1 has occurred in a row.
The DataFrame can consist of about 5000 rows, and a run of zeros counted in Score can average around 300 rows.
I want to collect the all_answers values into a new column, keeping the sign from Answers, under the condition that the run of zeros between them does not exceed 5. (A small sketch of how Score is built follows the first example below.)
For example:
Answers,all_answers,Score,New
1.0,1,1,1
0.0,0,1,0
0.0,0,2,0
0.0,0,3,0
0.0,0,4,0
0.0,0,5,0
0.0,0,6,0
-1.0,1,1,0
0.0,0,1,0
0.0,0,2,0
1.0,1,1,1
-1.0,1,2,-1
0.0,0,1,0
1.0,1,1,1
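To make the Score definition concrete, here is a minimal pandas sketch that rebuilds the first example above (the name example is only for illustration) and recomputes Score as a running count within each run of identical all_answers values:
import pandas as pd

example = pd.DataFrame({
    'Answers':     [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 1.0, -1.0, 0.0, 1.0],
    'all_answers': [1,   0,   0,   0,   0,   0,   0,    1,   0,   0,   1,    1,   0,   1],
})
# a new run starts whenever all_answers changes value
runs = example['all_answers'].ne(example['all_answers'].shift()).cumsum()
# Score counts 1, 2, 3, ... within each run of identical values
example['Score'] = example.groupby(runs).cumcount() + 1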
The values can also look like this:
Answers,all_answers,Score,New
1.0,1,1,1
0.0,0,1,0
-1.0,1,1,-1
0.0,0,1,0
1.0,1,1,1
0.0,0,1,0
-1.0,1,1,-1
0.0,0,1,0
1.0,1,1,1
0.0,0,1,0
-1.0,1,1,-1
0.0,0,1,0
0.0,0,2,0
1.0,1,1,1
And it can look like this:
Answers,all_answers,Score,New
0.0,0,19,0
0.0,0,20,0
0.0,0,21,0
0.0,0,22,0
0.0,0,23,0
0.0,0,24,0
0.0,0,25,0
0.0,0,26,0
-1.0,1,1,0
0.0,0,1,0
0.0,0,2,0
0.0,0,3,0
0.0,0,4,0
0.0,0,5,0
0.0,0,6,0
1.0,1,1,0
Very rarely, this can happen; but this is exactly the pattern I'm trying to find, and I want to see it on a chart.
Answers,all_answers,Score,New
1.0,1,1,0
0.0,0,1,0
0.0,0,2,0
0.0,0,3,0
0.0,0,4,0
0.0,0,5,0
-1.0,1,1,-1
1.0,1,2,1
-1.0,1,3,-1
1.0,1,4,1
-1.0,1,5,-1
0.0,0,1,0
0.0,0,2,0
0.0,0,3,0
0.0,0,4,0
1.0,1,1,1
0.0,0,1,0
0.0,0,2,0
0.0,0,3,0
0.0,0,4,0
0.0,0,5,0
0.0,0,6,0
0.0,0,7,0
I already have a script that finds values that occur five times in a row. That is, it just collects such values, but it does not see the values that come a couple of seconds later.
Here it is:
import numpy as np

# Score: running count within each run of identical all_answers values
s = np.sign(df['all_answers'])
group = s.ne(s.shift()).cumsum()
df['Score'] = s.groupby(group).cumcount().add(1)

# keep Answers only where the run of identical all_answers values is at least 5 long
g = df['all_answers'].ne(df['all_answers'].shift()).cumsum()
m = df.groupby(g)['all_answers'].transform('size').ge(5)
df['New'] = df['Answers'].where(m, 0)
This is what it outputs:
Answers,all_answers,Score,New
1.0,1,1,0.0
0.0,0,1,0.0
0.0,0,2,0.0
0.0,0,3,0.0
0.0,0,4,0.0
0.0,0,5,0.0
-1.0,1,1,-1
1.0,1,2,1.0
-1.0,1,3,-1
1.0,1,4,1.0
-1.0,1,5,-1
The logic of the new column is: collect the 1 and -1 values from Answers that are not separated by a stretch in which the Score column has climbed above 5.
CodePudding user response:
IIUC, something like the following should work.
Data
import pandas as pd
import numpy as np
data = {'Answers': {0: 1, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: -1, 8: 0,
9: 0, 10: 1, 11: -1, 12: 0, 13: 1},
'all_answers': {0: 1, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 1,
8: 0, 9: 0, 10: 1, 11: 1, 12: 0, 13: 1},
'Score': {0: 1, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 1, 8: 1,
9: 2, 10: 6, 11: 2, 12: 1, 13: 1}}
df = pd.DataFrame(data)
# notice that I've given Answer `1` at index value `10` a Score of `6`.
# This should *not* affect `New` for the next answer.
print(df.iloc[10:12])
Answers all_answers Score
10 1 1 6
11 -1 1 2
Problem
Each answer (-1, 1) in col Answers may be preceded by one or more rows where df.Answers == 0. If, in such an event, those preceding 0 rows carry a Score value that exceeds 5, then the following answer (so: either -1 or 1) should get a 0 in col New. In all other scenarios, col New should simply take the value from col Answers.
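To pin the rule down before vectorizing it, here is a plain-Python reference sketch (new_column_loop is just an illustrative name; it uses the df built above and does not modify it):
# Reference sketch of the rule: keep a signed answer only if the run of
# 0-rows immediately before it is not longer than `max_zeros`.
def new_column_loop(df, max_zeros=5):
    new = []
    zeros_before = 0                  # length of the 0-run preceding the current row
    for ans in df['Answers']:
        if ans == 0:
            new.append(0)
            zeros_before += 1
        else:
            new.append(ans if zeros_before <= max_zeros else 0)
            zeros_before = 0
    return new

print(new_column_loop(df))
# [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 1]   (same as col `New` in the solution below)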
Solution
# add col `groups` with each consecutive value in `Answers` as one group
df['groups'] = (df.Answers != df.Answers.shift()).cumsum()
# add col `mask` with 1 for each `Answers` value `(-1,1)`,
# and the actual df.Score for all `0` vals
df['mask'] = np.where(df.Answers.ne(0),1,df.Score)
# now groupby `groups` and get the max `mask` val for each group,
# check if this val exceeds 5 (`gt(5)`) to get a boolean Series
# we shift the series so that it can be mapped onto the *following* group
# (and we `fillna` for the very first group:
# it will always need a `False` (never preceded by `0s` of course))
# finally, we map this result to col `groups`, and use the boolean Series
# inside `np.where` for evaluation. If `True` we want `0`, else `df.Answers`
df['New'] = np.where(
    df['groups'].map(
        df.groupby('groups')['mask'].max().gt(5).shift().fillna(False)),
    0,
    df.Answers)
# drop the temp cols `groups` and `mask`
df.drop(['groups','mask'], axis=1, inplace=True)
print(df)
Answers all_answers Score New
0 1 1 1 1
1 0 0 1 0
2 0 0 2 0
3 0 0 3 0
4 0 0 4 0
5 0 0 5 0
6 0 0 6 0
7 -1 1 1 0
8 0 0 1 0
9 0 0 2 0
10 1 1 6 1
11 -1 1 2 -1
12 0 0 1 0
13 1 1 1 1
Check:
print(df.loc[df.Answers.ne(0)])
Answers all_answers Score New
0 1 1 1 1
7 -1 1 1 0
10 1 1 6 1
11 -1 1 2 -1
13 1 1 1 1
So, all answers should be OK in New, except -1 at index value 7, since this is preceded by consecutive 0 rows with a Score value exceeding 5, namely:
print(df.iloc[1:8])
Answers all_answers Score New
1 0 0 1 0
2 0 0 2 0
3 0 0 3 0
4 0 0 4 0
5 0 0 5 0
6 0 0 6 0
7 -1 1 1 0
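Finally, since the goal is to see the result on a chart, here is a minimal plotting sketch (assuming matplotlib is available; the step style and figure size are only illustrative choices):
import matplotlib.pyplot as plt

# plot the filtered answers over time (each row index is one second)
fig, ax = plt.subplots(figsize=(10, 3))
ax.step(df.index, df['New'], where='post', label='New')
ax.set_xlabel('second (row index)')
ax.set_ylabel('New')
ax.legend()
plt.show()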