Below, I have a table where columns TST1
to TST5
can take no values or one of the following: NOT_DONE
INCOMP
UNTESTED
30
35
40
45
50
I need to count the number of elements (rows) that are validated from the table below.
An element is considered validated when the right-most value is one between 30 and 50 (seperated by 5, so 30, 35, 40...). This means that if the row has no value for all TST1
to TST5
then nothing is counted. If a numerical value is found to the left of either NOT_DONE
INCOMP
or UNTESTED
then it is not validated.
In other words, I need to count each row right to left.
For example from the table below, only 6 elements are considered validated.
Finally, I need to count how many of these belong to group A or B.
My initial idea to solve this was to create a new column with all the validated elements but I'm really not sure how to do this.
I am using python 2.7 and pandas 0.24.2. I am new to this, any help or guidance is greatly appreciated.
------- ---------- ---------- ---------- -------- ----------
| Group | TST1 | TST2 | TST3 | TST4 | TST5 |
------- ---------- ---------- ---------- -------- ----------
| A | | NOT_DONE | | | 50 |
------- ---------- ---------- ---------- -------- ----------
| A | | | 35 | | |
------- ---------- ---------- ---------- -------- ----------
| B | | | | | |
------- ---------- ---------- ---------- -------- ----------
| A | | | INCOMP | | |
------- ---------- ---------- ---------- -------- ----------
| B | UNTESTED | | 50 | INCOMP | |
------- ---------- ---------- ---------- -------- ----------
| B | | | | | |
------- ---------- ---------- ---------- -------- ----------
| B | | 30 | | | |
------- ---------- ---------- ---------- -------- ----------
| A | | INCOMP | 40 | | |
------- ---------- ---------- ---------- -------- ----------
| B | | | | | UNTESTED |
------- ---------- ---------- ---------- -------- ----------
| A | | | | | |
------- ---------- ---------- ---------- -------- ----------
| B | | INCOMP | | | |
------- ---------- ---------- ---------- -------- ----------
| A | | | | | |
------- ---------- ---------- ---------- -------- ----------
| B | | 50 | | | |
------- ---------- ---------- ---------- -------- ----------
| B | | | UNTESTED | 35 | NOT_DONE |
------- ---------- ---------- ---------- -------- ----------
| B | | | | | |
------- ---------- ---------- ---------- -------- ----------
| A | | 40 | | INCOMP | |
------- ---------- ---------- ---------- -------- ----------
| A | | | | 30 | |
------- ---------- ---------- ---------- -------- ----------
| B | | | | | |
------- ---------- ---------- ---------- -------- ----------
| B | | NOT_DONE | | 30 | NOT_DONE |
------- ---------- ---------- ---------- -------- ----------
Edit: This is what I have tried but it counts all rows presenting a numerical value, not the ones with right-most value being numerical. I really have no idea on how to select going from the right first.
filter1 = df.loc[:, 'TST1':'TST5']\
.apply(lambda x: x.astype(str).str.match(r'\d \.*\d*'), axis=0)\
.any(axis=1)
number_validated = filter1.sum()
print "Number of validated items: ", number_validated
Expected output should just be a short textual summary:
Number of validated items: 5
Number of group A validated items: 4
Number of group B validated items: 2
CodePudding user response:
Another option, tested on python 2.7.18 and pandas 0.24.2 (though it works fine in python 3):
Use
ffill
to extract the rightmost values andto_numeric
to coerce them into numbers:rightmost = df.filter(like='TST').ffill(axis='columns').iloc[:, -1] rightmost = pd.to_numeric(rightmost, errors='coerce') # 0 NaN # 1 35.0 # 2 NaN # 3 NaN # 4 NaN # 5 NaN # 6 30.0 # 7 40.0 # 8 NaN # 9 NaN # 10 NaN # 11 NaN # 12 50.0 # 13 NaN # 14 NaN # 15 NaN # 16 30.0 # 17 NaN # 18 NaN # Name: TST5, dtype: float64
Then
groupby
theGroup
and check if they arebetween
30 and 50 (inclusive):valid = rightmost.groupby(df.Group).apply( lambda g: g.between(30, 50, inclusive='both').sum() ).to_frame('Valid') # Valid # Group # A 3 # B 2