Home > front end >  Counting elements based on placement in dataframe
Counting elements based on placement in dataframe

Time:09-17

Below, I have a table where columns TST1 to TST5 can take no values or one of the following: NOT_DONE INCOMP UNTESTED 30 35 40 45 50

I need to count the number of elements (rows) that are validated from the table below.

An element is considered validated when the right-most value is one between 30 and 50 (seperated by 5, so 30, 35, 40...). This means that if the row has no value for all TST1 to TST5 then nothing is counted. If a numerical value is found to the left of either NOT_DONE INCOMP or UNTESTED then it is not validated.

In other words, I need to count each row right to left.

For example from the table below, only 6 elements are considered validated.

Finally, I need to count how many of these belong to group A or B.

My initial idea to solve this was to create a new column with all the validated elements but I'm really not sure how to do this.

I am using python 2.7 and pandas 0.24.2. I am new to this, any help or guidance is greatly appreciated.

 ------- ---------- ---------- ---------- -------- ---------- 
| Group | TST1     | TST2     | TST3     | TST4   | TST5     |
 ------- ---------- ---------- ---------- -------- ---------- 
| A     |          | NOT_DONE |          |        | 50       |
 ------- ---------- ---------- ---------- -------- ---------- 
| A     |          |          | 35       |        |          |
 ------- ---------- ---------- ---------- -------- ---------- 
| B     |          |          |          |        |          |
 ------- ---------- ---------- ---------- -------- ---------- 
| A     |          |          | INCOMP   |        |          |
 ------- ---------- ---------- ---------- -------- ---------- 
| B     | UNTESTED |          | 50       | INCOMP |          |
 ------- ---------- ---------- ---------- -------- ---------- 
| B     |          |          |          |        |          |
 ------- ---------- ---------- ---------- -------- ---------- 
| B     |          | 30       |          |        |          |
 ------- ---------- ---------- ---------- -------- ---------- 
| A     |          | INCOMP   | 40       |        |          |
 ------- ---------- ---------- ---------- -------- ---------- 
| B     |          |          |          |        | UNTESTED |
 ------- ---------- ---------- ---------- -------- ---------- 
| A     |          |          |          |        |          |
 ------- ---------- ---------- ---------- -------- ---------- 
| B     |          | INCOMP   |          |        |          |
 ------- ---------- ---------- ---------- -------- ---------- 
| A     |          |          |          |        |          |
 ------- ---------- ---------- ---------- -------- ---------- 
| B     |          | 50       |          |        |          |
 ------- ---------- ---------- ---------- -------- ---------- 
| B     |          |          | UNTESTED | 35     | NOT_DONE |
 ------- ---------- ---------- ---------- -------- ---------- 
| B     |          |          |          |        |          |
 ------- ---------- ---------- ---------- -------- ---------- 
| A     |          | 40       |          | INCOMP |          |
 ------- ---------- ---------- ---------- -------- ---------- 
| A     |          |          |          | 30     |          |
 ------- ---------- ---------- ---------- -------- ---------- 
| B     |          |          |          |        |          |
 ------- ---------- ---------- ---------- -------- ---------- 
| B     |          | NOT_DONE |          | 30     | NOT_DONE |
 ------- ---------- ---------- ---------- -------- ---------- 

Edit: This is what I have tried but it counts all rows presenting a numerical value, not the ones with right-most value being numerical. I really have no idea on how to select going from the right first.

    filter1 = df.loc[:, 'TST1':'TST5']\
        .apply(lambda x: x.astype(str).str.match(r'\d \.*\d*'), axis=0)\
        .any(axis=1)
    number_validated = filter1.sum()
    print "Number of validated items: ", number_validated

Expected output should just be a short textual summary:

Number of validated items: 5
Number of group A validated items: 4
Number of group B validated items: 2

CodePudding user response:

Another option, tested on python 2.7.18 and pandas 0.24.2 (though it works fine in python 3):

  1. Use ffill to extract the rightmost values and to_numeric to coerce them into numbers:

    rightmost = df.filter(like='TST').ffill(axis='columns').iloc[:, -1]
    rightmost = pd.to_numeric(rightmost, errors='coerce')
    
    # 0      NaN
    # 1     35.0
    # 2      NaN
    # 3      NaN
    # 4      NaN
    # 5      NaN
    # 6     30.0
    # 7     40.0
    # 8      NaN
    # 9      NaN
    # 10     NaN
    # 11     NaN
    # 12    50.0
    # 13     NaN
    # 14     NaN
    # 15     NaN
    # 16    30.0
    # 17     NaN
    # 18     NaN
    # Name: TST5, dtype: float64
    
  2. Then groupby the Group and check if they are between 30 and 50 (inclusive):

    valid = rightmost.groupby(df.Group).apply(
        lambda g: g.between(30, 50, inclusive='both').sum()
    ).to_frame('Valid')
    
    #        Valid
    # Group       
    # A          3
    # B          2
    
  • Related