Counting elements that satisfy a criteria in a dataset- Python/pandas-CodePudding

In a dataset, I have four columns named TP1, TP2 etc (stands for Testing Phase 1,2...). The information in these columns can be one of the following 5: Not_tested, Not_finished, too_low, a value between 150 and 190, and nothing.

An element is considered tested when it shows at least once, in any of the 4 columns, the values: too_low or a value between 150 and 190. If neither of these 2 is showed at any point in the row for that element, then it is considered not tested.

I need to report something like:

"Number of tested elements: 43 (and a percentage from all elements)" (which equates to how many rows show too_low or a value between 150 and 190)

"Number of new tested elements: # and %" (so from the elements found above, which have a YES in the NEW column?)

"Number of old tested elements: # and %" (which have a NO instead)

An example of part of the dataset is present here below:

 ======= ========== ===== ====== ============== ========= ============== ===== 
|  ID   | Location | NEW | YEAR |     PT1      |   PT2   |     PT3      | PT4 |
 ======= ========== ===== ====== ============== ========= ============== ===== 
| GF342 | Q1       | YES | 2021 |              |         |              |     |
 ------- ---------- ----- ------ -------------- --------- -------------- ----- 
| IF874 | Q3       | NO  | 2018 | NOT_TESTED   |         | TOO_LOW      |     |
 ------- ---------- ----- ------ -------------- --------- -------------- ----- 
| FH386 | Q1       | NO  | 2019 |              |         |              |     |
 ------- ---------- ----- ------ -------------- --------- -------------- ----- 
| KJ190 | Q3       | YES | 2021 | NOT_FINISHED |         | TOO_LOW      |     |
 ------- ---------- ----- ------ -------------- --------- -------------- ----- 
| TY748 | Q3       | YES | 2021 |              |         |              |     |
 ------- ---------- ----- ------ -------------- --------- -------------- ----- 
| YT947 | Q4       | NO  | 2019 |              | TOO_LOW |              | 165 |
 ------- ---------- ----- ------ -------------- --------- -------------- ----- 
| DF063 | Q3       | NO  | 2019 | 180          |         |              |     |
 ------- ---------- ----- ------ -------------- --------- -------------- ----- 
| ET512 | Q1       | YES | 2021 |              |         |              |     |
 ------- ---------- ----- ------ -------------- --------- -------------- ----- 
| GC714 | Q2       | NO  | 2018 |              | 160     |              |     |
 ------- ---------- ----- ------ -------------- --------- -------------- ----- 
| SD978 | Q3       | NO  | 2019 |              | TOO_LOW |              |     |
 ------- ---------- ----- ------ -------------- --------- -------------- ----- 
| EF472 | Q1       | NO  | 2018 |              |         |              |     |
 ------- ---------- ----- ------ -------------- --------- -------------- ----- 
| PL489 | Q2       | YES | 2021 |              |         | NOT_FINISHED |     |
 ------- ---------- ----- ------ -------------- --------- -------------- ----- 
| AZ315 | Q1       | NO  | 2018 | TOO_LOW      |         |              | 180 |
 ------- ---------- ----- ------ -------------- --------- -------------- ----- 
| OL821 | Q1       | YES | 2021 |              |         | 185          |     |
 ------- ---------- ----- ------ -------------- --------- -------------- ----- 
| HN765 | Q3       | YES | 2021 | 155          |         |              |     |
 ------- ---------- ----- ------ -------------- --------- -------------- ----- 
| ED589 | Q1       | YES | 2021 |              |         | 165          |     |
 ------- ---------- ----- ------ -------------- --------- -------------- -----

My idea to solve this problem was to use .sum() of all rows containing the necessary values, but how do I make a rule that it is only counted if present with these values? Could I maybe add a new column to the dataframe with T or F depending on if it satisfies the rule of having too_low or a number (150-190). I am quite new to all this and not sure how to go about it. I am not necessarily for a straight answer but some guidance on the problem if possible, thank you.

Dataframe:

pd.DataFrame({'ID': {0: 'GF342',  1: 'IF874',  2: 'FH386',  3: 'KJ190',  4: 'TY748',  5: 'YT947',  6: 'DF063',  7: 'ET512',  8: 'GC714',  9: 'SD978',  10: 'EF472',  11: 'PL489',  12: 'AZ315',  13: 'OL821',  14: 'HN765',  15: 'ED589'}, 'Location': {0: 'Q1',  1: 'Q3',  2: 'Q1',  3: 'Q3',  4: 'Q3',  5: 'Q4',  6: 'Q3',  7: 'Q1',  8: 'Q2',  9: 'Q3',  10: 'Q1',  11: 'Q2',  12: 'Q1',  13: 'Q1',  14: 'Q3',  15: 'Q1'}, 'NEW': {0: 'YES',  1: 'NO',  2: 'NO',  3: 'YES',  4: 'YES',  5: 'NO',  6: 'NO',  7: 'YES',  8: 'NO',  9: 'NO',  10: 'NO',  11: 'YES',  12: 'NO',  13: 'YES',  14: 'YES',  15: 'YES'}, 'YEAR': {0: 2021,  1: 2018,  2: 2019,  3: 2021,  4: 2021,  5: 2019,  6: 2019,  7: 2021,  8: 2018,  9: 2019,  10: 2018,  11: 2021,  12: 2018,  13: 2021,  14: 2021,  15: 2021}, 'PT1': {0: '',  1: 'NOT_TESTED',  2: '',  3: 'NOT_FINISHED',  4: '',  5: '',  6: '180',  7: '',  8: '',  9: '',  10: '',  11: '',  12: 'TOO_LOW',  13: '',  14: '155',  15: ''}, 'PT2': {0: '',  1: '',  2: '',  3: '',  4: '',  5: 'TOO_LOW',  6: '',  7: '',  8: '160',  9: 'TOO_LOW',  10: '',  11: '',  12: '',  13: '',  14: '',  15: ''}, 'PT3': {0: '',  1: 'TOO_LOW',  2: '',  3: 'TOO_LOW',  4: '',  5: '',  6: '',  7: '',  8: '',  9: '',  10: '',  11: 'NOT_FINISHED',  12: '',  13: '185',  14: '',  15: '165'}, 'PT4': {0: '',  1: '',  2: '',  3: '',  4: '',  5: 165.0,  6: '',  7: '',  8: '',  9: '',  10: '',  11: '',  12: 180.0,  13: '',  14: '',  15: ''}})

CodePudding user response：

I assume you are working with a DataFrame from the library Pandas. I had a similar problem and iterated through the dataframe. The variable df was a dataframe I extracted from a CSV with the method "read_csv" from Pandas. I don't know how do you get the dataset but it should be a similar format. I hope it works for you or at least you get an approach

tooLow = "TOO_LOW"
elementsCount = 0
index = len(df.columns)

for r in range(len(df.index)):
    for c in range(index):
        columnValue = df.at[r, df.columns[c]]
        if  (150 <= columnValue <= 190) or (columnValue == tooLow):
            elementsCount  = 1

print("Elements satisfying criteria: ", elementsCount)

CodePudding user response：

you can try something like this

df = pd.DataFrame([['NO',  'NOT_TESTED', None, 'TOO_LOW', None],
                  ['YES',  'NOT_FINISHED ', None, 'TOO_LOW', None],
                  ['YES', None, None, None, None],
                  ['NO', 180, None, None, None],
                  ['NO', 'TOO_LOW', None, None, 180],
                  ['YES', 155.6, None, None, None]], columns=['NEW', 'PT1', 'PT2', 'PT3', 'PT4'])

# Number of tested elements: 43 (and a percentage from all elements)"
# (which equates to how many rows show too_low or a value between 150 and 190)
filter_ = df.loc[:, 'PT1':'PT4']\
            .apply(lambda x: x.eq('TOO_LOW') | x.astype(str).str.fullmatch(r'\d \.*\d*'), axis=0)\
            .any(axis=1)

number_of_tested = filter_.sum()


#"Number of new tested elements
new_tested_elements = df[filter_ & df['NEW'].eq('YES')].shape[0]


#"Number of old tested elements
old_tested_elements = df[filter_ & df['NEW'].eq('NO')].shape[0]