How to categorize sum of score from a numpy array having a minimum threshold

I have a numpy 2D array of 50 patients and 100 score data points.

scores = array([[7.0, 10.0, 12.0, ..., 0.0],
[0.0, 11.0, 34.0, ..., 1.0],
.
.
.
[0.0, 33.0, 34.0, ..., 50.0]])

Each score is a non-negative float value that will be mapped to a category {A, B, C} (which stand for mild, moderate, severe) according to the range conditions {v < 20: 'A', 20 <= v < 50: 'B', 50 <= v: 'C'}. Counting the values that fall in a range can be done using ((25 < a) & (a < 100)).sum(), as in this thread.
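For illustration, the counting idea can be applied per patient by summing boolean masks along axis=1. This is only a sketch: the small `scores` array is made-up example data, and it assumes a score of exactly 50 belongs to 'C' rather than 'B'.

```python
import numpy as np

# Hypothetical example data: 3 patients, 5 scores each (not the real data set).
scores = np.array([[7.0, 10.0, 12.0, 0.0, 55.0],
                   [0.0, 11.0, 34.0, 20.0, 50.0],
                   [60.0, 75.0, 19.9, 49.9, 50.0]])

# One boolean mask per range; summing along axis=1 counts per patient (row).
count_a = (scores < 20).sum(axis=1)
count_b = ((20 <= scores) & (scores < 50)).sum(axis=1)
count_c = (scores >= 50).sum(axis=1)  # assumption: exactly 50 counts as 'C'

# Stack into a (patients, 3) array with columns ordered A, B, C.
counts = np.stack([count_a, count_b, count_c], axis=1)
print(counts)
# [[4 0 1]
#  [2 2 1]
#  [1 1 3]]
```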

Now I need to assign each patient a category, based on the highest severity he scored, provided that the count of data points in that category is >= a certain threshold (say 20%).

For example (taking 20% out of 100 data points as threshold):

if patient i scored 25 data points of severity 'C' -> he is categorized as C (severe)
if patient i scored 15 data points of severity 'C' and 15 data points of severity 'B' -> he is categorized as B (moderate).

Is there a way to do that automatically in numpy?

Thank you in advance.

Update: the expected output should be a 1D array of the same length as the number of patients (50,), in the form categories = ['A', 'C', 'A', .... 'B'], where each value is the overall category of the patient.

CodePudding user response：

mapping the values

You can use numpy.select:

import numpy as np

scores = np.array([[7.0, 10.0, 12.0, 0.0],
                   [0.0, 11.0, 34.0, 55.0],
                   [55.0, 55.0, 0.0, 44.0]])

out = np.select([scores<20, (20<=scores)&(scores<50), 50<=scores],
                ['A', 'B', 'C'])

output:

array([['A', 'A', 'A', 'A'],
       ['A', 'A', 'B', 'C'],
       ['C', 'C', 'A', 'B']], dtype='<U3')

getting the most frequent

Here use numpy.unique:

categories = np.unique(out, axis=1)[:, 0]

output:

array(['A', 'A', 'C'], dtype='<U3')
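One caveat: np.unique with axis=1 returns the sorted unique columns of the array, so taking its first column is not a per-patient count in general, and it does not apply the 20% threshold from the question. A minimal sketch of the thresholded rule follows. It assumes that counting is cumulative, as the question's second example suggests ('C' scores also count toward 'B'). The helper name `categorize_patients` and the `min_count` parameter are made up for illustration.

```python
import numpy as np

def categorize_patients(scores, min_count):
    """Label each row with the most severe category reached by at least
    `min_count` of its scores (hypothetical helper, not from the thread)."""
    scores = np.asarray(scores, dtype=float)
    # Count scores at or above each cut-off, so 'C' scores also count toward 'B'.
    at_least_b = (scores >= 20).sum(axis=1)
    at_least_c = (scores >= 50).sum(axis=1)
    # Start everyone at 'A' and upgrade where the count reaches the threshold.
    labels = np.full(scores.shape[0], 'A')
    labels[at_least_b >= min_count] = 'B'
    labels[at_least_c >= min_count] = 'C'
    return labels

# Made-up patients with 100 data points each, mirroring the question's examples.
example = np.array([
    [60.0] * 25 + [5.0] * 75,                # 25 'C' points
    [60.0] * 15 + [30.0] * 15 + [5.0] * 70,  # 15 'C' and 15 'B' points
    [60.0] * 10 + [30.0] * 5 + [5.0] * 85,   # too few 'B' or 'C' points
])
labels = categorize_patients(example, min_count=20)  # 20% of 100
print(labels)  # ['C' 'B' 'A']
```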

CodePudding user response：

I made it in one step

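import numpy as np

data = get_the_data()  # placeholder: 2D array of shape (patients, data points)
data = np.sort(data, axis=-1)[:, ::-1]  # sort each row in descending order
data_categorized = data[:, 19]  # 20th-highest score per patient (threshold: at least 20% of 100)
# Now I can categorize directly
out = np.select([data_categorized < 20, (20 <= data_categorized) & (data_categorized < 50), 50 <= data_categorized],
                ['A', 'B', 'C'], default='')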

Instead of categorizing each data point and then categorizing the patient as a whole based on the 20% severity threshold, I sorted each patient's scores in descending order and then took the 20th item (out of 100).

Because the scores are sorted in descending order, all the items before the 20th one are of equal or higher severity, so the category of that item is the highest category reached by at least 20 data points.
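The same idea can be sketched without a full sort, since only the k-th largest score per patient is needed. The version below is an illustration under assumptions, not the code above: it swaps in np.partition for the sort and np.digitize for the range mapping, and the helper name `categorize_by_kth_largest` is made up.

```python
import numpy as np

def categorize_by_kth_largest(scores, k):
    """Label each row by the category of its k-th largest score
    (hypothetical helper, not from the thread)."""
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[1]
    # After partitioning, index n - k holds the value it would have in an
    # ascending sort of the row, which is the k-th largest score.
    kth_largest = np.partition(scores, n - k, axis=1)[:, n - k]
    # digitize maps v < 20 -> 0, 20 <= v < 50 -> 1, v >= 50 -> 2.
    return np.array(['A', 'B', 'C'])[np.digitize(kth_largest, [20, 50])]

# Made-up patients with 100 data points each.
example = np.array([
    [60.0] * 25 + [5.0] * 75,                # 20th largest is 60 -> 'C'
    [60.0] * 15 + [30.0] * 15 + [5.0] * 70,  # 20th largest is 30 -> 'B'
    [60.0] * 10 + [30.0] * 5 + [5.0] * 85,   # 20th largest is 5  -> 'A'
])
result = categorize_by_kth_largest(example, k=20)
print(result)  # ['C' 'B' 'A']
```

For k = 20 and 100 data points this picks the same element as index 19 of a descending sort.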