Group and create three new columns by condition [Low, Hit, High]-CodePudding

I have a large dataset (~5 Mio rows) with results from a Machine Learning training. Now I want to check to see if the results hit the "target range" or not. Lets say this range contains all values between -0.25 and 0.25. If it's inside this range, it's a Hit, if it's below Low and on the other side High.

I now would create this three columns Hit, Low, High and calculate for each row which condition applies and put a 1 into this col, the other two would become 0. After that I would group the values and sum them up. But I suspect there must be a better and faster way, such as calculate it directly while grouping. I'm happy for any idea.

Data

import pandas as pd

df = pd.DataFrame({"Type":["RF", "RF", "RF", "MLP", "MLP", "MLP"], "Value":[-1.5,-0.1,1.7,0.2,-0.7,-0.6]})

 ---- -------- --------- 
|    | Type   |   Value |
|---- -------- ---------|
|  0 | RF     |    -1.5 | <- Low
|  1 | RF     |    -0.1 | <- Hit
|  2 | RF     |     1.7 | <- High
|  3 | MLP    |     0.2 | <- Hit
|  4 | MLP    |    -0.7 | <- Low
|  5 | MLP    |    -0.6 | <- Low
 ---- -------- ---------

Expected Output

pd.DataFrame({"Type":["RF", "MLP"], "Low":[1,2], "Hit":[1,1], "High":[1,0]})

 ---- -------- ------- ------- -------- 
|    | Type   |   Low |   Hit |   High |
|---- -------- ------- ------- --------|
|  0 | RF     |     1 |     1 |      1 |
|  1 | MLP    |     2 |     1 |      0 |
 ---- -------- ------- ------- --------

CodePudding user response：

You could use cut to define the groups and pivot_table to reshape:

(df.assign(group=pd.cut(df['Value'],
                        [float('-inf'), -0.25, 0.25, float('inf')],
                        labels=['Low', 'Hit', 'High']))
   .pivot_table(index='Type', columns='group', values='Value', aggfunc='count')
   .reset_index()
   .rename_axis(None, axis=1)
)

Or crosstab:

(pd.crosstab(df['Type'],
             pd.cut(df['Value'],
                    [float('-inf'), -0.25, 0.25, float('inf')],
                    labels=['Low', 'Hit', 'High'])
             )
   .reset_index().rename_axis(None, axis=1)
 )

output:

  Type  Low  Hit  High
0  MLP    2    1     0
1   RF    1    1     1

CodePudding user response：

You can assign it with np.select then crosstab

c1 = df.Value<=-0.25
c2 = df.Value>=0.25
out = pd.crosstab(df['Type'], np.select([c1,c2], ['Low','High'], default='Hit'))
out
Out[32]: 
col_0  High  Hit  Low
Type                 
MLP       0    1    2
RF        1    1    1

CodePudding user response：

you can try this:

# Your code
import pandas as pd

df = pd.DataFrame({"Type":["RF", "RF", "RF", "MLP", "MLP", "MLP"], "Value":[-1.5,-0.1,1.7,0.2,-0.7,-0.6]})

# Set your range 
RANGE_MIN = -0.25
RANGE_MAX = 0.25

# --- define functions to be applied to df ---
# evaluate if value is a low
def eval_low(value):
    if value < RANGE_MIN:
        return 1
    else:
        return 0

# evaluate if value is a high
def eval_high(value):
    if value > RANGE_MAX:
        return 1
    else:
        return 0

# evaluate if value is a hit
def eval_hit(value):
    if value >= RANGE_MIN and value <= RANGE_MAX:
        return 1
    else:
        return 0

# Evaluate the functions in new columns
df['Low'] = df.Value.apply(eval_low)
df['Hit'] = df.Value.apply(eval_hit)
df['High'] = df.Value.apply(eval_high)

# get the summary
df.groupby('Type').sum()