I have a large dataset (~5 Mio rows) with results from a Machine Learning training. Now I want to check to see if the results hit the "target range" or not. Lets say this range contains all values between -0.25
and 0.25
. If it's inside this range, it's a Hit
, if it's below Low
and on the other side High
.
I now would create this three columns Hit, Low, High and calculate for each row which condition applies and put a 1
into this col, the other two would become 0
. After that I would group the values and sum them up. But I suspect there must be a better and faster way, such as calculate it directly while grouping. I'm happy for any idea.
Data
import pandas as pd
df = pd.DataFrame({"Type":["RF", "RF", "RF", "MLP", "MLP", "MLP"], "Value":[-1.5,-0.1,1.7,0.2,-0.7,-0.6]})
---- -------- ---------
| | Type | Value |
|---- -------- ---------|
| 0 | RF | -1.5 | <- Low
| 1 | RF | -0.1 | <- Hit
| 2 | RF | 1.7 | <- High
| 3 | MLP | 0.2 | <- Hit
| 4 | MLP | -0.7 | <- Low
| 5 | MLP | -0.6 | <- Low
---- -------- ---------
Expected Output
pd.DataFrame({"Type":["RF", "MLP"], "Low":[1,2], "Hit":[1,1], "High":[1,0]})
---- -------- ------- ------- --------
| | Type | Low | Hit | High |
|---- -------- ------- ------- --------|
| 0 | RF | 1 | 1 | 1 |
| 1 | MLP | 2 | 1 | 0 |
---- -------- ------- ------- --------
CodePudding user response:
You could use cut
to define the groups and pivot_table
to reshape:
(df.assign(group=pd.cut(df['Value'],
[float('-inf'), -0.25, 0.25, float('inf')],
labels=['Low', 'Hit', 'High']))
.pivot_table(index='Type', columns='group', values='Value', aggfunc='count')
.reset_index()
.rename_axis(None, axis=1)
)
Or crosstab
:
(pd.crosstab(df['Type'],
pd.cut(df['Value'],
[float('-inf'), -0.25, 0.25, float('inf')],
labels=['Low', 'Hit', 'High'])
)
.reset_index().rename_axis(None, axis=1)
)
output:
Type Low Hit High
0 MLP 2 1 0
1 RF 1 1 1
CodePudding user response:
You can assign
it with np.select
then crosstab
c1 = df.Value<=-0.25
c2 = df.Value>=0.25
out = pd.crosstab(df['Type'], np.select([c1,c2], ['Low','High'], default='Hit'))
out
Out[32]:
col_0 High Hit Low
Type
MLP 0 1 2
RF 1 1 1
CodePudding user response:
you can try this:
# Your code
import pandas as pd
df = pd.DataFrame({"Type":["RF", "RF", "RF", "MLP", "MLP", "MLP"], "Value":[-1.5,-0.1,1.7,0.2,-0.7,-0.6]})
# Set your range
RANGE_MIN = -0.25
RANGE_MAX = 0.25
# --- define functions to be applied to df ---
# evaluate if value is a low
def eval_low(value):
if value < RANGE_MIN:
return 1
else:
return 0
# evaluate if value is a high
def eval_high(value):
if value > RANGE_MAX:
return 1
else:
return 0
# evaluate if value is a hit
def eval_hit(value):
if value >= RANGE_MIN and value <= RANGE_MAX:
return 1
else:
return 0
# Evaluate the functions in new columns
df['Low'] = df.Value.apply(eval_low)
df['Hit'] = df.Value.apply(eval_hit)
df['High'] = df.Value.apply(eval_high)
# get the summary
df.groupby('Type').sum()