I want to adapt some parts of my python code to R and don't know how to solve it there.
I have a large dataset (~5 Mio rows) with results from some ML trainings. Now I want to qualify the results by defining if they hit the "target range". This "target range" contains all values between -0.25
and 0.25
. If it's inside this range, it's a Hit
, if it's below Low
and High
on the other side.
df = data.frame(Type=c("RF", "RF", "RF", "MLP", "MLP", "MLP"),
Value=c(-1.5, -0.1, 1.7, 0.2, -0.7, -0.6))
df
-------- ---------
| Type | Value |
-------- ---------|
| RF | -1.5 | <- Low
| RF | -0.1 | <- Hit
| RF | 1.7 | <- High
| MLP | 0.2 | <- Hit
| MLP | -0.7 | <- Low
| MLP | -0.6 | <- Low
-------- ---------
In python python pandas I could solve it with:
(df.assign(group=pd.cut(df['Value'],
[float('-inf'), -0.25, 0.25, float('inf')],
labels=['Low', 'Hit', 'High']))
.pivot_table(index='Type', columns='group', values='Value', aggfunc='count')
.reset_index()
.rename_axis(None, axis=1)
)
Can somebody help me how I could adapt this to R? Whether native R
, data.table
or dplyr
doesn't matter, I'm open for any approach. Nevertheless as faster as better :-)
Expected output
df_expected = data.frame(Type=c("RF", "MLP"), "Low"=c(1,2), "Hit"=c(1,1), "High"=c(1,0))
-------- ------- ------- --------
| Type | Low | Hit | High |
-------- ------- ------- --------|
| RF | 1 | 1 | 1 |
| MLP | 2 | 1 | 0 |
-------- ------- ------- --------
CodePudding user response:
The base::cut()
command can do this:
df$Cut<-cut(df$Value, breaks = c(-Inf, -0.25, 0.25, Inf), labels = c("Low", "Hit", "High"), )
df
Type Value Cut
1 RF -1.5 Low
2 RF -0.1 Hit
3 RF 1.7 High
4 MLP 0.2 Hit
5 MLP -0.7 Low
6 MLP -0.6 Low
table(df$Type, df$Cut)
Low Hit High
MLP 2 1 0
RF 1 1 1
Edit: same output, made with pivot_wider
df %>%
count(Type, Cut) %>%
pivot_wider(names_from = Cut, values_from = n, values_fill = 0)
Edit 2: the dplyr chain:
df %>%
mutate(Cut = cut(Value,
breaks = c(-Inf, -0.25, 0.25, Inf),
labels = c("Low", "Hit", "High"), )) %>%
count(Type, Cut) %>%
pivot_wider(names_from = Cut,
values_from = n,
values_fill = 0)
CodePudding user response:
Here is a totally different approach using dplyover::over()
with a named recode list. The recoding part is a bit more laborious compared to cut()
but we can skip the data rectangling with pivot_wider()
.
library(dplyr)
library(dplyover) # https://timteafan.github.io/dplyover/
# disclaimer: I'm the maintainer and its not on CRAN
df = data.frame(Type=c("RF", "RF", "RF", "MLP", "MLP", "MLP"),
Value=c(-1.5, -0.1, 1.7, 0.2, -0.7, -0.6))
df %>%
group_by(Type) %>%
summarise(over(list(Low = c(-Inf, -0.25),
Hit = c(-0.25, 0.25),
High = c(0.25, Inf)),
~ sum(Value > .x[1] & Value < .x[2])
)
)
#> # A tibble: 2 × 4
#> Type Low Hit High
#> <chr> <int> <int> <int>
#> 1 MLP 2 1 0
#> 2 RF 1 1 1
Created on 2022-10-15 by the reprex package (v0.3.0)
CodePudding user response:
I see that your data are somewhat big. I would use data.table
for what you need as you'll get results much faster.
This is how I'd construct a script in order to achieve what you want. data.table
's optimised functions fcase
(works like SQL's CASE WHEN) and dcast
are utilised here while I also create the class error
to account for observation that do not fall under your predefined conditions.
library(data.table)
df = data.frame(Type=c("RF", "RF", "RF", "MLP", "MLP", "MLP"),
Value=c(-1.5, -0.1, 1.7, 0.2, -0.7, -0.6))
df = as.data.table(df)
df[, class := fcase(Value >= -0.25 & Value <= 0.25, "Hit",
Value < -0.25, "Low",
Value > 0.25, "High",
default = "error")]
dcast(df, Type ~ class, value.var = "class")
Output:
Type High Hit Low
1: MLP 0 1 2
2: RF 1 1 1