Group and create three new columns by condition [Low, Hit, High] (pandas equivalent in R)-CodePudding

I want to adapt some parts of my python code to R and don't know how to solve it there.

I have a large dataset (~5 Mio rows) with results from some ML trainings. Now I want to qualify the results by defining if they hit the "target range". This "target range" contains all values between -0.25 and 0.25. If it's inside this range, it's a Hit, if it's below Low and High on the other side.

df = data.frame(Type=c("RF", "RF", "RF", "MLP", "MLP", "MLP"),
               Value=c(-1.5, -0.1, 1.7, 0.2, -0.7, -0.6))
                
df

 -------- --------- 
| Type   |   Value |
 -------- ---------|
| RF     |    -1.5 | <- Low
| RF     |    -0.1 | <- Hit
| RF     |     1.7 | <- High
| MLP    |     0.2 | <- Hit
| MLP    |    -0.7 | <- Low
| MLP    |    -0.6 | <- Low
 -------- ---------

In python python pandas I could solve it with:

(df.assign(group=pd.cut(df['Value'],
               [float('-inf'), -0.25, 0.25, float('inf')],
                        labels=['Low', 'Hit', 'High']))
   .pivot_table(index='Type', columns='group', values='Value', aggfunc='count')
   .reset_index()
   .rename_axis(None, axis=1)
)

Can somebody help me how I could adapt this to R? Whether native R, data.table or dplyr doesn't matter, I'm open for any approach. Nevertheless as faster as better :-)

Expected output

df_expected = data.frame(Type=c("RF", "MLP"), "Low"=c(1,2), "Hit"=c(1,1), "High"=c(1,0))

 -------- ------- ------- -------- 
| Type   |   Low |   Hit |   High |
 -------- ------- ------- --------|
| RF     |     1 |     1 |      1 |
| MLP    |     2 |     1 |      0 |
 -------- ------- ------- --------

CodePudding user response：

The base::cut() command can do this:

 df$Cut<-cut(df$Value, breaks = c(-Inf, -0.25, 0.25, Inf), labels = c("Low", "Hit", "High"), )

 df
  Type Value  Cut
1   RF  -1.5  Low
2   RF  -0.1  Hit
3   RF   1.7 High
4  MLP   0.2  Hit
5  MLP  -0.7  Low
6  MLP  -0.6  Low

 table(df$Type, df$Cut)
     
      Low Hit High
  MLP   2   1    0
  RF    1   1    1

Edit: same output, made with pivot_wider

df %>% 
  count(Type, Cut) %>% 
  pivot_wider(names_from = Cut, values_from = n, values_fill = 0)

Edit 2: the dplyr chain:

df %>% 
  mutate(Cut = cut(Value, 
                   breaks = c(-Inf, -0.25, 0.25, Inf), 
                   labels = c("Low", "Hit", "High"), )) %>% 
  count(Type, Cut) %>% 
  pivot_wider(names_from = Cut, 
              values_from = n,
              values_fill = 0)

CodePudding user response：

Here is a totally different approach using dplyover::over() with a named recode list. The recoding part is a bit more laborious compared to cut() but we can skip the data rectangling with pivot_wider().

library(dplyr)
library(dplyover) # https://timteafan.github.io/dplyover/
# disclaimer: I'm the maintainer and its not on CRAN

df = data.frame(Type=c("RF", "RF", "RF", "MLP", "MLP", "MLP"),
                Value=c(-1.5, -0.1, 1.7, 0.2, -0.7, -0.6))

df %>% 
  group_by(Type) %>% 
  summarise(over(list(Low  = c(-Inf, -0.25), 
                      Hit  = c(-0.25, 0.25),
                      High = c(0.25, Inf)),
                 ~ sum(Value > .x[1] & Value < .x[2])
                 )
  )
#> # A tibble: 2 × 4
#>   Type    Low   Hit  High
#>   <chr> <int> <int> <int>
#> 1 MLP       2     1     0
#> 2 RF        1     1     1

^{Created on 2022-10-15 by the reprex package (v0.3.0)}

CodePudding user response：

I see that your data are somewhat big. I would use data.table for what you need as you'll get results much faster.

This is how I'd construct a script in order to achieve what you want. data.table's optimised functions fcase (works like SQL's CASE WHEN) and dcast are utilised here while I also create the class error to account for observation that do not fall under your predefined conditions.

library(data.table)

df = data.frame(Type=c("RF", "RF", "RF", "MLP", "MLP", "MLP"),
                Value=c(-1.5, -0.1, 1.7, 0.2, -0.7, -0.6))

df = as.data.table(df)

df[, class := fcase(Value >= -0.25 & Value <= 0.25, "Hit",
                    Value < -0.25, "Low",
                    Value > 0.25, "High",
                    default = "error")]

dcast(df, Type ~ class, value.var = "class")

Output:

   Type High Hit Low
1:  MLP    0   1   2
2:   RF    1   1   1