Use case_when in piped workflow over large number of criteria without rowwise-CodePudding

I would like to use a piped workflow to apply dplyr::case_when over a dataset of 10^4s of rows regularly, and would like to avoid rowwise as it is v slow.

I would like to match a large number of criteria to a given values, e.g. any(x = c(1:999, 3000:200000, 250000:250100), where length(x) is 1, and apply it to each row in a data.frame.

Something like this function but with many more criteria:

is_good_car <- function(x){
  any(
    x == c(
      "Mazda RX4",
      "Datsun 710",
      "Valiant"
    )
  )
}

I could apply it like this:

library(dplyr)

mtcars %>%
  mutate(
    car = rownames(.)
  ) %>%
  as_tibble %>%
  mutate(
    good_cars = case_when(
      is_good_car(car) ~ "good",
      TRUE ~ "rubbish"
    )
  ) %>%
  select(car, good_cars)
#> Warning in x == c("Mazda RX4", "Datsun 710", "Valiant"): longer object length is
#> not a multiple of shorter object length
#> # A tibble: 32 x 2
#>    car               good_cars
#>    <chr>             <chr>    
#>  1 Mazda RX4         good     
#>  2 Mazda RX4 Wag     good     
#>  3 Datsun 710        good     
#>  4 Hornet 4 Drive    good     
#>  5 Hornet Sportabout good     
#>  6 Valiant           good     
#>  7 Duster 360        good     
#>  8 Merc 240D         good     
#>  9 Merc 230          good     
#> 10 Merc 280          good     
#> # ... with 22 more rows

But this doesn't work because it just returns a single TRUE from is_good_car and returns this to every row.

I can use rowwise to get the right answer but it's v slow for my purpose:

mtcars %>%
  mutate(
    car = rownames(.)
  ) %>%
  as_tibble %>%
  rowwise %>%
  mutate(
    good_cars = case_when(
      is_good_car(car) ~ "good",
      TRUE ~ "rubbish"
    )
  ) %>%
  select(car, good_cars)
#> # A tibble: 32 x 2
#> # Rowwise: 
#>    car               good_cars
#>    <chr>             <chr>    
#>  1 Mazda RX4         good     
#>  2 Mazda RX4 Wag     rubbish  
#>  3 Datsun 710        good     
#>  4 Hornet 4 Drive    rubbish  
#>  5 Hornet Sportabout rubbish  
#>  6 Valiant           good     
#>  7 Duster 360        rubbish  
#>  8 Merc 240D         rubbish  
#>  9 Merc 230          rubbish  
#> 10 Merc 280          rubbish  
#> # ... with 22 more rows

I could use sapply also but I want to work it into a piped workflow like above:


sapply(
  X = rownames(mtcars),
  FUN = is_good_car
)
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
#>                TRUE               FALSE                TRUE               FALSE 
#>   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
#>               FALSE                TRUE               FALSE               FALSE 
#>            Merc 230            Merc 280           Merc 280C          Merc 450SE 
#>               FALSE               FALSE               FALSE               FALSE 
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
#>               FALSE               FALSE               FALSE               FALSE 
#>   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
#>               FALSE               FALSE               FALSE               FALSE 
#>       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
#>               FALSE               FALSE               FALSE               FALSE 
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
#>               FALSE               FALSE               FALSE               FALSE 
#>      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
#>               FALSE               FALSE               FALSE               FALSE

Are there any options to use function like is_good_car as desired, within case_when without using rowwise?

^{Created on 2021-09-22 by the reprex package (v2.0.1)}

CodePudding user response：

For comparing multiple values you should use %in%.

is_good_car <- function(x){
    x %in% c(
      "Mazda RX4",
      "Datsun 710",
      "Valiant"
  )
}

then you can use it without rowwise -

library(dplyr)

mtcars %>%
  mutate(
    car = rownames(.)
  ) %>%
  as_tibble %>%
  mutate(
    good_cars = case_when(is_good_car(car) ~ "good",
      TRUE ~ "rubbish"
    )
  ) %>%
  select(car, good_cars)

#   car               good_cars
#   <chr>             <chr>    
# 1 Mazda RX4         good     
# 2 Mazda RX4 Wag     rubbish  
# 3 Datsun 710        good     
# 4 Hornet 4 Drive    rubbish  
# 5 Hornet Sportabout rubbish  
# 6 Valiant           good     
# 7 Duster 360        rubbish  
# 8 Merc 240D         rubbish  
# 9 Merc 230          rubbish  
#10 Merc 280          rubbish  
# … with 22 more rows

CodePudding user response：

If you change == to %in% like so:

is_good_car <- function(x){
  any(
    x %in% c(
      "Mazda RX4",
      "Datsun 710",
      "Valiant"
    )
  )
}

The function is now vectorized and will be fast and not need rowwise.

CodePudding user response：

We could use map and it should be faster than rowwise

library(purrr)
library(tibble)
library(dplyr)
mtcars %>%
    rownames_to_column('car') %>%
     mutate(good_cars = case_when(map_lgl(car, is_good_car) ~ 
          'good', TRUE ~ 'rubbish'))  %>%
     as_tibble

-output

# A tibble: 32 x 13
   car                 mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb good_cars
   <chr>             <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>    
 1 Mazda RX4          21       6  160    110  3.9   2.62  16.5     0     1     4     4 good     
 2 Mazda RX4 Wag      21       6  160    110  3.9   2.88  17.0     0     1     4     4 rubbish  
 3 Datsun 710         22.8     4  108     93  3.85  2.32  18.6     1     1     4     1 good     
 4 Hornet 4 Drive     21.4     6  258    110  3.08  3.22  19.4     1     0     3     1 rubbish  
 5 Hornet Sportabout  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2 rubbish  
 6 Valiant            18.1     6  225    105  2.76  3.46  20.2     1     0     3     1 good     
 7 Duster 360         14.3     8  360    245  3.21  3.57  15.8     0     0     3     4 rubbish  
 8 Merc 240D          24.4     4  147.    62  3.69  3.19  20       1     0     4     2 rubbish  
 9 Merc 230           22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2 rubbish  
10 Merc 280           19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4 rubbish  
# … with 22 more rows

Or another option is to use %in% and without any case_when it can be changed

is_good_car <- function(x){
  
    x %in% c(
      "Mazda RX4",
      "Datsun 710",
      "Valiant"
    )
  
}

-testing

mtcars %>%
    rownames_to_column('car') %>%
    mutate(good_cars = c("rubbish", "good")[1   is_good_car(car)])

-output

                 car  mpg cyl  disp  hp drat    wt  qsec vs am gear carb good_cars
1            Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4      good
2        Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4   rubbish
3           Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1      good
4       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   rubbish
5    Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   rubbish
6              Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1      good
7           Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4   rubbish
....

Also, if we need faster approach, use data.table

library(data.table)
mtcars1 <- copy(mtcars)
setDT(mtcars, keep.rownames = TRUE)[, good_cars := 'rubbish'
     ][rn %chin% c(  "Mazda RX4",
      "Datsun 710",
      "Valiant"), good_cars := 'good'][]