Home > Blockchain >  Use case_when in piped workflow over large number of criteria without rowwise
Use case_when in piped workflow over large number of criteria without rowwise

Time:09-22

I would like to use a piped workflow to apply dplyr::case_when over a dataset of 10^4s of rows regularly, and would like to avoid rowwise as it is v slow.

I would like to match a large number of criteria to a given values, e.g. any(x = c(1:999, 3000:200000, 250000:250100), where length(x) is 1, and apply it to each row in a data.frame.

Something like this function but with many more criteria:

is_good_car <- function(x){
  any(
    x == c(
      "Mazda RX4",
      "Datsun 710",
      "Valiant"
    )
  )
}

I could apply it like this:

library(dplyr)

mtcars %>%
  mutate(
    car = rownames(.)
  ) %>%
  as_tibble %>%
  mutate(
    good_cars = case_when(
      is_good_car(car) ~ "good",
      TRUE ~ "rubbish"
    )
  ) %>%
  select(car, good_cars)
#> Warning in x == c("Mazda RX4", "Datsun 710", "Valiant"): longer object length is
#> not a multiple of shorter object length
#> # A tibble: 32 x 2
#>    car               good_cars
#>    <chr>             <chr>    
#>  1 Mazda RX4         good     
#>  2 Mazda RX4 Wag     good     
#>  3 Datsun 710        good     
#>  4 Hornet 4 Drive    good     
#>  5 Hornet Sportabout good     
#>  6 Valiant           good     
#>  7 Duster 360        good     
#>  8 Merc 240D         good     
#>  9 Merc 230          good     
#> 10 Merc 280          good     
#> # ... with 22 more rows

But this doesn't work because it just returns a single TRUE from is_good_car and returns this to every row.

I can use rowwise to get the right answer but it's v slow for my purpose:

mtcars %>%
  mutate(
    car = rownames(.)
  ) %>%
  as_tibble %>%
  rowwise %>%
  mutate(
    good_cars = case_when(
      is_good_car(car) ~ "good",
      TRUE ~ "rubbish"
    )
  ) %>%
  select(car, good_cars)
#> # A tibble: 32 x 2
#> # Rowwise: 
#>    car               good_cars
#>    <chr>             <chr>    
#>  1 Mazda RX4         good     
#>  2 Mazda RX4 Wag     rubbish  
#>  3 Datsun 710        good     
#>  4 Hornet 4 Drive    rubbish  
#>  5 Hornet Sportabout rubbish  
#>  6 Valiant           good     
#>  7 Duster 360        rubbish  
#>  8 Merc 240D         rubbish  
#>  9 Merc 230          rubbish  
#> 10 Merc 280          rubbish  
#> # ... with 22 more rows

I could use sapply also but I want to work it into a piped workflow like above:


sapply(
  X = rownames(mtcars),
  FUN = is_good_car
)
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
#>                TRUE               FALSE                TRUE               FALSE 
#>   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
#>               FALSE                TRUE               FALSE               FALSE 
#>            Merc 230            Merc 280           Merc 280C          Merc 450SE 
#>               FALSE               FALSE               FALSE               FALSE 
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
#>               FALSE               FALSE               FALSE               FALSE 
#>   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
#>               FALSE               FALSE               FALSE               FALSE 
#>       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
#>               FALSE               FALSE               FALSE               FALSE 
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
#>               FALSE               FALSE               FALSE               FALSE 
#>      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
#>               FALSE               FALSE               FALSE               FALSE

Are there any options to use function like is_good_car as desired, within case_when without using rowwise?

Created on 2021-09-22 by the reprex package (v2.0.1)

CodePudding user response:

For comparing multiple values you should use %in%.

is_good_car <- function(x){
    x %in% c(
      "Mazda RX4",
      "Datsun 710",
      "Valiant"
  )
}

then you can use it without rowwise -

library(dplyr)

mtcars %>%
  mutate(
    car = rownames(.)
  ) %>%
  as_tibble %>%
  mutate(
    good_cars = case_when(is_good_car(car) ~ "good",
      TRUE ~ "rubbish"
    )
  ) %>%
  select(car, good_cars)

#   car               good_cars
#   <chr>             <chr>    
# 1 Mazda RX4         good     
# 2 Mazda RX4 Wag     rubbish  
# 3 Datsun 710        good     
# 4 Hornet 4 Drive    rubbish  
# 5 Hornet Sportabout rubbish  
# 6 Valiant           good     
# 7 Duster 360        rubbish  
# 8 Merc 240D         rubbish  
# 9 Merc 230          rubbish  
#10 Merc 280          rubbish  
# … with 22 more rows

CodePudding user response:

If you change == to %in% like so:

is_good_car <- function(x){
  any(
    x %in% c(
      "Mazda RX4",
      "Datsun 710",
      "Valiant"
    )
  )
}

The function is now vectorized and will be fast and not need rowwise.

CodePudding user response:

We could use map and it should be faster than rowwise

library(purrr)
library(tibble)
library(dplyr)
mtcars %>%
    rownames_to_column('car') %>%
     mutate(good_cars = case_when(map_lgl(car, is_good_car) ~ 
          'good', TRUE ~ 'rubbish'))  %>%
     as_tibble

-output

# A tibble: 32 x 13
   car                 mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb good_cars
   <chr>             <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>    
 1 Mazda RX4          21       6  160    110  3.9   2.62  16.5     0     1     4     4 good     
 2 Mazda RX4 Wag      21       6  160    110  3.9   2.88  17.0     0     1     4     4 rubbish  
 3 Datsun 710         22.8     4  108     93  3.85  2.32  18.6     1     1     4     1 good     
 4 Hornet 4 Drive     21.4     6  258    110  3.08  3.22  19.4     1     0     3     1 rubbish  
 5 Hornet Sportabout  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2 rubbish  
 6 Valiant            18.1     6  225    105  2.76  3.46  20.2     1     0     3     1 good     
 7 Duster 360         14.3     8  360    245  3.21  3.57  15.8     0     0     3     4 rubbish  
 8 Merc 240D          24.4     4  147.    62  3.69  3.19  20       1     0     4     2 rubbish  
 9 Merc 230           22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2 rubbish  
10 Merc 280           19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4 rubbish  
# … with 22 more rows

Or another option is to use %in% and without any case_when it can be changed

is_good_car <- function(x){
  
    x %in% c(
      "Mazda RX4",
      "Datsun 710",
      "Valiant"
    )
  
}

-testing

mtcars %>%
    rownames_to_column('car') %>%
    mutate(good_cars = c("rubbish", "good")[1   is_good_car(car)])

-output

                 car  mpg cyl  disp  hp drat    wt  qsec vs am gear carb good_cars
1            Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4      good
2        Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4   rubbish
3           Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1      good
4       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   rubbish
5    Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   rubbish
6              Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1      good
7           Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4   rubbish
....

Also, if we need faster approach, use data.table

library(data.table)
mtcars1 <- copy(mtcars)
setDT(mtcars, keep.rownames = TRUE)[, good_cars := 'rubbish'
     ][rn %chin% c(  "Mazda RX4",
      "Datsun 710",
      "Valiant"), good_cars := 'good'][]
  • Related