I would like to use a piped workflow to apply dplyr::case_when
over a dataset of 10^4s of rows regularly, and would like to avoid rowwise
as it is v slow.
I would like to match a large number of criteria to a given values, e.g. any(x = c(1:999, 3000:200000, 250000:250100)
, where length(x)
is 1, and apply it to each row in a data.frame
.
Something like this function but with many more criteria:
is_good_car <- function(x){
any(
x == c(
"Mazda RX4",
"Datsun 710",
"Valiant"
)
)
}
I could apply it like this:
library(dplyr)
mtcars %>%
mutate(
car = rownames(.)
) %>%
as_tibble %>%
mutate(
good_cars = case_when(
is_good_car(car) ~ "good",
TRUE ~ "rubbish"
)
) %>%
select(car, good_cars)
#> Warning in x == c("Mazda RX4", "Datsun 710", "Valiant"): longer object length is
#> not a multiple of shorter object length
#> # A tibble: 32 x 2
#> car good_cars
#> <chr> <chr>
#> 1 Mazda RX4 good
#> 2 Mazda RX4 Wag good
#> 3 Datsun 710 good
#> 4 Hornet 4 Drive good
#> 5 Hornet Sportabout good
#> 6 Valiant good
#> 7 Duster 360 good
#> 8 Merc 240D good
#> 9 Merc 230 good
#> 10 Merc 280 good
#> # ... with 22 more rows
But this doesn't work because it just returns a single TRUE
from is_good_car
and returns this to every row.
I can use rowwise
to get the right answer but it's v slow for my purpose:
mtcars %>%
mutate(
car = rownames(.)
) %>%
as_tibble %>%
rowwise %>%
mutate(
good_cars = case_when(
is_good_car(car) ~ "good",
TRUE ~ "rubbish"
)
) %>%
select(car, good_cars)
#> # A tibble: 32 x 2
#> # Rowwise:
#> car good_cars
#> <chr> <chr>
#> 1 Mazda RX4 good
#> 2 Mazda RX4 Wag rubbish
#> 3 Datsun 710 good
#> 4 Hornet 4 Drive rubbish
#> 5 Hornet Sportabout rubbish
#> 6 Valiant good
#> 7 Duster 360 rubbish
#> 8 Merc 240D rubbish
#> 9 Merc 230 rubbish
#> 10 Merc 280 rubbish
#> # ... with 22 more rows
I could use sapply
also but I want to work it into a piped workflow like above:
sapply(
X = rownames(mtcars),
FUN = is_good_car
)
#> Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
#> TRUE FALSE TRUE FALSE
#> Hornet Sportabout Valiant Duster 360 Merc 240D
#> FALSE TRUE FALSE FALSE
#> Merc 230 Merc 280 Merc 280C Merc 450SE
#> FALSE FALSE FALSE FALSE
#> Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
#> FALSE FALSE FALSE FALSE
#> Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
#> FALSE FALSE FALSE FALSE
#> Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
#> FALSE FALSE FALSE FALSE
#> Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
#> FALSE FALSE FALSE FALSE
#> Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
#> FALSE FALSE FALSE FALSE
Are there any options to use function like is_good_car
as desired, within case_when
without using rowwise
?
Created on 2021-09-22 by the reprex package (v2.0.1)
CodePudding user response:
For comparing multiple values you should use %in%
.
is_good_car <- function(x){
x %in% c(
"Mazda RX4",
"Datsun 710",
"Valiant"
)
}
then you can use it without rowwise
-
library(dplyr)
mtcars %>%
mutate(
car = rownames(.)
) %>%
as_tibble %>%
mutate(
good_cars = case_when(is_good_car(car) ~ "good",
TRUE ~ "rubbish"
)
) %>%
select(car, good_cars)
# car good_cars
# <chr> <chr>
# 1 Mazda RX4 good
# 2 Mazda RX4 Wag rubbish
# 3 Datsun 710 good
# 4 Hornet 4 Drive rubbish
# 5 Hornet Sportabout rubbish
# 6 Valiant good
# 7 Duster 360 rubbish
# 8 Merc 240D rubbish
# 9 Merc 230 rubbish
#10 Merc 280 rubbish
# … with 22 more rows
CodePudding user response:
If you change ==
to %in%
like so:
is_good_car <- function(x){
any(
x %in% c(
"Mazda RX4",
"Datsun 710",
"Valiant"
)
)
}
The function is now vectorized and will be fast and not need rowwise
.
CodePudding user response:
We could use map
and it should be faster than rowwise
library(purrr)
library(tibble)
library(dplyr)
mtcars %>%
rownames_to_column('car') %>%
mutate(good_cars = case_when(map_lgl(car, is_good_car) ~
'good', TRUE ~ 'rubbish')) %>%
as_tibble
-output
# A tibble: 32 x 13
car mpg cyl disp hp drat wt qsec vs am gear carb good_cars
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4 good
2 Mazda RX4 Wag 21 6 160 110 3.9 2.88 17.0 0 1 4 4 rubbish
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 good
4 Hornet 4 Drive 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 rubbish
5 Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 rubbish
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 good
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 rubbish
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 rubbish
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 rubbish
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 rubbish
# … with 22 more rows
Or another option is to use %in%
and without any case_when
it can be changed
is_good_car <- function(x){
x %in% c(
"Mazda RX4",
"Datsun 710",
"Valiant"
)
}
-testing
mtcars %>%
rownames_to_column('car') %>%
mutate(good_cars = c("rubbish", "good")[1 is_good_car(car)])
-output
car mpg cyl disp hp drat wt qsec vs am gear carb good_cars
1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 good
2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 rubbish
3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 good
4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 rubbish
5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 rubbish
6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 good
7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 rubbish
....
Also, if we need faster approach, use data.table
library(data.table)
mtcars1 <- copy(mtcars)
setDT(mtcars, keep.rownames = TRUE)[, good_cars := 'rubbish'
][rn %chin% c( "Mazda RX4",
"Datsun 710",
"Valiant"), good_cars := 'good'][]